rvlenth / emmeans

Estimated marginal means
https://rvlenth.github.io/emmeans/
364 stars 32 forks source link

emtrends odd behavior with nuisance variables #448

Closed jnugentkp closed 1 year ago

jnugentkp commented 1 year ago

Describe the bug

Hi Russ - thanks for your work on this great package. I have recently been using the emmeans() function on categorical variables in a model with 50+ variables and 250K+ observations, so employing the "nuisance" argument to simplify the grid has been important to that workflow. I recently tried the same approach using the emtrends() function with a continuous variable. It works as expected when I don't include any nuisance variables, but when including them, the output doesn't make sense to me (has only one group when it should have two; has an odd trend estimate). See example below. I can't tell if I'm misunderstanding how emtrends() handles nuisance arguments, or whether it's a bug. Thanks! Josh Nugent Kaiser Permanente Division of Research

To reproduce

library(emmeans)

set.seed(1)
n <- 5000
A <- rnorm(n)
x1 <- rnorm(n)
x2 <- rbinom(n, size = 2, prob = .25)
group <- rbinom(n = n, size = 1, prob = .5)

slope0 <- .25
slope1 <- .5
noise <- rnorm(n)

y0 <- slope0*A + group + noise
y1 <- slope1*A + group + noise
y <- ifelse(group, y1, y0)

group_fact <- as.factor(group)

mod <- lm(y ~ group_fact*A + x1 + x2)
summary(mod)

# Nuisance parameters that happen to be noise as well
W <- c("x1", "x2")

# Works in simulated data without specifying nuisance set,
# but not with large data set with many nuisance variables...
emtrends(object = mod, 
         specs = ~ group_fact*A,
         var = "A")

# These are the correct estimates of the group-level slopes
# group_fact        A A.trend     SE   df lower.CL upper.CL
# 0          -0.00319   0.243 0.0193 4994    0.205    0.281
# 1          -0.00319   0.490 0.0197 4994    0.451    0.529

# Does not work as expected
emtrends(object = mod, 
                specs = ~ group_fact*A,
                var = "A", nuisance = W)

# group_fact        A A.trend   SE   df lower.CL upper.CL
# 0          -0.00319     135 3.79 4994      128      143
# Results are averaged over the levels of: 2 nuisance factors 
# Confidence level used: 0.95

Expected behavior

I expect that the slopes should be around .25 for the "0" group and .5 for the "1" group, as they are when I exclude the nuisance argument.

rvlenth commented 1 year ago

Thanks for reporting this. Clearly there's a problem, and I'll look into it.

PS --- It's a little confusing to include A in the specs, since its value isn't material to the estimates. We get less confusing output by leaving it out:

> emtrends(object = mod, 
+          specs = ~ group_fact,
+          var = "A")
 group_fact A.trend     SE   df lower.CL upper.CL
 0            0.243 0.0193 4994    0.205    0.281
 1            0.490 0.0197 4994    0.451    0.529

Confidence level used: 0.95 
rvlenth commented 1 year ago

In the meantime, it is possible to do it by manually computing the difference quotients, even with nuisance factors included:

> rg = ref_grid(mod, at = list(A = c(-.5,.5)), nuis = W)
> confint(contrast(emmeans(rg, ~A|group_fact), "consec"), by = NULL)
 contrast       group_fact estimate     SE   df lower.CL upper.CL
 A0.5 - (A-0.5) 0             0.243 0.0193 4994    0.200    0.286
 A0.5 - (A-0.5) 1             0.490 0.0197 4994    0.446    0.534

Results are averaged over the levels of: 2 nuisance factors 
Confidence level used: 0.95 
Conf-level adjustment: mvt method for 2 estimates 
jnugentkp commented 1 year ago

Thanks for that workaround - very helpful. I'll use that in the meantime and stay tuned to see if there's an update at some point. Many thanks-

rvlenth commented 1 year ago

It was a bookkeeping error. I made a fix to the code for nuisance factors so that the grid slot is consistent with the linfct slot. That alone actually fixes it, but in the meantime I had made a fix to emtrends so it counts the rows in linfct instead of grid. That also fixed the bug.

> emtrends(mod, "group_fact", "A", nuis = W)
 group_fact A.trend     SE   df lower.CL upper.CL
 0            0.243 0.0193 4994    0.205    0.281
 1            0.490 0.0197 4994    0.451    0.529

Results are averaged over the levels of: 2 nuisance factors 
Confidence level used: 0.95 
rvlenth commented 1 year ago

PS you can install it from GitHub (see how on the main code page) and it should work

jnugentkp commented 1 year ago

Great, thanks so much! I figured it was a something small. Glad to hear it was an easy fix. I'll use the GitHub installation method and move forward. Have a great weekend.