rvlenth / emmeans

Estimated marginal means
https://rvlenth.github.io/emmeans/
358 stars 31 forks source link

Using at=list() to specify values for a continuous predictor messes up averaging across other variables #401

Closed rpmady closed 1 year ago

rpmady commented 1 year ago

Describe the bug

When I calculate emmeans using at=list(), results are averaged across one other predictor instead of all the predictors.

To reproduce

I have run a glm model using the glmmTMB package with five predictors. Two are binary categorical variables (factor1, factor2), one is a proportion (prop) , and the other two are continuous (covariate1, covariate2).

In the model I have the interaction term categorical1*continuous1, and I want to calculate the emmeans, so I used at=list().

emmeans(model, ~ factor1* covariate1, at=list(covariate1=c(1,2,3,4,5)), type = "response") The output shows me emmeanas with "Results are averaged over the levels of: factor2"

Expected behavior

I expected the emmean results to be averaged over the rest of the predictor variables. I could only get the emmeans to be averaged over the other predictor variables by using the following code:

emmeans(model, ~ factor1* covariate1, cov.reduce = range, type = "response") The output showed me emmeans with "Results are averaged over factor2, prop, covariate2

From the documentation and vignettes, I expected the list to just allow me to get emmeans at multiple values for the continuous variable, and not exclude averaging across the other variables.

rvlenth commented 1 year ago

Please take a look at a similar example:

require(emmeans)
## Loading required package: emmeans

# This example uses the 'mtcars' dataset provided with R
mod <- lm(mpg ~ factor(cyl) * disp, data = mtcars)

emmeans(mod, ~ disp * cyl, at = list(disp = c(50, 100, 200, 400)))
##  disp cyl emmean    SE df lower.CL upper.CL
##    50   4   34.1 1.697 26    30.63    37.60
##   100   4   27.4 0.729 26    25.86    28.86
##   200   4   13.8 2.742 26     8.21    19.48
##   400   4  -13.2 8.260 26   -30.16     3.79
##    50   6   19.3 3.232 26    12.62    25.91
##   100   6   19.4 2.138 26    15.05    23.84
##   200   6   19.8 0.977 26    17.79    21.81
##   400   6   20.5 5.127 26     9.99    31.06
##    50   8   21.1 3.009 26    14.87    27.24
##   100   8   20.1 2.537 26    14.85    25.28
##   200   8   18.1 1.615 26    14.79    21.43
##   400   8   14.2 0.780 26    12.58    15.78
## 
## Confidence level used: 0.95

emmeans(mod, "cyl", at = list(disp = c(50, 100, 200, 400)))
## NOTE: Results may be misleading due to involvement in interactions
##  cyl emmean    SE df lower.CL upper.CL
##    4   15.5 2.407 26     10.6     20.5
##    6   19.8 0.902 26     17.9     21.6
##    8   18.4 1.728 26     14.8     21.9
## 
## Results are averaged over the levels of: disp 
## Confidence level used: 0.95

Created on 2023-02-08 with reprex v2.0.2

The first emmeans() call is like yours, in that it asks for combinations of the factor levels and the specified covariate values. Note that there is indeed no annotation that means were averaged over that covariate, precisely because they weren't. All the combinations of those two variables are shown, not averaged-over.

The second emmeans() call just specifies the one factor, and the output does average over the four disp values provided, and it says so in the annotations below the output.

Maybe you meant to set levels for covariate2 instead of/as well? Anyway, I don't see a bug here.

rvlenth commented 1 year ago

Oh, another discrepancy is your cov.reduce = range. That affects both covariates. So your emmeans() call with at but not cov.reduce will set covariate2 at its mean (one value), rather than its min and max (2 values). Based on this, the comparison case would be

emmeans(model, ~ factor1* covariate1, at=list(covariate1=c(1,2,3,4,5)), cov.reduce = range, type = "response")
rpmady commented 1 year ago

Thanks for demonstrating that with the mtcars dataset. I'll clarify what I meant by also using the mtcars dataset.

I modified the model in the example you provided: mod <- lm(mpg ~ factor(cyl) * disp + hp + drat, data = mtcars)

When I calculate the emmeans at specific values of disp, I expected to see in the output that it was averaged over the other predictor variables (hp and drat) but I don't:

emmeans(mod, ~ disp * cyl, at = list(disp = c(50, 100, 200, 400)))
## cyl emmean    SE df lower.CL upper.CL
##  4   15.0 2.446 24     9.93     20.0
##  6   19.4 0.954 24    17.39     21.3
##   8   19.5 1.927 24    15.54     23.5
##Results are averaged over the levels of: disp 
##Confidence level used: 0.95 

Are the other variables hp and drat being set at their mean and it is just not shown?

rvlenth commented 1 year ago

hp and drat are not in the model, and play no role whatsoever in the analysis.

rpmady commented 1 year ago

Apologies, I'm not following. I did put hp and drat in the example I just posted. mod <- lm(mpg ~ factor(cyl) * disp + hp + drat, data = mtcars)

When calculating emmeans, do you need to also include them like this: emmeans(mod, ~ disp * cyl + hp + drat, at = list(disp = c(50, 100, 200, 400)))

rvlenth commented 1 year ago

Sorry, I was trying to answer too quickly. What emmeans() does is based on the reference grid:

> mod <- lm(mpg ~ factor(cyl) * disp + hp + drat, data = mtcars) 
> ref_grid(mod, at = list(disp = c(50, 100, 200, 400)))
'emmGrid' object with variables:
    cyl = 4, 6, 8
    disp =  50, 100, 200, 400
    hp = 146.69
    drat = 3.5966

When you have covariates, they are by default reduced to their average. Hence only one value of those covariates, hence they are not "averaged over." If you reduce them to two or more values, e.g.

> ref_grid(mod, at = list(disp = c(50, 100, 200, 400)), cov.reduce = range)
'emmGrid' object with variables:
    cyl = 4, 6, 8
    disp =  50, 100, 200, 400
    hp =  52, 335
    drat = 2.76, 4.93

then they do show up in the annotations of what is averaged over

> emmeans(mod, "cyl", at = list(disp = c(50, 100, 200, 400)), cov.reduce = range)
NOTE: Results may be misleading due to involvement in interactions
 cyl emmean   SE df lower.CL upper.CL
   4   14.4 2.60 24     9.05     19.8
   6   18.8 1.25 24    16.21     21.4
   8   19.0 1.89 24    15.06     22.9

Results are averaged over the levels of: disp, hp, drat 
Confidence level used: 0.95 
rpmady commented 1 year ago

Okay, I follow 😊 Thank you for clarifying and for your patience (especially since this was not a bug afterall, and my own misunderstanding).