rvlenth / emmeans

Estimated marginal means
https://rvlenth.github.io/emmeans/
364 stars 32 forks source link

Wrong weights with nuisance factors #510

Closed vaever closed 1 month ago

vaever commented 1 month ago

Issue

Nuisance factors seem to be averaged incorrectly, when non-equal weights are used to calculate emmeans. In the following model

fit <- lm(data = data, formula = mpg ~ hp + cyl + gear)

the factors cyl and gear do not interact with other factors. One should therefore get the same result whether they are passed as nuisance or non.nuisance factors to emmeans, when calculating the mean of other factors. You only get that when using equal weights, but not for other type of weights. See the example below.

I have verified that this behaviour appeared first in version 1.8.9, in previous versions all four calls gave the same (and expected) result.

library(emmeans)
packageVersion('emmeans')

data <- mtcars  
data$cyl <- as.factor(data$cyl)
data$gear <- as.factor(data$gear)

fit <- lm(data = data,
          formula = mpg ~ hp + cyl + gear)

wgt <- 'prop'
# --- All four should give the same result
emmeans(fit, specs = ~ hp, weights = wgt)
emmeans(fit, specs = ~ hp, weights = wgt, nuisance = "cyl")
emmeans(fit, specs = ~ hp, weights = wgt, nuisance = "gear")
emmeans(fit, specs = ~ hp, weights = wgt, nuisance = c("gear", "cyl"))
> library(emmeans)
> packageVersion('emmeans')
[1] ‘1.8.9’
> 
> data <- mtcars  
> data$cyl <- as.factor(data$cyl)
> data$gear <- as.factor(data$gear)
> 
> fit <- lm(data = data,
+           formula = mpg ~ hp + cyl + gear)
> 
> wgt <- 'prop'
> # --- All four should give the same result
> emmeans(fit, specs = ~ hp, weights = wgt)
  hp emmean    SE df lower.CL upper.CL
 147   20.1 0.513 26       19     21.1

Results are averaged over the levels of: cyl, gear 
Confidence level used: 0.95 
> emmeans(fit, specs = ~ hp, weights = wgt, nuisance = "cyl")
  hp emmean    SE df lower.CL upper.CL
 147   21.1 0.821 26     19.4     22.8

Results are averaged over the levels of: 1 nuisance factors, gear 
Confidence level used: 0.95 
> emmeans(fit, specs = ~ hp, weights = wgt, nuisance = "gear")
  hp emmean  SE df lower.CL upper.CL
 147   19.6 1.2 26     17.1       22

Results are averaged over the levels of: 1 nuisance factors, cyl 
Confidence level used: 0.95 
> emmeans(fit, specs = ~ hp, weights = wgt, nuisance = c("gear", "cyl"))
  hp emmean    SE df lower.CL upper.CL
 147   20.1 0.513 26       19     21.1

Results are averaged over the levels of: 2 nuisance factors 
Confidence level used: 0.95 

I have verified that if you change wgt <- 'equal' all four statement give the same (and expected) result. Also if you install version 1.8.8 or earlier, all four statements give the same (and expected) result.

I hope that the issue can be resolved.

Thank you for a great package.

rvlenth commented 1 month ago

Thanks for reporting this. As I recall, you also should add the argument wt.nuis = wgt for those where nuisance factors are used -- but that doesn't seem to make a difference. Anyway, even with that added, I don't get the same result for all four calls, which we should.

I have some hints as to what is going wrong, but need to trace it back to where the code changed that made it stop working. The only recent issue that seems related is #503, but that was a very simple change and I don't see how that explains it. I'll keep you apprised.

rvlenth commented 1 month ago

There was a subtle error involving failure to exclude nuisance factors from the grid. For example, if you do

ref_grid(fit, nuisance = "cyl", wt.nuis = "prop") @grid

you will see that cyl is still in the grid, even though it shouldn't be. Its presence causes the wrong weights to be present in the .wgt. column, and that distorts the results going forward from there.

After fixing this, here is what we get with your tests

library(emmeans)
## Welcome to emmeans.
## Caution: You lose important information if you filter this package's results.
## See '? untidy'
packageVersion('emmeans')
## [1] '1.10.4.900003'

data <- mtcars  
data$cyl <- as.factor(data$cyl)
data$gear <- as.factor(data$gear)

fit <- lm(data = data,
          formula = mpg ~ hp + cyl + gear)

wgt <- 'prop'
# --- All four should give the same result
emmeans(fit, specs = ~ hp, weights = wgt)
##   hp emmean    SE df lower.CL upper.CL
##  147   20.1 0.513 26       19     21.1
## 
## Results are averaged over the levels of: cyl, gear 
## Confidence level used: 0.95
emmeans(fit, specs = ~ hp, weights = wgt, nuisance = "cyl")
##   hp emmean    SE df lower.CL upper.CL
##  147   20.1 0.513 26       19     21.1
## 
## Results are averaged over the levels of: 1 nuisance factors, gear 
## Confidence level used: 0.95
emmeans(fit, specs = ~ hp, weights = wgt, nuisance = "gear")
##   hp emmean    SE df lower.CL upper.CL
##  147   20.1 0.513 26       19     21.1
## 
## Results are averaged over the levels of: 1 nuisance factors, cyl 
## Confidence level used: 0.95
emmeans(fit, specs = ~ hp, weights = wgt, nuisance = c("gear", "cyl"))
##   hp emmean    SE df lower.CL upper.CL
##  147   20.1 0.513 26       19     21.1
## 
## Results are averaged over the levels of: 2 nuisance factors 
## Confidence level used: 0.95

Created on 2024-10-02 with reprex v2.1.1

I'll push this up and then you can install from GitHub or wait for the next CRAN update.

rvlenth commented 1 month ago

PS you may disregard my comment about wt.nuis. You do indeed need that with ref_grid(), but emmeans() passes its weights argument as wt.nuis when it makes its own call to ref_grid(). There are so many details that I don't remember...

vaever commented 1 month ago

Thank you for taking care of it so fast.