rvlenth / emmeans

Estimated marginal means
https://rvlenth.github.io/emmeans/

Inconsistent results with missing levels since #500 #508

Closed by banfai 2 months ago

banfai commented 2 months ago

Hello Russell,

I've found the inconsistency described below. I'm still looking into it and haven't come up with a solution yet, but I wanted to let you know.

Describe the bug

Since the changes in #500, contrasts for a dataset with missing levels are generated differently depending on whether the data argument is passed to emmeans(). I suppose that without the data argument, recover_data() produces only the levels that actually appear, whereas with data the non-existent level shows up among the contrasts (level c in the example below).
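A minimal base-R sketch of where the two sets of levels may diverge, using the same toy data as the reprex below. It only shows that the fitted lm object and the raw data frame disagree about the levels of group. Whether emmeans reads exactly these fields internally is an assumption on my part, not something I have confirmed in the package source.

```r
# Same toy data as in the reprex: level "c" is declared but never observed
x <- data.frame(
  group = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c")),
  value = 1:4
)
m <- lm(value ~ group, data = x)

# Levels as stored in the raw data: the empty level "c" is still present
data_levels <- levels(x$group)

# Levels recorded by the fit: lm() drops unused levels in its model frame
model_levels <- m$xlevels$group

# Level known to the data but not to the fitted model
setdiff(data_levels, model_levels)
#> [1] "c"
```

So any code path that takes its levels from x rather than from the fitted model sees one level more, which would be consistent with the extra c contrasts.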

To reproduce

library(emmeans)
#> Welcome to emmeans.
#> Caution: You lose important information if you filter this package's results.
#> See '? untidy'

x <- data.frame(
  group = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c")),
  value = 1:4
)
x$group
#> [1] a a b b
#> Levels: a b c
m <- lm(value ~ group, data = x)

# default (allow.na.levs = TRUE)
options(emmeans = list())
em1 <- emmeans(m, ~ group)
contrast(em1, "pairwise")
#>  contrast estimate    SE df t.ratio p.value
#>  a - b          -2 0.707  2  -2.828  0.1056
em2 <- emmeans(m, ~ group, data = x)
contrast(em2, "pairwise")
#>  contrast estimate    SE df t.ratio p.value
#>  a - b          -2 0.707  2  -2.828  0.1534
#>  a - c           0 0.000  2     NaN     NaN
#>  b - c           2 0.707  2   2.828  0.1534
#> 
#> P value adjustment: tukey method for comparing a family of 2.56155281280883 estimates

# allow.na.levs = FALSE
options(emmeans = list(allow.na.levs = FALSE))
em3 <- emmeans(m, ~ group)
contrast(em3, "pairwise")
#>  contrast estimate    SE df t.ratio p.value
#>  a - b          -2 0.707  2  -2.828  0.1056
em4 <- emmeans(m, ~ group, data = x)
contrast(em4, "pairwise")
#>  contrast estimate    SE df t.ratio p.value
#>  a - b          -2 0.707  2  -2.828  0.1056

Created on 2024-09-26 with reprex v2.1.1

Expected behavior

I would expect the previous behaviour (i.e. no c contrasts regardless of using data or not).

Additional context

I would suggest adding a test for such a case (something along the lines of the example).
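A rough sketch of what such a check could look like, under the default options. pairwise_labels() is a hypothetical helper written for this illustration, not an emmeans function, and the check encodes the behaviour I expect rather than what the current release does, so consistent may well come out FALSE on the affected version.

```r
library(emmeans)

# Toy data from the report: level "c" is declared but never observed
x <- data.frame(
  group = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c")),
  value = 1:4
)
m <- lm(value ~ group, data = x)

# Hypothetical helper (not part of emmeans): labels of the pairwise contrasts
pairwise_labels <- function(em) {
  as.character(summary(contrast(em, "pairwise"))$contrast)
}

labels_recovered <- pairwise_labels(emmeans(m, ~ group))
labels_with_data <- pairwise_labels(emmeans(m, ~ group, data = x))

# Expected behaviour per this report: same labels either way, none involving "c"
consistent <- identical(labels_recovered, labels_with_data) &&
  !any(grepl("c", labels_with_data, fixed = TRUE))
consistent
```

In a package test, the final line would become an assertion on consistent in whatever test framework the package uses.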

rvlenth commented 2 months ago

Please look at the NEWS file. There is an option you can set to get the old behavior.

banfai commented 2 months ago

> Please look at the NEWS file. There is an option you can set to get the old behavior.

Thanks for the quick reply. Yes, I've seen that, but I still find it odd that the results now differ depending on whether data is used. I have proposed a solution (based on the comment on Stack Overflow) that would satisfy the requirements in #500 and would still be consistent regardless of whether data or recover_data() is used.

As far as I understand, producing contrasts for non-existent levels was not the aim of #500; the current behaviour is only a side effect of that fix.