Inconsistency when `emmeans()` is used along with `fct_na_value_to_level()`

anna-doizy commented 3 months ago

Hello Russel,

I've submitted an issue here : https://stackoverflow.com/questions/78719144/bug-report-when-emmeans-is-used-along-with-fct-na-value-to-level (I didn't know where to put it at first).

There it is:

# I was dealing with na values in some factors and I decided to use the fct_na_value_to_level() function
warpbreaks2 <- warpbreaks |> 
  dplyr::mutate(
    tension_na_value = forcats::fct_na_level_to_value(tension, "L"), # just to simulate some missing values in my datatset
    tension_na_level = forcats::fct_na_value_to_level(tension_na_value, "missing"), # I want to make the missing values explicit for my later model
    tension_na_level2 = forcats::fct_na_value_to_level(tension_na_value) # this is another try, without renaming the new explicit level
  )

# For your information
levels(warpbreaks2$tension_na_level)
#> [1] "M"       "H"       "missing"
levels(warpbreaks2$tension_na_level2)
#> [1] "M" "H" NA

# First try : everything is ok
lm(breaks ~ wool * tension_na_level, data = warpbreaks2) |> 
  emmeans::emmeans (~ wool | tension_na_level)
#> tension_na_level = M:
#>  wool emmean   SE df lower.CL upper.CL
#>  A      24.0 3.65 48     16.7     31.3
#>  B      28.8 3.65 48     21.4     36.1
#> 
#> tension_na_level = H:
#>  wool emmean   SE df lower.CL upper.CL
#>  A      24.6 3.65 48     17.2     31.9
#>  B      18.8 3.65 48     11.4     26.1
#> 
#> tension_na_level = missing:
#>  wool emmean   SE df lower.CL upper.CL
#>  A      44.6 3.65 48     37.2     51.9
#>  B      28.2 3.65 48     20.9     35.6
#> 
#> Confidence level used: 0.95

# Second try throws an error. the behaviour of emmeans() is not consistant, but maybe it's an issue with how fct_na_value_to_level() is written?
lm(breaks ~ wool * tension_na_level2, data = warpbreaks2) |> 
  emmeans::emmeans (~ wool | tension_na_level2)
#> Error in X[, nm, drop = FALSE]: indice hors limites

I got a quick reply about how ref_grid() is coded, line 80 :

if (is.factor(x) && !(nm %in% coerced$covariates)) 
      xlev[[nm]] = levels(factor(x))

I suppose it is no chance that you decided to use factor(x) instead of x alone? What to you think?

Thank you kindly for your attention (and the existence of your package that I use a lot!)

rvlenth commented 3 months ago

Interesting issue. I had never heard of that function before.

The crux of it is in the code you identified in line 80 of ref_grid. But the call to factor() is needed there for situations where the user specifies a subset of levels, e.g. emmeans(..., at = list(tension = c("L", "H")) because it re-levels the factor, as you can see in the following:

> idx = which(warpbreaks2$tension %in% c("L", "H"))
> levels(warpbreaks2$tension[idx])
[1] "L" "M" "H"
> levels(factor(warpbreaks2$tension[idx]))
[1] "L" "H"

but unfortunately, this affects one of your factors:

> levels(factor(warpbreaks2$tension_na_level[idx]))
[1] "H"       "missing"
> levels(factor(warpbreaks2$tension_na_level2[idx]))
[1] "H"

I believe that line 80 is not the only place that factor() is used to make sure the levels are aligned. And it is also tricky in that ref_grid() also needs to respect the original levels of the factors so it gets the model matrix right.

I honestly don't see a way around this problem,, and suggest that you work around it with explicit re-coding like you did with tension_na_level.

I also would guess that some other packages might have trouble with this kind of thing. The NA code really does mean a missing value, and using NA as a factor level on a par with non-missing levels is really messy.

rvlenth commented 3 months ago

PS -- I will also look at the SO posting, and possibly comment.

rvlenth commented 3 months ago

OK, I have looked at this further. I guess I mis-remember the reason for that factor() call, as things still work fine when I subset the levels. I removed that and one other factor() call, plus made a change to summary(), and got this to work:

> mod2 = lm(breaks ~ wool * tension_na_level2, data = warpbreaks2)
> ### BTW I suggest NOT using pipes in bug reports as we need to talk about the models ###

> emmeans::emmeans (mod2, ~ wool | tension_na_level2)
tension_na_level2 = M:
 wool emmean   SE df lower.CL upper.CL
 A      24.0 3.65 48     16.7     31.3
 B      28.8 3.65 48     21.4     36.1

tension_na_level2 = H:
 wool emmean   SE df lower.CL upper.CL
 A      24.6 3.65 48     17.2     31.9
 B      18.8 3.65 48     11.4     26.1

tension_na_level2 = NA:
 wool emmean   SE df lower.CL upper.CL
 A      44.6 3.65 48     37.2     51.9
 B      28.2 3.65 48     20.9     35.6

Confidence level used: 0.95 

> emmeans::emmeans (mod2, ~ wool | tension_na_level2, at = list(tension_na_level2  = c(NA, "M")))
tension_na_level2 = M:
 wool emmean   SE df lower.CL upper.CL
 A      24.0 3.65 48     16.7     31.3
 B      28.8 3.65 48     21.4     36.1

tension_na_level2 = NA:
 wool emmean   SE df lower.CL upper.CL
 A      44.6 3.65 48     37.2     51.9
 B      28.2 3.65 48     20.9     35.6

Confidence level used: 0.95

I am leery of this, because I put those factor() calls in for a reason. I am worried about breaking things in other situations that used to work, for example maybe a messy dataset where there are incomplete cases. So I implemented an option where we can disable this:

> emm_options(allow.na.levs = FALSE)
> emmeans::emmeans (mod2, ~ wool | tension_na_level2, at = list(tension_na_level2  = c(NA, "M")))
Error in X[, nm, drop = FALSE] : subscript out of bounds

All this said, the modified package passes all tests, including all examples and vignettes. And an example I constructed specifically with the idea of beaking the code works just fine. So maybe we're OK.

You can install the updated package from GitHub (see the website) once I push it up.

anna-doizy commented 3 months ago

Wow, thank you for your work! I'm glad that all your tests passed with the modifications and it's still safer with the option you added.

I got it for the pipe, thanks ;)

I agree that NA factor levels are rather messy, and it surprised me a lot to realize it can happen. Do you think I should report to forcats the same issue, suggesting that a note/warning in the documentation would be welcome? Actually, I'm not sure to use this function again (fct_na_value_to_level), as a good old ifelse(is.na(__),"missing",__) works fine to make missing values explicit.

rvlenth commented 3 months ago

Thanks. I don't really need to report anything to the forcats developer. I'm glad to gave gotten this to work.

rvlenth / emmeans

Inconsistency when `emmeans()` is used along with `fct_na_value_to_level()` #500