anna-doizy closed this issue 3 months ago
Interesting issue. I had never heard of that function before.
The crux of it is in the code you identified in line 80 of ref_grid. But the call to factor() is needed there for situations where the user specifies a subset of levels, e.g. emmeans(..., at = list(tension = c("L", "H"))), because it re-levels the factor, as you can see in the following:
> idx = which(warpbreaks2$tension %in% c("L", "H"))
> levels(warpbreaks2$tension[idx])
[1] "L" "M" "H"
> levels(factor(warpbreaks2$tension[idx]))
[1] "L" "H"
but unfortunately, this affects one of your factors:
> levels(factor(warpbreaks2$tension_na_level[idx]))
[1] "H" "missing"
> levels(factor(warpbreaks2$tension_na_level2[idx]))
[1] "H"
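For reference, the same level-dropping can be reproduced in base R without forcats. This is only a minimal sketch, not the emmeans code itself: tension_na is a hypothetical stand-in for tension_na_level2, built here with addNA() rather than fct_na_value_to_level().

```r
# Hypothetical stand-in for tension_na_level2: "L" becomes a missing value,
# then NA is promoted to a factor level with base R's addNA().
x <- warpbreaks$tension
x[x == "L"] <- NA
tension_na <- addNA(droplevels(x))

levels(tension_na)                          # NA is included as a level
levels(factor(tension_na))                  # default exclude = NA drops the NA level
levels(factor(tension_na, exclude = NULL))  # exclude = NULL keeps the NA level
```

The default exclude = NA in factor() is what removes the NA level, which is why a plain factor(x) call cannot preserve it.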
I believe that line 80 is not the only place that factor() is used to make sure the levels are aligned. And it is also tricky in that ref_grid() also needs to respect the original levels of the factors so it gets the model matrix right.
I honestly don't see a way around this problem, and suggest that you work around it with explicit re-coding like you did with tension_na_level.
I also would guess that some other packages might have trouble with this kind of thing. The NA code really does mean a missing value, and using NA as a factor level on a par with non-missing levels is really messy.
PS -- I will also look at the SO posting, and possibly comment.
OK, I have looked at this further. I guess I misremembered the reason for that factor() call, as things still work fine when I subset the levels. I removed that and one other factor() call, plus made a change to summary(), and got this to work:
> mod2 = lm(breaks ~ wool * tension_na_level2, data = warpbreaks2)
> ### BTW I suggest NOT using pipes in bug reports as we need to talk about the models ###
> emmeans::emmeans (mod2, ~ wool | tension_na_level2)
tension_na_level2 = M:
wool emmean SE df lower.CL upper.CL
A 24.0 3.65 48 16.7 31.3
B 28.8 3.65 48 21.4 36.1
tension_na_level2 = H:
wool emmean SE df lower.CL upper.CL
A 24.6 3.65 48 17.2 31.9
B 18.8 3.65 48 11.4 26.1
tension_na_level2 = NA:
wool emmean SE df lower.CL upper.CL
A 44.6 3.65 48 37.2 51.9
B 28.2 3.65 48 20.9 35.6
Confidence level used: 0.95
> emmeans::emmeans (mod2, ~ wool | tension_na_level2, at = list(tension_na_level2 = c(NA, "M")))
tension_na_level2 = M:
wool emmean SE df lower.CL upper.CL
A 24.0 3.65 48 16.7 31.3
B 28.8 3.65 48 21.4 36.1
tension_na_level2 = NA:
wool emmean SE df lower.CL upper.CL
A 44.6 3.65 48 37.2 51.9
B 28.2 3.65 48 20.9 35.6
Confidence level used: 0.95
I am leery of this, because I put those factor() calls in for a reason. I am worried about breaking things in other situations that used to work, for example a messy dataset with incomplete cases. So I implemented an option where we can disable this:
> emm_options(allow.na.levs = FALSE)
> emmeans::emmeans (mod2, ~ wool | tension_na_level2, at = list(tension_na_level2 = c(NA, "M")))
Error in X[, nm, drop = FALSE] : subscript out of bounds
All this said, the modified package passes all tests, including all examples and vignettes. And an example I constructed specifically with the idea of breaking the code works just fine. So maybe we're OK.
You can install the updated package from GitHub (see the website) once I push it up.
Wow, thank you for your work! I'm glad that all your tests passed with the modifications and it's still safer with the option you added.
I got it for the pipe, thanks ;)
I agree that NA factor levels are rather messy, and it surprised me a lot to realize it can happen. Do you think I should report the same issue to forcats, suggesting that a note/warning in the documentation would be welcome?
Actually, I'm not sure I will use this function (fct_na_value_to_level) again, as a good old ifelse(is.na(__),"missing",__) works fine to make missing values explicit.
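One caveat with the ifelse() approach: if the column is already a factor, ifelse() returns the internal codes rather than the labels, so converting to character first is safer. A minimal base-R sketch, where x is a hypothetical stand-in for the tension column with "L" set to missing:

```r
# Hypothetical example data: the warpbreaks tension factor with "L" made missing.
x <- warpbreaks$tension
x[x == "L"] <- NA

# Convert to character before ifelse() so labels, not codes, are returned.
chr <- as.character(x)
recoded <- factor(ifelse(is.na(chr), "missing", chr),
                  levels = c("M", "H", "missing"))

levels(recoded)          # "missing" is an ordinary level
levels(factor(recoded))  # re-factoring leaves the levels unchanged
```

Because "missing" is an ordinary level here, a later factor() call does not drop it, unlike an NA level.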
Thanks. I don't really need to report anything to the forcats developer. I'm glad to have gotten this to work.
Hello Russel,
I've submitted an issue here : https://stackoverflow.com/questions/78719144/bug-report-when-emmeans-is-used-along-with-fct-na-value-to-level (I didn't know where to put it at first).
There it is:
I got a quick reply about how ref_grid() is coded, line 80:
I suppose it is not by chance that you decided to use factor(x) instead of x alone? What do you think?
Thank you kindly for your attention (and the existence of your package that I use a lot!)