Fix handling of factors

ericward-noaa commented 1 year ago

There's a funny situation where if data are filtered, but contain original factor names, the coefficients are estimated for the original names / levels instead of those in the data passed in. For example, here's a simple data frame with 3 levels, and can model them as characters or factors. In the third case I've added an extra level (e.g. perhaps data was filtered out in pre-processing)

df <- data.frame(y = rnorm(100),
                 a_char = sample(c("a","b"), size=100, replace=T))
df$a_fac <- as.factor(df$a_char)
df$a_extra_fac = factor(df$a_fac, levels = c("a","b","c"))

Existing lm/glm/glmmTMB packages all fit these models fine, and don't try to estimate the coefficient for "c"

lm(y ~ a_char, data = df)
lm(y ~ a_fac, data = df)
lm(y ~ a_extra_fac, data = df)

sdmTMB works fine with the first 2 examples using a factor or character. But for the case of the extra level, it tries to estimate the coefficient (and won't converge!)

sdmTMB(y ~ a_char, data = df, spatial="off")
sdmTMB(y ~ a_fac, data = df, spatial="off")
sdmTMB(y ~ a_extra_fac, data = df, spatial="off")

I'll look into how lm/glmm/glmmTMB process these cases

ericward-noaa commented 1 year ago

I think the fix of this is easy. glmmTMB has copied glm in adding this drop.unused.levels to model frames, e.g.

glmmTMB: https://github.com/glmmTMB/glmmTMB/blob/0dffb36351116e68328a0c31bf49ee99f0c92b6b/glmmTMB/R/glmmTMB.R#L1016 glm: https://github.com/SurajGupta/r-source/blob/a28e609e72ed7c47f6ddfbb86c85279a0750f0b7/src/library/stats/R/glm.R#L50

But we haven't added that. Needs to be done after the model frame is created, https://github.com/pbs-assess/sdmTMB/blob/f344e1fbb60497487cd53c9079482842f2ea4791/R/fit.R#L810

ericward-noaa commented 1 year ago

Actually - there's an easier solution than messing with the model.frame. Base R has the droplevels() function, which strips levels from a dataframe. Propose we just use this function on the inputted df, e.g.

droplevels(df)

seananderson commented 1 year ago

Are we sure droplevels() will behave the same in edge conditions like predicting with missing factor levels or new factor levels and that this will behave the same as glm? Maybe we should just use the same approach as glm? And perhaps we should add unit tests to make sure behaviour with prediction edge cases behaves the same as glm?

On the other hand, if we can show that prediction edge cases behave the same then I guess either approach is fine.

ericward-noaa commented 1 year ago

Yeah -- I had done some testing earlier, confirming that droplevels() behaves similarly to glm. I added those to the tests to test-random-intercepts.R, and seems to be behaving as we expect

https://github.com/pbs-assess/sdmTMB/blob/9481696c0f3abb6cca603e5cdff372811b0513f8/tests/testthat/test-random-intercepts.R#L116

pbs-assess / sdmTMB

Fix handling of factors #184