handle factors automatically

hansvancalster commented 5 years ago

Thanks for this fantastic package which is really helpful for running spatial models with INLA.

The handling of factors is, however, tedious and looks like something that could/should be handled behind the scenes.

The current workflow is something like this:

model.matrix(~myfactor, mydata) %>% # create dummy variable for myfactor
as.data.frame() %>%
select(-1) %>% # drop intercept
bind_cols(mydata) -> mydata

m1_inlabru <- bru(y ~ myfactorlevel1 + myfactorlevel2 + myfactorlevel3, data = goose,
family = "nbinomial")

It would be really helpful and user-friendly if this could be shortened to:

m1_inlabru <- bru(y ~ myfactor, data = goose,
family = "nbinomial")

I suspect there are technical reasons that hinder this kind of implementation for more complex models, but I'd like to ask it anyway.

finnlindgren commented 5 years ago

Thanks for the comments and report! Yes, there are technical reasons inlabru currently exposes more of the issues inherent in factor variables than INLA does, and moving that behind the scenes is very much on our TODO-list. Part of this is due to an INLA issue where NAs in a factor variable are refused, but the inla.stack construction, that in inlabru happens behind the scenes, means that there will always be NAs in the factor variables when handed to inla. This is worked around by telling inla to use factor.strategy=“inla”, which essentially turns each factor level into an indicator dummy, without ensuring identifiability of the full model. Ensuring identifiability is left entirely to the user in such models.

One issue is what the code should actually do. Perhaps the most straightforward would be to always treat the first factor level as reference level, which I think is close to the default INLA behaviour. Model.matrix itself tries to be clever and consider all factor variables at once, but it’s difficult to generalize that to the more general predictor expressions inlabru supports, so the enforced reference level might be the best option. Suggestions and opinions, and examples are welcome!

finnlindgren commented 5 years ago

Also, the backend-clean branch version might actually handle this differently to the CRAN version. If you’re able to test it on that branch, please let us know. backend-clean will hopefully soon become the new release, with a largely rewritten backend to make it easier to support the full range of inla models.

hansvancalster commented 5 years ago

Thanks for the explanations and good to know it is on the TODO list! I think most users would indeed expect default behavior of model.matrix() as this is what happens behind the scenes of lm() and the like. I am not yet familiar enough with the more general predictor expressions that you mention - so I have no opinion on that.

If I have time I try the backend-clean branch. Are you referring to the possibility to use f(..., model = "factor") in that branch?

hrue commented 5 years ago

Thanks for the comments and report! Yes, there are technical reasons inlabru currently exposes more of the issues inherent in factor variables than INLA does, and moving that behind the scenes is very much on our TODO-list. Part of this is due to an INLA issue where NAs in a factor variable are refused, but the inla.stack construction, that in inlabru happens behind the scenes, means that there will always be NAs in the factor variables when handed to inla.

I think this is also related to how R interprets NA in a factor, as you can tell that NA is a level, then it will go through.

a = addNA(as.factor(c(1,2,3,NA))) y = rep(0,length(a)) r=inla(y~-1+a, data = list(y=y,a=a)) r$model.matrix "dsparseModelMatrix": 4 x 4 sparse Matrix of class "dgCMatrix" a1 a2 a3 aNA 1 1 . . . 2 . 1 . . 3 . . 1 . 4 . . . 1 @ assign: 1 1 1 1 @ contrasts: $a [1] "contr.treatment"

a = as.factor(c(1,2,3,NA)) r=inla(y~-1+a, data = list(y=y,a=a)) r$model.matrix "dsparseModelMatrix": 4 x 3 sparse Matrix of class "dgCMatrix" a1 a2 a3 1 1 . . 2 . 1 . 3 . . 1 4 . . . @ assign: 1 1 1 @ contrasts: $a [1] "contr.treatment"

finnlindgren commented 1 year ago

There are now two methods for factor components: Individual component:

name(input, model="factor_full") # Factor input, one coefficient per level
name(input, model="factor_contrast") # Factor input, first level is dropped

General fixed effect model:

name(~ some R formula expression, model = "fixed")

This method allows more flexibility as well as interactions, e.g.

name(~ 0 + x + x:y, model = "fixed")

inlabru-org / inlabru

handle factors automatically #62