Support for multilevel formula

cimentadaj commented 4 years ago

Hi!

Thanks for hardhat and your valuable work for the community!

I'm interesting in integrating hardhat with multilevel models but I'm not sure the current behavior of hardhat is expected. I've read the vignette and I see that it supports formulas such as cbind(y1, y2) ~ x1 appropriately but it doesn't know how to handle multilevel syntax. It creates new variables for (Days | Subject) when in fact they should be normal predictors. See reprex:

library(hardhat)
library(lme4)
data(sleepstudy, package = "lme4")
hardhat::mold(Reaction ~ Days + (Days | Subject), data = sleepstudy)
#> Warning in Ops.factor(Days, Subject): '|' not meaningful for factors
#> $predictors
#> # A tibble: 180 x 3
#>     Days `Days | SubjectFALSE` `Days | SubjectTRUE`
#>    <dbl>                 <dbl>                <dbl>
#>  1     0                    NA                   NA
#>  2     1                    NA                   NA
#>  3     2                    NA                   NA
#>  4     3                    NA                   NA
#>  5     4                    NA                   NA
#>  6     5                    NA                   NA
#>  7     6                    NA                   NA
#>  8     7                    NA                   NA
#>  9     8                    NA                   NA
#> 10     9                    NA                   NA
#> # … with 170 more rows
#> 
#> $outcomes
#> # A tibble: 180 x 1
#>    Reaction
#>       <dbl>
#>  1     250.
#>  2     259.
#>  3     251.
#>  4     321.
#>  5     357.
#>  6     415.
#>  7     382.
#>  8     290.
#>  9     431.
#> 10     466.
#> # … with 170 more rows
#> 
#> $blueprint
#> Formula blueprint: 
#>  
#> # Predictors: 2 
#>   # Outcomes: 1 
#>    Intercept: FALSE 
#> Novel Levels: FALSE 
#>   Indicators: TRUE 
#> 
#> $extras
#> $extras$offset
#> NULL

lme4 has support for extracing multilevel notation, so perhaps it could be integrated via lme4. For example: lme4::findbars(Reaction ~ Days + (Days | Subject)).

DavisVaughan commented 4 years ago

It looks like you are using this in combination with workflows / parsnip in https://github.com/cimentadaj/tidyflow/issues/15

hardhat's mold function just uses basic model.matrix() infrastructure with a few tweaks. It won't handle "special" formula syntax like multilevel formulas because there are just too many types to support and it would rely on too many other packages.

The correct way to pass the formula through when you are using it with workflows is to pass it through add_model(formula = ). Then you use add_formula() or add_recipe() to only specify terms.

It is currently easiest to do this with add_recipe() because add_formula() will try and expand all factors to dummy variables for multilevel models unless you specify indicators = "none" in the blueprint. This will get more intuitive with add_variables(y = Reaction, x = c(Days, Subject)) in https://github.com/tidymodels/workflows/issues/34

library(hardhat)
library(lme4)
library(multilevelmod)
library(workflows)
library(recipes)

data(sleepstudy, package = "lme4")

mixed_model_spec <- linear_reg() %>% set_engine("lmer")

# //////////////////////////////////////////////////////////////////////////////

# Tell hardhat not to expand factors into dummy variables
bp <- default_formula_blueprint(indicators = "none")

# We just use `add_formula()` to specify terms, then `add_model()` contains
# the real "model formula"
wf <- workflow() %>%
  add_formula(Reaction ~ Days + Subject, blueprint = bp) %>%
  add_model(mixed_model_spec, formula = Reaction ~ Days + (Days | Subject))

fit(wf, sleepstudy)
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: linear_reg()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> Reaction ~ Days + Subject
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: Reaction ~ Days + (Days | Subject)
#>    Data: data
#> REML criterion at convergence: 1743.628
#> Random effects:
#>  Groups   Name        Std.Dev. Corr
#>  Subject  (Intercept) 24.741       
#>           Days         5.922   0.07
#>  Residual             25.592       
#> Number of obs: 180, groups:  Subject, 18
#> Fixed Effects:
#> (Intercept)         Days  
#>      251.41        10.47

# //////////////////////////////////////////////////////////////////////////////

# 0-step recipe to just specify terms
rec <- recipe(Reaction ~ Days + Subject, sleepstudy)

wf2 <- workflow() %>%
  add_recipe(rec) %>%
  add_model(mixed_model_spec, formula = Reaction ~ Days + (Days | Subject))

fit(wf2, sleepstudy)
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 0 Recipe Steps
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: Reaction ~ Days + (Days | Subject)
#>    Data: data
#> REML criterion at convergence: 1743.628
#> Random effects:
#>  Groups   Name        Std.Dev. Corr
#>  Subject  (Intercept) 24.741       
#>           Days         5.922   0.07
#>  Residual             25.592       
#> Number of obs: 180, groups:  Subject, 18
#> Fixed Effects:
#> (Intercept)         Days  
#>      251.41        10.47

DavisVaughan commented 4 years ago

add_variables() is in the development version of workflows. Otherwise I don't think there is much else to do here

github-actions[bot] commented 3 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

tidymodels / hardhat

Support for multilevel formula #147