Including random effects in a recipe and fit resamples?

Jeffrothschild commented 3 years ago

Hi,

I apologize if this functionality is in-place somewhere already, but can you include the ability to use recipes with a mixed effect model?

I have found a way to do that, but it looks like it cannot be fit using fit_resamples.

For example.

library(multilevelmod)
library(tidymodels)
library(tidyverse)
library(lme4)

testdf <- structure(list(RER = c(0.949, 0.929, 0.935, 0.89, 0.911, 0.85, 
                                 0.868, 0.85, 0.913, 0.969), Time = c(30, 60, 120, 60, 70, 15, 
                                                                      74.711, 50, 140, 40), Intensity = c(58, 72.6, 70, 68, 68.527, 
                                                                                                          60, 70, 45, 70, 70), Study = c("Duhamel 2006", "Helge 1998", 
                                                                                                                                         "Van Zant 1997", "Parcell 1999", "Palmer 1999", "Lane 2015", 
                                                                                                                                         "Helge 1996", "Arkinstall 2004b", "Van Zant 1997", "Tarnopolsky 1995"
                                                                                                          )), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
                                                                                                          )) %>% as.data.frame()

mixed_model_spec <- linear_reg() %>% set_engine("lmer")

mixed_model_fit <- 
  mixed_model_spec %>% 
  fit(RER ~ Time + Intensity  +(1 | Study), data = testdf)

mixed_model_fit 

# That worked well, but creating this as a recipe with a random effect causes an error

basic_recipe <- recipe(RER ~ Time + Intensity + (1 | Study), data = testdf) %>% 
  step_bs(Intensity )

# Alternatively, I can prep/juice the recipe without the random effect:

juiced_recipe <- recipe(RER ~ ., data = testdf) %>% 
  step_bs(Intensity) %>% prep() %>% juice()

#and then run a model like this to take advantage of any recipe steps I might want to add:

test_model <- mixed_model_spec %>% 
  fit(RER ~ . -Study  + (1 | Study), data = juiced_recipe)

test_model 

# But it doesn't seem like there is a way to use fit_resamples on the mixed model spec? I would imagine something like this, which doesn't work because the 'basic_recipe' doesn't run

testfolds <- vfold_cv(testdf, v = 5)
control <- control_resamples(save_pred = TRUE)

model_res <- fit_resamples(mixed_model_spec, basic_recipe, testfolds, control = control)

Thank you

juliasilge commented 3 years ago

To use resampling, you'll need to set up your workflow like this with the special formula argument to add_model(). One thing to keep in mind that your recipe is creating new features:

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(tidyverse)

testdf <- structure(list(RER = c(0.949, 0.929, 0.935, 0.89, 0.911, 0.85, 
                                 0.868, 0.85, 0.913, 0.969), 
                         Time = c(30, 60, 120, 60, 70, 15, 
                                  74.711, 50, 140, 40), 
                         Intensity = c(58, 72.6, 70, 68, 68.527, 
                                       60, 70, 45, 70, 70), 
                         Study = c("Duhamel 2006", "Helge 1998", 
                                   "Van Zant 1997", "Parcell 1999", "Palmer 1999", "Lane 2015", 
                                   "Helge 1996", "Arkinstall 2004b", "Van Zant 1997", "Tarnopolsky 1995"
                         )), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
                         )) %>% as.data.frame()

mixed_rec  <- 
  recipe(RER ~ ., data = testdf) %>% 
  step_bs(Intensity) 

prep(mixed_rec) %>% bake(new_data = NULL)
#> # A tibble: 10 × 6
#>     Time Study              RER Intensity_bs_1 Intensity_bs_2 Intensity_bs_3
#>    <dbl> <fct>            <dbl>          <dbl>          <dbl>          <dbl>
#>  1  30   Duhamel 2006     0.949         0.395           0.352          0.104
#>  2  60   Helge 1998       0.929         0               0              1    
#>  3 120   Van Zant 1997    0.935         0.0241          0.232          0.743
#>  4  60   Parcell 1999     0.89          0.0694          0.347          0.579
#>  5  70   Palmer 1999      0.911         0.0557          0.322          0.619
#>  6  15   Lane 2015        0.85          0.340           0.405          0.161
#>  7  74.7 Helge 1996       0.868         0.0241          0.232          0.743
#>  8  50   Arkinstall 2004b 0.85          0               0              0    
#>  9 140   Van Zant 1997    0.913         0.0241          0.232          0.743
#> 10  40   Tarnopolsky 1995 0.969         0.0241          0.232          0.743

^{Created on 2021-11-09 by the reprex package (v2.0.1)}

The special formula used by lmer will not know how to connect those to the original Intensity:

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(tidyverse)
library(multilevelmod)

testdf <- structure(list(RER = c(0.949, 0.929, 0.935, 0.89, 0.911, 0.85, 
                                 0.868, 0.85, 0.913, 0.969), 
                         Time = c(30, 60, 120, 60, 70, 15, 
                                  74.711, 50, 140, 40), 
                         Intensity = c(58, 72.6, 70, 68, 68.527, 
                                       60, 70, 45, 70, 70), 
                         Study = c("Duhamel 2006", "Helge 1998", 
                                   "Van Zant 1997", "Parcell 1999", "Palmer 1999", "Lane 2015", 
                                   "Helge 1996", "Arkinstall 2004b", "Van Zant 1997", "Tarnopolsky 1995"
                         )), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
                         )) %>% as.data.frame()

mixed_model_spec <- linear_reg() %>% set_engine("lmer")
mixed_rec  <- recipe(RER ~ ., data = testdf) %>% 
  step_bs(Intensity) 

mixed_model_wf <- workflow() %>%
  add_model(mixed_model_spec, formula = RER ~ Time + Intensity  + (1 | Study)) %>%
  add_recipe(mixed_rec)

fit(mixed_model_wf, testdf)
#> Error in eval(predvars, data, env): object 'Intensity' not found
#> Timing stopped at: 0.002 0 0.003

^{Created on 2021-11-09 by the reprex package (v2.0.1)}

So you'll need to account for that. This is a limitation of using the formula argument in a workflow; there's really no way to send in any flexible selectors. From workflows:

In those cases, a formula specifying the model structure must be passed unchanged into the model call itself.

topepo commented 3 years ago

I believe this can work:

library(splines)
mixed_model_wf <- workflow() %>%
  add_model(mixed_model_spec, formula = RER ~ Time + bs(Intensity)  + (1 | Study)) %>%
  add_recipe(mixed_rec)

github-actions[bot] commented 2 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

tidymodels / multilevelmod

Including random effects in a recipe and fit resamples? #16