tidymodels / multilevelmod

Parsnip wrappers for mixed-level and hierarchical models
https://multilevelmod.tidymodels.org/

Combining Feature selection and mixed effects models: Error in eval(predvars, data, env): object ' ' not found #63

Closed AlbertoImg closed 4 months ago

AlbertoImg commented 5 months ago

Hi developers,

I would like to know how to update the formula used in a workflow object during the fitting step, after a feature selection (FS) pre-processing step has been performed. The current issue is that when I run the fit function I get "Error in eval(predvars, data, env) : object 'Var1' not found". This happens because the FS step removed that predictor, but the formula still refers to it. I had to supply the formula through add_model, since I am working with a linear mixed-effects model for classification, and so far I could not find a way to set the random effects ("ID") in a recipe object.

Case example:

predictors_list <- c("Var1", "Var2")
recipe <- recipe(dataset) %>% ... %>% step_select_boruta(all_predictors(), outcome = "Disease")

recipe <-
  recipe %>% add_role("ID", new_role = "predictor")

mixed_effects_formula <- as.formula(
  paste(
    "Disease ~ ",
    paste(c(predictors_list, "(1|ID)"), collapse = " + ")
  )
)

wflow <- workflow() %>%
  add_recipe(recipe) %>%
  add_model(model, formula = mixed_effects_formula)

fitted_model <-
  fit(
    wflow,
    data = data_train
  )
#> Error in eval(predvars, data, env) : object 'Var1' not found

Thanks in advance. Any help is really appreciated.
Best, Alberto

hfrick commented 5 months ago

Thanks for including an example! I can't run it, though, so this is a bit of a general reply: if your feature selection step changes the names of the predictors, e.g., through transformations, you can't use the original predictor names in the model formula (the one you pass to add_model()) because they will no longer be there after the preprocessing.

You could use the dot notation in the formula, i.e., something similar to Disease ~ . + (1|ID) - ID. You'd need to ensure that only the variables you want to use in the model are left after preprocessing, and remove the fixed effect for the ID variable. Here is an illustration of that idea:

library(tidymodels)
library(multilevelmod)

data(sleepstudy, package = "lme4")
# we want to use the formula Reaction ~ Days + (1|Subject)

lmer_spec <- 
  linear_reg() %>% 
  set_engine("lmer")

# recipe here without any further preprocessing/feature engineering
# because the data already only contains the 3 variables we are going to use
rec <- recipe(Reaction ~ ., sleepstudy) 

wflow <- workflow() %>% 
  add_recipe(rec) %>% 
  add_model(lmer_spec, formula = Reaction ~ . -Subject + (1|Subject))

fit(wflow, data = sleepstudy)
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 0 Recipe Steps
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: Reaction ~ . - Subject + (1 | Subject)
#>    Data: data
#> REML criterion at convergence: 1786.465
#> Random effects:
#>  Groups   Name        Std.Dev.
#>  Subject  (Intercept) 37.12   
#>  Residual             30.99   
#> Number of obs: 180, groups:  Subject, 18
#> Fixed Effects:
#> (Intercept)         Days  
#>      251.41        10.47

# same fit as with Reaction ~ Days + (1|Subject)
lmer_spec %>% 
  fit(Reaction ~ Days + (1|Subject), data = sleepstudy)
#> parsnip model object
#> 
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: Reaction ~ Days + (1 | Subject)
#>    Data: data
#> REML criterion at convergence: 1786.465
#> Random effects:
#>  Groups   Name        Std.Dev.
#>  Subject  (Intercept) 37.12   
#>  Residual             30.99   
#> Number of obs: 180, groups:  Subject, 18
#> Fixed Effects:
#> (Intercept)         Days  
#>      251.41        10.47

Created on 2024-01-24 with reprex v2.0.2
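
The reprex above uses a recipe without any steps. As a minimal sketch of the same idea when a step does drop a predictor, the variant below adds a made-up `noise` column to sleepstudy and removes it again with step_rm(), which only stands in for a feature selection step such as step_select_boruta(). The `noise` column and the object names are assumptions for illustration and are not part of the original issue.

```r
library(tidymodels)
library(multilevelmod)

data(sleepstudy, package = "lme4")

# Hypothetical extra predictor, added only so the recipe has something to drop
set.seed(123)
sleep_noise <- sleepstudy
sleep_noise$noise <- rnorm(nrow(sleep_noise))

lmer_spec <-
  linear_reg() %>%
  set_engine("lmer")

# step_rm() stands in for a feature selection step that removes a predictor
rec <- recipe(Reaction ~ ., data = sleep_noise) %>%
  step_rm(noise)

# The model formula never names `noise`, so dropping it cannot break the fit
wflow <- workflow() %>%
  add_recipe(rec) %>%
  add_model(lmer_spec, formula = Reaction ~ . - Subject + (1 | Subject))

fitted_wflow <- fit(wflow, data = sleep_noise)

# The dot expands only over the columns left after preprocessing
fixed_terms <- names(lme4::fixef(extract_fit_engine(fitted_wflow)))
fixed_terms
```

Because the formula refers to the remaining predictors only through the dot, the removed column is never looked up by name during fitting.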

AlbertoImg commented 5 months ago

Hi hfrick, thanks for your answer! I tried it, and the fit function now works with my dataset and workflow (as in your example). However, when I then use augment or predict, I get the following error:

fitted_model <- fit(
  current_workflow,
  data = data_training
)
prediction_fold <- fitted_model %>% augment(new_data_fold)
#> Error in terms.formula(ff) : '.' in formula and no 'data' argument

I tried this as well, but I get the same error:

fitted_model %>% augment(data = data_training, new_data = new_data_fold)

Thanks again.
Best, Alberto
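
The error message suggests that the '.' in the stored formula cannot be expanded at prediction time. One possible way to sidestep this, which was not confirmed in this thread, is to prep the recipe first, read off the predictors that survive, and build an explicit formula from them. The sketch below assumes the sleepstudy data with a made-up `noise` column, and step_rm() as a stand-in for the feature selection step; all object names are hypothetical.

```r
library(tidymodels)
library(multilevelmod)

data(sleepstudy, package = "lme4")

# Hypothetical extra predictor that the preprocessing will remove
set.seed(123)
sleep_noise <- sleepstudy
sleep_noise$noise <- rnorm(nrow(sleep_noise))

lmer_spec <-
  linear_reg() %>%
  set_engine("lmer")

# step_rm() stands in for a feature selection step
rec <- recipe(Reaction ~ ., data = sleep_noise) %>%
  step_rm(noise)

# Prep the recipe to see which predictors are left after preprocessing
prepped <- prep(rec, training = sleep_noise)
kept <- summary(prepped) %>%
  dplyr::filter(role == "predictor") %>%
  dplyr::pull(variable)

# Build an explicit formula: surviving fixed effects plus the random intercept
fixed <- setdiff(kept, "Subject")
explicit_formula <- reformulate(c(fixed, "(1 | Subject)"), response = "Reaction")

wflow <- workflow() %>%
  add_recipe(rec) %>%
  add_model(lmer_spec, formula = explicit_formula)

fitted_wflow <- fit(wflow, data = sleep_noise)

# No dot is left in the model formula when predicting
preds <- predict(fitted_wflow, new_data = sleep_noise)
```

Note that the formula is fixed before fitting, so this only fits a single training set. Under resampling, a feature selection step may keep different predictors in each fold, and the formula would then have to be rebuilt per fold.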

hfrick commented 5 months ago

I can't help you with this one without a proper reprex. Please check out the reprex package for easily making those, and the article on dos and don'ts. GitHub issues are best used for bug reports and feature requests; for general help with getting a piece of code to run, Posit Community is the best place, also because more people see your question there and can chime in.

AlbertoImg commented 5 months ago

Thanks a lot for your support! I will see if I can get some tips in Posit Community as well.
Best, Alberto