tidymodels / parsnip

A tidy unified interface to models
https://parsnip.tidymodels.org

Recipes not available with gen_additive_mod() #849

Closed: sweiner123 closed this issue 1 year ago

sweiner123 commented 1 year ago

The problem

I'm trying to add a recipe to a workflow with a gen_additive_mod() model. The "Generalized additive models via mgcv" article describes a way of adding a formula to the workflow so that spline terms are passed through to the model. But I would like to use recipes to preprocess my data, and it seems that the add_model( ... , formula = .... ) method is incompatible with recipes. Is there something I'm missing, or do I have to preprocess the data in a different way? Thank you in advance for your help!

Reproducible example

``` r
library(tidymodels)
tidymodels_prefer()

data("Chicago")

n <- nrow(Chicago)

Chicago_train <- Chicago[1:(n - 7), ]
Chicago_test <- Chicago[(n - 6):n, ]

gam_spec <- 
  gen_additive_mod() %>% 
  set_engine("mgcv") %>% 
  set_mode("regression")

gam_recipe <- 
  recipe(ridership ~ Clark_Lake + Quincy_Wells + humidity, data = Chicago_train) %>% 
  step_zv()

gam_workflow <- 
  workflow() %>% 
  add_recipe(gam_recipe) %>% 
  add_variables(outcomes = c("ridership"), predictors = c("Clark_Lake", "Quincy_Wells", "humidity")) %>% 
  add_model(gam_spec , formula = ridership ~ Clark_Lake + Quincy_Wells + s(humidity))
#> Error in `add_variables()`:
#> ! Variables cannot be added when a recipe already exists.

#> Backtrace:
#>      ▆
#>   1. ├─... %>% ...
#>   2. ├─workflows::add_model(...)
#>   3. │ └─workflows:::add_action(x, action, "model")
#>   4. │   └─workflows:::validate_is_workflow(x, call = call)
#>   5. │     └─workflows:::is_workflow(x)
#>   6. └─workflows::add_variables(...)
#>   7.   └─workflows:::add_action(x, action, "variables")
#>   8.     ├─workflows:::check_conflicts(action, x, call = call)
#>   9.     └─workflows:::check_conflicts.action_variables(action, x, call = call)
#>  10.       └─rlang::abort(...)

```

Created on 2022-12-04 with reprex v2.0.2

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.2 (2022-10-31 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.utf8
#>  ctype    English_United States.utf8
#>  tz       America/New_York
#>  date     2022-12-04
#>  pandoc   2.19.2 @ C:/Program Files/RStudio/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version    date (UTC) lib source
#>  assertthat     0.2.1      2019-03-21 [1] CRAN (R 4.2.2)
#>  backports      1.4.1      2021-12-13 [1] CRAN (R 4.2.0)
#>  broom        * 1.0.1      2022-08-29 [1] CRAN (R 4.2.2)
#>  cachem         1.0.6      2021-08-19 [1] CRAN (R 4.2.2)
#>  class          7.3-20     2022-01-16 [2] CRAN (R 4.2.2)
#>  cli            3.4.1      2022-09-23 [1] CRAN (R 4.2.2)
#>  codetools      0.2-18     2020-11-04 [2] CRAN (R 4.2.2)
#>  colorspace     2.0-3      2022-02-21 [1] CRAN (R 4.2.2)
#>  conflicted     1.1.0      2021-11-26 [1] CRAN (R 4.2.2)
#>  DBI            1.1.3      2022-06-18 [1] CRAN (R 4.2.2)
#>  dials        * 1.1.0      2022-11-04 [1] CRAN (R 4.2.2)
#>  DiceDesign     1.9        2021-02-13 [1] CRAN (R 4.2.2)
#>  digest         0.6.30     2022-10-18 [1] CRAN (R 4.2.2)
#>  dplyr        * 1.0.10     2022-09-01 [1] CRAN (R 4.2.2)
#>  evaluate       0.18       2022-11-07 [1] CRAN (R 4.2.2)
#>  fansi          1.0.3      2022-03-24 [1] CRAN (R 4.2.2)
#>  fastmap        1.1.0      2021-01-25 [1] CRAN (R 4.2.2)
#>  foreach        1.5.2      2022-02-02 [1] CRAN (R 4.2.2)
#>  fs             1.5.2      2021-12-08 [1] CRAN (R 4.2.2)
#>  furrr          0.3.1      2022-08-15 [1] CRAN (R 4.2.2)
#>  future         1.29.0     2022-11-06 [1] CRAN (R 4.2.2)
#>  future.apply   1.10.0     2022-11-05 [1] CRAN (R 4.2.2)
#>  generics       0.1.3      2022-07-05 [1] CRAN (R 4.2.2)
#>  ggplot2      * 3.4.0      2022-11-04 [1] CRAN (R 4.2.2)
#>  globals        0.16.2     2022-11-21 [1] CRAN (R 4.2.2)
#>  glue           1.6.2      2022-02-24 [1] CRAN (R 4.2.2)
#>  gower          1.0.0      2022-02-03 [1] CRAN (R 4.2.0)
#>  GPfit          1.0-8      2019-02-08 [1] CRAN (R 4.2.2)
#>  gtable         0.3.1      2022-09-01 [1] CRAN (R 4.2.2)
#>  hardhat        1.2.0      2022-06-30 [1] CRAN (R 4.2.2)
#>  highr          0.9        2021-04-16 [1] CRAN (R 4.2.2)
#>  htmltools      0.5.3      2022-07-18 [1] CRAN (R 4.2.2)
#>  infer        * 1.0.3      2022-08-22 [1] CRAN (R 4.2.2)
#>  ipred          0.9-13     2022-06-02 [1] CRAN (R 4.2.2)
#>  iterators      1.0.14     2022-02-05 [1] CRAN (R 4.2.2)
#>  knitr          1.41       2022-11-18 [1] CRAN (R 4.2.2)
#>  lattice        0.20-45    2021-09-22 [2] CRAN (R 4.2.2)
#>  lava           1.7.0      2022-10-25 [1] CRAN (R 4.2.2)
#>  lhs            1.1.5      2022-03-22 [1] CRAN (R 4.2.2)
#>  lifecycle      1.0.3      2022-10-07 [1] CRAN (R 4.2.2)
#>  listenv        0.8.0      2019-12-05 [1] CRAN (R 4.2.2)
#>  lubridate      1.9.0      2022-11-06 [1] CRAN (R 4.2.2)
#>  magrittr       2.0.3      2022-03-30 [1] CRAN (R 4.2.2)
#>  MASS           7.3-58.1   2022-08-03 [2] CRAN (R 4.2.2)
#>  Matrix         1.5-1      2022-09-13 [2] CRAN (R 4.2.2)
#>  memoise        2.0.1      2021-11-26 [1] CRAN (R 4.2.2)
#>  modeldata    * 1.0.1      2022-09-06 [1] CRAN (R 4.2.2)
#>  munsell        0.5.0      2018-06-12 [1] CRAN (R 4.2.2)
#>  nnet           7.3-18     2022-09-28 [2] CRAN (R 4.2.2)
#>  parallelly     1.32.1     2022-07-21 [1] CRAN (R 4.2.1)
#>  parsnip      * 1.0.3      2022-11-11 [1] CRAN (R 4.2.2)
#>  pillar         1.8.1      2022-08-19 [1] CRAN (R 4.2.2)
#>  pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.2.2)
#>  prodlim        2019.11.13 2019-11-17 [1] CRAN (R 4.2.2)
#>  purrr        * 0.3.5      2022-10-06 [1] CRAN (R 4.2.2)
#>  R.cache        0.16.0     2022-07-21 [1] CRAN (R 4.2.2)
#>  R.methodsS3    1.8.2      2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo           1.25.0     2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils        2.12.2     2022-11-11 [1] CRAN (R 4.2.2)
#>  R6             2.5.1      2021-08-19 [1] CRAN (R 4.2.2)
#>  Rcpp           1.0.9      2022-07-08 [1] CRAN (R 4.2.2)
#>  recipes      * 1.0.3      2022-11-09 [1] CRAN (R 4.2.2)
#>  reprex         2.0.2      2022-08-17 [1] CRAN (R 4.2.2)
#>  rlang          1.0.6      2022-09-24 [1] CRAN (R 4.2.2)
#>  rmarkdown      2.18       2022-11-09 [1] CRAN (R 4.2.2)
#>  rpart          4.1.19     2022-10-21 [2] CRAN (R 4.2.2)
#>  rsample      * 1.1.0      2022-08-08 [1] CRAN (R 4.2.2)
#>  rstudioapi     0.14       2022-08-22 [1] CRAN (R 4.2.2)
#>  scales       * 1.2.1      2022-08-20 [1] CRAN (R 4.2.2)
#>  sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.2.2)
#>  stringi        1.7.8      2022-07-11 [1] CRAN (R 4.2.1)
#>  stringr        1.4.1      2022-08-20 [1] CRAN (R 4.2.2)
#>  styler         1.8.1      2022-11-07 [1] CRAN (R 4.2.2)
#>  survival       3.4-0      2022-08-09 [2] CRAN (R 4.2.2)
#>  tibble       * 3.1.8      2022-07-22 [1] CRAN (R 4.2.2)
#>  tidymodels   * 1.0.0      2022-07-13 [1] CRAN (R 4.2.2)
#>  tidyr        * 1.2.1      2022-09-08 [1] CRAN (R 4.2.2)
#>  tidyselect     1.2.0      2022-10-10 [1] CRAN (R 4.2.2)
#>  timechange     0.1.1      2022-11-04 [1] CRAN (R 4.2.2)
#>  timeDate       4021.106   2022-09-30 [1] CRAN (R 4.2.1)
#>  tune         * 1.0.1      2022-10-09 [1] CRAN (R 4.2.2)
#>  utf8           1.2.2      2021-07-24 [1] CRAN (R 4.2.2)
#>  vctrs          0.5.1      2022-11-16 [1] CRAN (R 4.2.2)
#>  withr          2.5.0      2022-03-03 [1] CRAN (R 4.2.2)
#>  workflows    * 1.1.0      2022-09-26 [1] CRAN (R 4.2.2)
#>  workflowsets * 1.0.0      2022-07-12 [1] CRAN (R 4.2.2)
#>  xfun           0.34       2022-10-18 [1] CRAN (R 4.2.2)
#>  yaml           2.3.6      2022-10-18 [1] CRAN (R 4.2.1)
#>  yardstick    * 1.1.0      2022-09-07 [1] CRAN (R 4.2.2)
#> 
#>  [1] C:/software/Rpackages
#>  [2] C:/Program Files/R/R-4.2.2/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
```

hfrick commented 1 year ago

👋 Each workflow can take one preprocessor and one model. The options for the preprocessor are a formula (with add_formula()), a recipe (with add_recipe()), or "just" listing variables (with add_variables()).

So in your case, you can use only one of add_recipe() or add_variables().
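
For illustration, here is a minimal sketch of the three preprocessor options, each used on its own. The `linear_reg()` specification is only a stand-in (an assumption made to keep the example small), and the `wf_*` names and `conflict_message` are hypothetical. The last step repeats the conflict from the original report and captures its error message instead of stopping.

``` r
library(tidymodels)

data("Chicago")

# Stand-in model for illustration only; any parsnip model spec would do here.
lm_spec <- linear_reg() %>% set_engine("lm")

# Option 1: a formula as the preprocessor
wf_formula <-
  workflow() %>%
  add_formula(ridership ~ Clark_Lake + Quincy_Wells + humidity) %>%
  add_model(lm_spec)

# Option 2: a recipe as the preprocessor
wf_recipe <-
  workflow() %>%
  add_recipe(recipe(ridership ~ Clark_Lake + Quincy_Wells + humidity, data = Chicago)) %>%
  add_model(lm_spec)

# Option 3: "just" listing variables as the preprocessor
wf_variables <-
  workflow() %>%
  add_variables(outcomes = ridership, predictors = c(Clark_Lake, Quincy_Wells, humidity)) %>%
  add_model(lm_spec)

# Adding a second preprocessor to a workflow that already has one is an error.
conflict_message <- tryCatch(
  wf_recipe %>% add_variables(outcomes = ridership, predictors = humidity),
  error = function(e) conditionMessage(e)
)
conflict_message
```

With the package versions listed in the session info above, `conflict_message` should contain the same "Variables cannot be added when a recipe already exists." text as the original error; the exact wording may differ in other versions.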

The formula argument to add_model() captures the model formula passed on to mgcv, which is why you specify the spline term for humidity there (as you have done).

So I think what you want is simply to remove the call to add_variables():

``` r
library(tidymodels)
tidymodels_prefer()

data("Chicago")

n <- nrow(Chicago)

Chicago_train <- Chicago[1:(n - 7), ]
Chicago_test <- Chicago[(n - 6):n, ]

gam_spec <- 
  gen_additive_mod() %>% 
  set_engine("mgcv") %>% 
  set_mode("regression")

gam_recipe <- 
  recipe(ridership ~ Clark_Lake + Quincy_Wells + humidity, data = Chicago_train) %>% 
  step_zv()

gam_workflow <- 
  workflow() %>% 
  add_recipe(gam_recipe) %>% 
  add_model(gam_spec, formula = ridership ~ Clark_Lake + Quincy_Wells + s(humidity))

fit(gam_workflow, data = Chicago_train)
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: gen_additive_mod()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#> 
#> • step_zv()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> ridership ~ Clark_Lake + Quincy_Wells + s(humidity)
#> 
#> Estimated degrees of freedom:
#> 1  total = 4 
#> 
#> GCV score: 9.416346
```

Created on 2022-12-05 with reprex v2.0.2
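
As a follow-up sketch, the same recipe-plus-model-formula workflow can carry an actual preprocessing step and then be used for prediction on the held-out week. The choice of `step_normalize()` is only an assumption for illustration, and `gam_fit` and `gam_preds` are hypothetical names. The model formula is evaluated against the columns the recipe produces, so the step is chosen to leave the column names unchanged.

``` r
library(tidymodels)
tidymodels_prefer()

data("Chicago")

n <- nrow(Chicago)
Chicago_train <- Chicago[1:(n - 7), ]
Chicago_test <- Chicago[(n - 6):n, ]

gam_spec <-
  gen_additive_mod() %>%
  set_engine("mgcv") %>%
  set_mode("regression")

# Illustrative step only: centering and scaling keeps the predictor names,
# so the model formula below still refers to columns that exist after baking.
gam_recipe <-
  recipe(ridership ~ Clark_Lake + Quincy_Wells + humidity, data = Chicago_train) %>%
  step_normalize(all_numeric_predictors())

# One preprocessor (the recipe) and one model (with its mgcv formula).
gam_workflow <-
  workflow() %>%
  add_recipe(gam_recipe) %>%
  add_model(gam_spec, formula = ridership ~ Clark_Lake + Quincy_Wells + s(humidity))

gam_fit <- fit(gam_workflow, data = Chicago_train)

# The trained recipe is applied to new_data before the model predicts.
gam_preds <- predict(gam_fit, new_data = Chicago_test)
gam_preds
```

If a recipe step renames or creates columns (dummy variables, for example), the formula given to add_model() would need to use the new column names.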

simonpcouch commented 1 year ago

Possibly related to #770. :)

sweiner123 commented 1 year ago

Thank you so much for the clarification! Works like a charm now.

github-actions[bot] commented 1 year ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.