tidymodels / recipes

Pipeable steps for feature engineering and data preprocessing to prepare for modeling
https://recipes.tidymodels.org

Using a recipe that includes the target and tuning #1280

Closed SantiagoD999 closed 7 months ago

SantiagoD999 commented 7 months ago

Good morning, I am using your excellent tidymodels package and have come across an issue. Thank you.

The problem

I'm having trouble using recipes when the goal is to scale both the features and the target, then choose a model after tuning a parameter. I want the scaled target to be used both during tuning and for subsequent prediction. I have tried skip = TRUE, but then the tuning process ignores the target scaling, since the step is skipped when the recipe is baked on the assessment sets (a sketch of that variant follows the example below).

Reproducible example

library(tidymodels)

set.seed(1)
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- rnorm(n)
e <- rnorm(n)

y <- 2000 + 3 * x1 + 8 * x2 + 9 * x3 + 8 * x4 + e

DATA <- tibble(y, x1, x2, x3, x4)

TRAIN <- DATA[1:800, ]
TEST <- DATA[801:n, ]

# Normalize both the outcome and the predictors in one recipe
recipe_norm <- recipe(y ~ ., data = TRAIN) %>%
  step_normalize(all_outcomes()) %>%
  step_normalize(all_predictors())

mlp_norm <- workflow() %>%
  add_model(
    mlp(epochs = tune()) %>%
      set_engine("nnet") %>%
      set_mode("regression")
  ) %>%
  add_recipe(recipe_norm)

mlp_resample <- vfold_cv(TRAIN, v = 5)
mlp_tune <- tune_grid(
  mlp_norm,
  mlp_resample,
  grid = 5,
  control = control_grid(save_pred = TRUE),
  metrics = metric_set(rmse)
)

mlp_norm_fit <- mlp_norm %>%
  finalize_workflow(select_best(mlp_tune, metric = "rmse")) %>%
  fit(TRAIN)

predict(mlp_norm_fit, new_data = TEST)
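
For reference, a sketch of the skip = TRUE variant mentioned above (the exact call I used may have differed slightly):

recipe(y ~ ., data = TRAIN) %>%
  step_normalize(all_outcomes(), skip = TRUE) %>%
  step_normalize(all_predictors())
# With skip = TRUE, the outcome step is applied when the recipe is prepped
# on the training data, but skipped when the recipe is baked on new data,
# including the assessment sets used during tuning.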

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.2 (2022-10-31)
#>  os       macOS Catalina 10.15.5
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/Bogota
#>  date     2024-01-19
#>  pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  cli           3.6.2   2023-12-11 [1] CRAN (R 4.2.2)
#>  clipr         0.8.0   2022-02-22 [1] CRAN (R 4.2.0)
#>  digest        0.6.33  2023-07-07 [1] CRAN (R 4.2.0)
#>  evaluate      0.23    2023-11-01 [1] CRAN (R 4.2.2)
#>  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.2.0)
#>  fs            1.6.3   2023-07-20 [1] CRAN (R 4.2.0)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  htmltools     0.5.7   2023-11-03 [1] CRAN (R 4.2.2)
#>  knitr         1.45    2023-10-30 [1] CRAN (R 4.2.2)
#>  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.2.2)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  purrr         1.0.2   2023-08-10 [1] CRAN (R 4.2.0)
#>  R.cache       0.16.0  2022-07-21 [1] CRAN (R 4.2.0)
#>  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo          1.25.0  2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils       2.12.3  2023-11-18 [1] CRAN (R 4.2.2)
#>  reprex        2.0.2   2022-08-17 [1] CRAN (R 4.2.0)
#>  rlang         1.1.2   2023-11-04 [1] CRAN (R 4.2.0)
#>  rmarkdown     2.25    2023-09-18 [1] CRAN (R 4.2.2)
#>  rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.2.2)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  styler        1.10.2  2023-08-29 [1] CRAN (R 4.2.0)
#>  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.2.2)
#>  withr         3.0.0   2024-01-16 [1] CRAN (R 4.2.2)
#>  xfun          0.41    2023-11-01 [1] CRAN (R 4.2.2)
#>  yaml          2.3.8   2023-12-11 [1] CRAN (R 4.2.2)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
EmilHvitfeldt commented 7 months ago

Hello @SantiagoD999 👋

At prediction time, {workflows} passes only the variables marked as "predictor" to the preprocessor (such as a recipe). If you really want to scale the outcome, you can scale it separately beforehand.

library(tidymodels)

set.seed(1)
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- rnorm(n)
e <- rnorm(n)

y <- 2000 + 3 * x1 + 8 * x2 + 9 * x3 + 8 * x4 + e

DATA <- tibble(y, x1, x2, x3, x4)

TRAIN <- DATA[1:800, ]
TEST <- DATA[801:n, ]

# Prep a recipe that normalizes only the outcome
recipe_outcome <- recipe(y ~ ., data = TRAIN) %>%
  step_normalize(all_outcomes()) %>%
  prep()

TRAIN <- bake(recipe_outcome, TRAIN)
# This is only needed because we know TEST has an outcome.
# On future data this will not be needed.
TEST <- bake(recipe_outcome, TEST)

# The modeling recipe now normalizes only the predictors
recipe_norm <- recipe(y ~ ., data = TRAIN) %>%
  step_normalize(all_predictors())

mlp_norm <- workflow() %>%
  add_model(
    mlp(epochs = tune()) %>%
      set_engine("nnet") %>%
      set_mode("regression")
  ) %>%
  add_recipe(recipe_norm)

mlp_resample <- vfold_cv(TRAIN, v = 5)
mlp_tune <- tune_grid(
  mlp_norm,
  mlp_resample,
  grid = 5,
  control = control_grid(save_pred = TRUE),
  metrics = metric_set(rmse)
)

mlp_norm_fit <- mlp_norm %>%
  finalize_workflow(select_best(mlp_tune, metric = "rmse")) %>%
  fit(TRAIN)

augment(mlp_norm_fit, new_data = TEST)
#> # A tibble: 200 × 6
#>         x1     x2       x3      x4       y   .pred
#>      <dbl>  <dbl>    <dbl>   <dbl>   <dbl>   <dbl>
#>  1 -1.09    0.714 -2.10    -0.728  -1.49   -1.38  
#>  2 -1.83    0.581 -0.0844  -0.247  -0.149  -0.219 
#>  3  0.995  -0.147  0.756   -0.614   0.208   0.227 
#>  4 -0.0119  1.51  -1.58     0.104   0.0199 -0.0561
#>  5 -0.600  -0.280  0.707   -0.801  -0.277  -0.257 
#>  6 -0.178   2.03  -1.05     1.32    1.08    1.07  
#>  7 -0.426  -1.20   0.259    0.0312 -0.474  -0.518 
#>  8  0.997   1.31  -0.00168 -0.824   0.499   0.436 
#>  9  0.728  -0.524 -1.18    -0.867  -1.16   -1.24  
#> 10 -1.73    0.354  1.74    -1.25    0.230   0.188 
#> # ℹ 190 more rows
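
Note that these predictions are on the normalized scale of y. If you need them on the original scale, one option (just a sketch, assuming the setup above; norm_stats, y_mean, and y_sd are illustrative names) is to invert the transformation using the statistics stored in the prepped outcome recipe. tidy() on the step_normalize step returns the mean and sd it used:

norm_stats <- tidy(recipe_outcome, number = 1)
y_mean <- norm_stats$value[norm_stats$terms == "y" & norm_stats$statistic == "mean"]
y_sd <- norm_stats$value[norm_stats$terms == "y" & norm_stats$statistic == "sd"]

augment(mlp_norm_fit, new_data = TEST) %>%
  mutate(.pred_original = .pred * y_sd + y_mean)  # undo (y - mean) / sd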
SantiagoD999 commented 7 months ago

Thank you for your response. I have another question: wouldn't scaling the outcome beforehand produce data leakage in the resampling process, since the standardization would use the outcome from the entire training set rather than only the data available in each fold?

EmilHvitfeldt commented 7 months ago

Good question. I don't think it is. Data leakage is when information about the relationship between the outcome and the predictors is improperly handled, so that the model appears to have better performance than it actually has.

Shifting and scaling the outcome by the same constants in all parts of the pipeline doesn't do that. Imagine adding 10 to the outcome; it should have no effect on the performance.
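
To make that concrete, a tiny illustration with made-up numbers: shifting both the observed and the predicted values by the same constant leaves the residuals, and therefore the RMSE, unchanged, so tuning results are not distorted.

y_obs <- c(1.0, 2.0, 3.0)
y_pred <- c(1.1, 1.9, 3.2)
sqrt(mean((y_obs - y_pred)^2))
#> [1] 0.1414214
sqrt(mean(((y_obs + 10) - (y_pred + 10))^2))
#> [1] 0.1414214

Dividing by a constant does rescale the RMSE values, but it rescales them identically for every candidate model, so the same model is selected either way.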

SantiagoD999 commented 7 months ago

I see your point, thank you for the clarification.

EmilHvitfeldt commented 7 months ago

No problem!

github-actions[bot] commented 7 months ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.