tidymodels / recipes

Pipeable steps for feature engineering and data preprocessing to prepare for modeling
https://recipes.tidymodels.org

step_novel() doesn't work for a value not seen on training data factor if they're a factor level #1249

Open bcadenato opened 1 year ago

bcadenato commented 1 year ago

The problem

I think this might be a subtle one. If a training set contains a factor predictor, and one of that factor's levels has no observations in the training set, then trying to predict with lm on a data set with an observation that has that value makes predict() exit with an error. This actually happened to me with a data set in modeldata.

I learnt about step_novel() and assumed this would be enough to manage this situation. However, step_novel() will not do anything if the value that is absent from the training data set is still a known level of the factor (i.e. it is part of the set of levels).
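The difference between a factor's declared levels and the values actually observed can be seen in a minimal base R sketch. The toy vector and object names are illustrative assumptions, not taken from Sacramento.

```r
# Toy factor with three declared levels (illustrative data)
city <- factor(c("A", "B", "C", "A"))

# Subsetting keeps the full level set, even for values that no longer occur
train <- city[city != "C"]

declared <- levels(train)                 # declared levels, still includes "C"
observed <- unique(as.character(train))   # values actually present

# Levels that are declared but never observed in the subset
unobserved <- setdiff(declared, observed)
unobserved
```

A check based only on levels() treats "C" as known here, even though no row of the subset carries it.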

However, if I remove the value from the set of levels, predict() throws a warning instead of an error, and step_novel() works. The full reprex below reproduces this behaviour.
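The lm side of the problem can also be sketched without tidymodels. This is a minimal base R illustration on made-up data (the names train, fit and new_obs are assumptions for the example): lm() drops unused levels when it builds its model frame, so a level that is declared but unobserved is unknown to the fitted model.

```r
# Toy training data: level "c" is declared but has no rows (illustrative)
train <- data.frame(
  y = c(1, 2, 3, 4),
  g = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c"))
)
fit <- lm(y ~ g, data = train)

# Levels stored by the fitted model; the unused level is not among them
fit$xlevels$g

# Predicting on an observation with level "c" fails; capture the message
new_obs <- data.frame(g = factor("c", levels = c("a", "b", "c")))
res <- tryCatch(
  predict(fit, new_obs),
  error = function(e) conditionMessage(e)
)
res
```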

Considerations

I appreciate that there are more profound considerations at play here: I could stratify my data set when splitting it between training and testing, I could reset the levels of the factor to accommodate those in the training data set, etc.
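As a rough sketch of the "reset the levels" option, base R's droplevels() can be applied to the training data before the recipe is built. The toy data frame below is an assumption for illustration only.

```r
# Toy training frame: level "C" is declared but unobserved (illustrative)
train <- data.frame(
  price = c(100, 120, 90),
  city  = factor(c("A", "B", "A"), levels = c("A", "B", "C"))
)

# droplevels() resets every factor column to the levels that actually occur
train_reset <- droplevels(train)
levels(train_reset$city)
```

This has the same effect as the as.character() followed by factor() round trip used in the reprex.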

However, I also think there is a reasonable expectation about step_novel() that the function should meet: if a value is not present in the training data set, that value should be transformed into another value such as new.
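To make that expectation concrete, here is a minimal base R sketch. relabel_unobserved() is a hypothetical helper written for this illustration; it is not part of recipes and does not describe how step_novel() is implemented.

```r
# Hypothetical helper, not part of recipes: relabel anything that was not
# observed in training, regardless of the declared factor levels
relabel_unobserved <- function(x, observed, new_level = "new") {
  x_chr <- as.character(x)
  x_chr[!is.na(x_chr) & !(x_chr %in% observed)] <- new_level
  factor(x_chr, levels = c(observed, new_level))
}

# Training factor declares "C" but never observes it (illustrative data)
train_city <- factor(c("A", "B", "A"), levels = c("A", "B", "C"))
observed   <- sort(unique(as.character(train_city)))

# Test data contains the unobserved value
test_city <- factor(c("C", "A"), levels = c("A", "B", "C"))
out <- relabel_unobserved(test_city, observed)
out
```

The key design choice in this sketch is that "known" is defined by the values observed in training rather than by levels().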

Alternatively, perhaps the models supported by the tidymodels framework should handle this situation gracefully, without an error.

Reproducible example

library(tidyverse)
library(tidymodels)

data(Sacramento)

# Create a training set without ANTELOPE as city value 
# and a test set with ANTELOPE as a city value

sacr_tr <- Sacramento %>% 
    filter(! city %in% c("ANTELOPE"))

sacr_te <- Sacramento %>% 
    filter(city %in% c("ANTELOPE"))

# Create a workflow that uses step_novel in the recipe, and fit the model

rec <- recipe(
    price ~ city,
    data = sacr_tr) %>% 
    step_novel(city)

mod <- linear_reg() %>% 
    set_engine("lm") %>% 
    set_mode("regression")

wf <- workflow() %>% 
    add_recipe(rec) %>% 
    add_model(mod)

wf_fit <- wf %>% 
    fit(sacr_tr)

# The model cannot predict on the test set because it had not seen ANTELOPE before as a value, 
# even if ANTELOPE is a level it knows

wf_pred <- wf_fit %>% 
    predict(sacr_te)
#> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor city has new level ANTELOPE

# Remove ANTELOPE level from city set of levels in the training set
# and refit the model with the resulting training set

sacr_tr_fct <- sacr_tr %>% 
    mutate(
        city = city %>% 
            as.character() %>% 
            factor())

rec_fct <- recipe(
    price ~ city,
    data = sacr_tr_fct) %>% 
    step_novel(city)

wf_fct <- wf %>% 
    update_recipe(
        rec_fct)

wf_fct_fit <- wf_fct %>% 
    fit(sacr_tr_fct)

# The model can predict without errors even if it cannot make a prediction
# ANTELOPE level is converted to `new` level and the model can manage it

wf_fct_pred <- wf_fct_fit %>% 
    predict(sacr_te)
#> Warning: Novel levels found in column 'city': 'ANTELOPE'. The levels have been
#> removed, and values have been coerced to 'NA'.

# Compare the baked test set under both recipes: with ANTELOPE still a level
# (wf_fit) step_novel leaves it unchanged; with the level removed (wf_fct_fit)
# step_novel transforms it to the value `new` as expected

wf_fit %>% 
    extract_recipe() %>% 
    bake(sacr_te)
#> # A tibble: 33 × 2
#>    city      price
#>    <fct>     <int>
#>  1 ANTELOPE 126640
#>  2 ANTELOPE 161250
#>  3 ANTELOPE 182716
#>  4 ANTELOPE 194818
#>  5 ANTELOPE 387731
#>  6 ANTELOPE 165000
#>  7 ANTELOPE 180000
#>  8 ANTELOPE 200000
#>  9 ANTELOPE 255000
#> 10 ANTELOPE 261000
#> # ℹ 23 more rows

wf_fct_fit %>% 
    extract_recipe() %>% 
    bake(sacr_te)
#> # A tibble: 33 × 2
#>    city   price
#>    <fct>  <int>
#>  1 new   126640
#>  2 new   161250
#>  3 new   182716
#>  4 new   194818
#>  5 new   387731
#>  6 new   165000
#>  7 new   180000
#>  8 new   200000
#>  9 new   255000
#> 10 new   261000
#> # ℹ 23 more rows

Created on 2023-10-31 with reprex v2.0.2

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.0 (2022-04-22)
#>  os       macOS Big Sur/Monterey 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Madrid
#>  date     2023-10-31
#>  pandoc   3.1.9 @ /usr/local/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version    date (UTC) lib source
#>  backports      1.4.1      2021-12-13 [1] CRAN (R 4.2.0)
#>  broom        * 1.0.4      2023-03-11 [1] CRAN (R 4.2.0)
#>  class          7.3-21     2023-01-23 [1] CRAN (R 4.2.0)
#>  cli            3.6.1      2023-03-23 [1] CRAN (R 4.2.0)
#>  codetools      0.2-19     2023-02-01 [1] CRAN (R 4.2.0)
#>  colorspace     2.1-0      2023-01-23 [1] CRAN (R 4.2.0)
#>  data.table     1.14.8     2023-02-17 [1] CRAN (R 4.2.0)
#>  dials        * 1.2.0      2023-04-03 [1] CRAN (R 4.2.0)
#>  DiceDesign     1.9        2021-02-13 [1] CRAN (R 4.2.0)
#>  digest         0.6.31     2022-12-11 [1] CRAN (R 4.2.0)
#>  dplyr        * 1.1.2      2023-04-20 [1] CRAN (R 4.2.0)
#>  ellipsis       0.3.2      2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate       0.20       2023-01-17 [1] CRAN (R 4.2.0)
#>  fansi          1.0.4      2023-01-22 [1] CRAN (R 4.2.0)
#>  fastmap        1.1.1      2023-02-24 [1] CRAN (R 4.2.0)
#>  forcats      * 1.0.0      2023-01-29 [1] CRAN (R 4.2.0)
#>  foreach        1.5.2      2022-02-02 [1] CRAN (R 4.2.0)
#>  fs             1.6.2      2023-04-25 [1] CRAN (R 4.2.0)
#>  furrr          0.3.1      2022-08-15 [1] CRAN (R 4.2.0)
#>  future         1.32.0     2023-03-07 [1] CRAN (R 4.2.0)
#>  future.apply   1.10.0     2022-11-05 [1] CRAN (R 4.2.0)
#>  generics       0.1.3      2022-07-05 [1] CRAN (R 4.2.0)
#>  ggplot2      * 3.4.2      2023-04-03 [1] CRAN (R 4.2.0)
#>  globals        0.16.2     2022-11-21 [1] CRAN (R 4.2.0)
#>  glue           1.6.2      2022-02-24 [1] CRAN (R 4.2.0)
#>  gower          1.0.1      2022-12-22 [1] CRAN (R 4.2.0)
#>  GPfit          1.0-8      2019-02-08 [1] CRAN (R 4.2.0)
#>  gtable         0.3.3      2023-03-21 [1] CRAN (R 4.2.0)
#>  hardhat        1.3.0      2023-03-30 [1] CRAN (R 4.2.0)
#>  hms            1.1.3      2023-03-21 [1] CRAN (R 4.2.0)
#>  htmltools      0.5.5      2023-03-23 [1] CRAN (R 4.2.0)
#>  infer        * 1.0.4      2022-12-02 [1] CRAN (R 4.2.0)
#>  ipred          0.9-14     2023-03-09 [1] CRAN (R 4.2.0)
#>  iterators      1.0.14     2022-02-05 [1] CRAN (R 4.2.0)
#>  knitr          1.42       2023-01-25 [1] CRAN (R 4.2.0)
#>  lattice        0.21-8     2023-04-05 [1] CRAN (R 4.2.0)
#>  lava           1.7.2.1    2023-02-27 [1] CRAN (R 4.2.0)
#>  lhs            1.1.6      2022-12-17 [1] CRAN (R 4.2.0)
#>  lifecycle      1.0.3      2022-10-07 [1] CRAN (R 4.2.0)
#>  listenv        0.9.0      2022-12-16 [1] CRAN (R 4.2.0)
#>  lubridate    * 1.9.2      2023-02-10 [1] CRAN (R 4.2.0)
#>  magrittr       2.0.3      2022-03-30 [1] CRAN (R 4.2.0)
#>  MASS           7.3-59     2023-04-21 [1] CRAN (R 4.2.0)
#>  Matrix         1.5-4      2023-04-04 [1] CRAN (R 4.2.0)
#>  modeldata    * 1.2.0      2023-08-09 [1] CRAN (R 4.2.0)
#>  munsell        0.5.0      2018-06-12 [1] CRAN (R 4.2.0)
#>  nnet           7.3-18     2022-09-28 [1] CRAN (R 4.2.0)
#>  parallelly     1.35.0     2023-03-23 [1] CRAN (R 4.2.0)
#>  parsnip      * 1.1.0      2023-04-12 [1] CRAN (R 4.2.0)
#>  pillar         1.9.0      2023-03-22 [1] CRAN (R 4.2.0)
#>  pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.2.0)
#>  prodlim        2023.03.31 2023-04-02 [1] CRAN (R 4.2.0)
#>  purrr        * 1.0.1      2023-01-10 [1] CRAN (R 4.2.0)
#>  R.cache        0.16.0     2022-07-21 [1] CRAN (R 4.2.0)
#>  R.methodsS3    1.8.2      2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo           1.25.0     2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils        2.12.2     2022-11-11 [1] CRAN (R 4.2.0)
#>  R6             2.5.1      2021-08-19 [1] CRAN (R 4.2.0)
#>  Rcpp           1.0.10     2023-01-22 [1] CRAN (R 4.2.0)
#>  readr        * 2.1.4      2023-02-10 [1] CRAN (R 4.2.0)
#>  recipes      * 1.0.6      2023-04-25 [1] CRAN (R 4.2.0)
#>  reprex         2.0.2      2022-08-17 [1] CRAN (R 4.2.0)
#>  rlang          1.1.1      2023-04-28 [1] CRAN (R 4.2.0)
#>  rmarkdown      2.21       2023-03-26 [1] CRAN (R 4.2.0)
#>  rpart          4.1.19     2022-10-21 [1] CRAN (R 4.2.0)
#>  rsample      * 1.2.0      2023-08-23 [1] CRAN (R 4.2.0)
#>  rstudioapi     0.14       2022-08-22 [1] CRAN (R 4.2.0)
#>  scales       * 1.2.1      2022-08-20 [1] CRAN (R 4.2.0)
#>  sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi        1.7.12     2023-01-11 [1] CRAN (R 4.2.0)
#>  stringr      * 1.5.0      2022-12-02 [1] CRAN (R 4.2.0)
#>  styler         1.10.2     2023-08-29 [1] CRAN (R 4.2.0)
#>  survival       3.5-5      2023-03-12 [1] CRAN (R 4.2.0)
#>  tibble       * 3.2.1      2023-03-20 [1] CRAN (R 4.2.0)
#>  tidymodels   * 1.0.0      2022-07-13 [1] CRAN (R 4.2.0)
#>  tidyr        * 1.3.0      2023-01-24 [1] CRAN (R 4.2.0)
#>  tidyselect     1.2.0      2022-10-10 [1] CRAN (R 4.2.0)
#>  tidyverse    * 2.0.0      2023-02-22 [1] CRAN (R 4.2.0)
#>  timechange     0.2.0      2023-01-11 [1] CRAN (R 4.2.0)
#>  timeDate       4022.108   2023-01-07 [1] CRAN (R 4.2.0)
#>  tune         * 1.1.1      2023-04-11 [1] CRAN (R 4.2.0)
#>  tzdb           0.3.0      2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8           1.2.3      2023-01-31 [1] CRAN (R 4.2.0)
#>  vctrs          0.6.3      2023-06-14 [1] CRAN (R 4.2.0)
#>  withr          2.5.0      2022-03-03 [1] CRAN (R 4.2.0)
#>  workflows    * 1.1.3      2023-02-22 [1] CRAN (R 4.2.0)
#>  workflowsets * 1.0.1      2023-04-06 [1] CRAN (R 4.2.0)
#>  xfun           0.39       2023-04-20 [1] CRAN (R 4.2.0)
#>  yaml           2.3.7      2023-01-23 [1] CRAN (R 4.2.0)
#>  yardstick    * 1.2.0      2023-04-21 [1] CRAN (R 4.2.0)
#>
#>  [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

EmilHvitfeldt commented 1 year ago

Thanks for reporting! That does appear to be a bug, or at least the wrong way to handle this situation. We will look into it.