The problem
I think this might be a subtle one. Suppose that:
- the training set contains a variable that is a factor,
- the factor lists a given value among its levels, and
- the training set doesn't contain any observation that has that value.
When predicting with `lm` on a data set that does contain an observation with that value, `predict()` exits with an error. This actually happened to me with a data set in `modeldata`.
I learnt about `step_novel()` and assumed it would be enough to manage this situation. However, `step_novel()` does nothing if the value missing from the training data set is still a known value for the factor (i.e. it's part of the set of levels).
If I instead remove the value from the set of levels, `predict()` throws a warning rather than an error, and `step_novel()` works. Full reprex below to reproduce this behaviour.
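The error itself does not seem to depend on `tidymodels`. A minimal base R sketch, using made-up toy data frames `train` and `test` (hypothetical, not taken from `modeldata`), suggests that `lm()` keeps only the levels it actually observed and `predict()` then rejects the unused one:

```r
# Toy data (hypothetical): level "c" is declared but never observed in training
train <- data.frame(
  y = c(1, 2, 3, 4),
  x = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c"))
)
test <- data.frame(x = factor("c", levels = c("a", "b", "c")))

fit <- lm(y ~ x, data = train)

# lm() records only the levels that actually occur in the training rows
fit$xlevels$x

# predict() then treats "c" as a new level and stops with an error;
# tryCatch() captures the message so the script keeps running
msg <- tryCatch(
  predict(fit, newdata = test),
  error = function(e) conditionMessage(e)
)
msg
```

The captured message has the same shape as the `factor city has new level ANTELOPE` error in the reprex.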
Considerations
I appreciate that there are more profound considerations at play here: I could stratify my data set when splitting it between training and testing, I could reset the levels of the factor to accommodate those in the training data set, etc.
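As an illustration of the "reset the levels" option, here is a small sketch with base R's `droplevels()` on a made-up factor `train_x` (hypothetical, not from `Sacramento`). It is similar in effect to the `as.character()` plus `factor()` conversion used in the reprex, and is only one possible workaround:

```r
# Hypothetical training factor: "c" is declared but unused
train_x <- factor(c("a", "a", "b"), levels = c("a", "b", "c"))

# droplevels() removes levels that have no observations
train_x_reset <- droplevels(train_x)
levels(train_x_reset)

# A value that used to be a known-but-unused level is now genuinely novel
novel <- setdiff(c("a", "c"), levels(train_x_reset))
novel
```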
However, I also think there is something more subtle about what users can reasonably expect from `step_novel()`: if a value is not present in the training data set, that value should be transformed into another value such as `new`.
Alternatively, maybe the models supported by the `tidymodels` framework should handle this situation gracefully, without an error.
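The expectation on `step_novel()` described above can be sketched in base R. The helper `lump_unseen()` below is hypothetical (it is not part of `recipes`) and only illustrates the idea: every value that was not observed in training is mapped to `new`, whether or not it is a declared level.

```r
# Hypothetical helper, not part of recipes:
# map values that were not observed in training to `other`
lump_unseen <- function(x, observed, other = "new") {
  x_chr <- as.character(x)
  unseen <- !is.na(x_chr) & !(x_chr %in% observed)
  x_chr[unseen] <- other
  factor(x_chr, levels = c(observed, other))
}

# Made-up training factor: "c" is a declared level with no observations
train_x <- factor(c("a", "a", "b"), levels = c("a", "b", "c"))
observed <- levels(droplevels(train_x))

# "c" is a known level, but it was never seen in training
test_x <- factor(c("a", "c"), levels = c("a", "b", "c"))
lumped <- lump_unseen(test_x, observed)
lumped
```

Here the decision is based on the observed values rather than on the declared levels, which is the difference from the behaviour reported in this issue.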
Reproducible example
library(tidyverse)
library(tidymodels)
data(Sacramento)
# Create a training set without ANTELOPE as city value
# and a test set with ANTELOPE as a city value
sacr_tr <- Sacramento %>%
  filter(!city %in% c("ANTELOPE"))

sacr_te <- Sacramento %>%
  filter(city %in% c("ANTELOPE"))

# Create a workflow that uses step_novel in the recipe, and fit the model
rec <- recipe(
  price ~ city,
  data = sacr_tr) %>%
  step_novel(city)

mod <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(mod)

wf_fit <- wf %>%
  fit(sacr_tr)

# The model cannot predict on the test set because no training observation
# had ANTELOPE as a value, even though ANTELOPE is a level the factor knows
wf_pred <- wf_fit %>%
  predict(sacr_te)
#> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor city has new level ANTELOPE

# Remove the ANTELOPE level from the set of levels of city in the training set
# and refit the model with the resulting training set
sacr_tr_fct <- sacr_tr %>%
  mutate(
    city = city %>%
      as.character() %>%
      factor())

rec_fct <- recipe(
  price ~ city,
  data = sacr_tr_fct) %>%
  step_novel(city)

wf_fct <- wf %>%
  update_recipe(
    rec_fct)

wf_fct_fit <- wf_fct %>%
  fit(sacr_tr_fct)

# predict() now runs without an error, although it cannot produce a real
# prediction for ANTELOPE: the level is no longer known to the training set,
# so step_novel() can convert it to the `new` level and the model can manage it
wf_fct_pred <- wf_fct_fit %>%
  predict(sacr_te)
#> Warning: Novel levels found in column 'city': 'ANTELOPE'. The levels have been
#> removed, and values have been coerced to 'NA'.

# Baking the test set with each fitted recipe shows the difference:
# when ANTELOPE is still a declared level in the training set (wf_fit),
# step_novel() leaves it unchanged; when the training set doesn't have
# ANTELOPE as a level (wf_fct_fit), step_novel() transforms it to `new` as expected
wf_fit %>%
  extract_recipe() %>%
  bake(sacr_te)
#> # A tibble: 33 × 2
#>    city      price
#>    <fct>     <int>
#>  1 ANTELOPE 126640
#>  2 ANTELOPE 161250
#>  3 ANTELOPE 182716
#>  4 ANTELOPE 194818
#>  5 ANTELOPE 387731
#>  6 ANTELOPE 165000
#>  7 ANTELOPE 180000
#>  8 ANTELOPE 200000
#>  9 ANTELOPE 255000
#> 10 ANTELOPE 261000
#> # ℹ 23 more rows

wf_fct_fit %>%
  extract_recipe() %>%
  bake(sacr_te)
#> # A tibble: 33 × 2
#>    city   price
#>    <fct>  <int>
#>  1 new   126640
#>  2 new   161250
#>  3 new   182716
#>  4 new   194818
#>  5 new   387731
#>  6 new   165000
#>  7 new   180000
#>  8 new   200000
#>  9 new   255000
#> 10 new   261000
#> # ℹ 23 more rows
Created on 2023-10-31 with reprex v2.0.2
Session info
``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.0 (2022-04-22)
#>  os       macOS Big Sur/Monterey 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Madrid
#>  date     2023-10-31
#>  pandoc   3.1.9 @ /usr/local/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version    date (UTC) lib source
#>  backports      1.4.1      2021-12-13 [1] CRAN (R 4.2.0)
#>  broom        * 1.0.4      2023-03-11 [1] CRAN (R 4.2.0)
#>  class          7.3-21     2023-01-23 [1] CRAN (R 4.2.0)
#>  cli            3.6.1      2023-03-23 [1] CRAN (R 4.2.0)
#>  codetools      0.2-19     2023-02-01 [1] CRAN (R 4.2.0)
#>  colorspace     2.1-0      2023-01-23 [1] CRAN (R 4.2.0)
#>  data.table     1.14.8     2023-02-17 [1] CRAN (R 4.2.0)
#>  dials        * 1.2.0      2023-04-03 [1] CRAN (R 4.2.0)
#>  DiceDesign     1.9        2021-02-13 [1] CRAN (R 4.2.0)
#>  digest         0.6.31     2022-12-11 [1] CRAN (R 4.2.0)
#>  dplyr        * 1.1.2      2023-04-20 [1] CRAN (R 4.2.0)
#>  ellipsis       0.3.2      2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate       0.20       2023-01-17 [1] CRAN (R 4.2.0)
#>  fansi          1.0.4      2023-01-22 [1] CRAN (R 4.2.0)
#>  fastmap        1.1.1      2023-02-24 [1] CRAN (R 4.2.0)
#>  forcats      * 1.0.0      2023-01-29 [1] CRAN (R 4.2.0)
#>  foreach        1.5.2      2022-02-02 [1] CRAN (R 4.2.0)
#>  fs             1.6.2      2023-04-25 [1] CRAN (R 4.2.0)
#>  furrr          0.3.1      2022-08-15 [1] CRAN (R 4.2.0)
#>  future         1.32.0     2023-03-07 [1] CRAN (R 4.2.0)
#>  future.apply   1.10.0     2022-11-05 [1] CRAN (R 4.2.0)
#>  generics       0.1.3      2022-07-05 [1] CRAN (R 4.2.0)
#>  ggplot2      * 3.4.2      2023-04-03 [1] CRAN (R 4.2.0)
#>  globals        0.16.2     2022-11-21 [1] CRAN (R 4.2.0)
#>  glue           1.6.2      2022-02-24 [1] CRAN (R 4.2.0)
#>  gower          1.0.1      2022-12-22 [1] CRAN (R 4.2.0)
#>  GPfit          1.0-8      2019-02-08 [1] CRAN (R 4.2.0)
#>  gtable         0.3.3      2023-03-21 [1] CRAN (R 4.2.0)
#>  hardhat        1.3.0      2023-03-30 [1] CRAN (R 4.2.0)
#>  hms            1.1.3      2023-03-21 [1] CRAN (R 4.2.0)
#>  htmltools      0.5.5      2023-03-23 [1] CRAN (R 4.2.0)
#>  infer        * 1.0.4      2022-12-02 [1] CRAN (R 4.2.0)
#>  ipred          0.9-14     2023-03-09 [1] CRAN (R 4.2.0)
#>  iterators      1.0.14     2022-02-05 [1] CRAN (R 4.2.0)
#>  knitr          1.42       2023-01-25 [1] CRAN (R 4.2.0)
#>  lattice        0.21-8     2023-04-05 [1] CRAN (R 4.2.0)
#>  lava           1.7.2.1    2023-02-27 [1] CRAN (R 4.2.0)
#>  lhs            1.1.6      2022-12-17 [1] CRAN (R 4.2.0)
#>  lifecycle      1.0.3      2022-10-07 [1] CRAN (R 4.2.0)
#>  listenv        0.9.0      2022-12-16 [1] CRAN (R 4.2.0)
#>  lubridate    * 1.9.2      2023-02-10 [1] CRAN (R 4.2.0)
#>  magrittr       2.0.3      2022-03-30 [1] CRAN (R 4.2.0)
#>  MASS           7.3-59     2023-04-21 [1] CRAN (R 4.2.0)
#>  Matrix         1.5-4      2023-04-04 [1] CRAN (R 4.2.0)
#>  modeldata    * 1.2.0      2023-08-09 [1] CRAN (R 4.2.0)
#>  munsell        0.5.0      2018-06-12 [1] CRAN (R 4.2.0)
#>  nnet           7.3-18     2022-09-28 [1] CRAN (R 4.2.0)
#>  parallelly     1.35.0     2023-03-23 [1] CRAN (R 4.2.0)
#>  parsnip      * 1.1.0      2023-04-12 [1] CRAN (R 4.2.0)
#>  pillar         1.9.0      2023-03-22 [1] CRAN (R 4.2.0)
#>  pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.2.0)
#>  prodlim        2023.03.31 2023-04-02 [1] CRAN (R 4.2.0)
#>  purrr        * 1.0.1      2023-01-10 [1] CRAN (R 4.2.0)
#>  R.cache        0.16.0     2022-07-21 [1] CRAN (R 4.2.0)
#>  R.methodsS3    1.8.2      2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo           1.25.0     2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils        2.12.2     2022-11-11 [1] CRAN (R 4.2.0)
#>  R6             2.5.1      2021-08-19 [1] CRAN (R 4.2.0)
#>  Rcpp           1.0.10     2023-01-22 [1] CRAN (R 4.2.0)
#>  readr        * 2.1.4      2023-02-10 [1] CRAN (R 4.2.0)
#>  recipes      * 1.0.6      2023-04-25 [1] CRAN (R 4.2.0)
#>  reprex         2.0.2      2022-08-17 [1] CRAN (R 4.2.0)
#>  rlang          1.1.1      2023-04-28 [1] CRAN (R 4.2.0)
#>  rmarkdown      2.21       2023-03-26 [1] CRAN (R 4.2.0)
#>  rpart          4.1.19     2022-10-21 [1] CRAN (R 4.2.0)
#>  rsample      * 1.2.0      2023-08-23 [1] CRAN (R 4.2.0)
#>  rstudioapi     0.14       2022-08-22 [1] CRAN (R 4.2.0)
#>  scales       * 1.2.1      2022-08-20 [1] CRAN (R 4.2.0)
#>  sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi        1.7.12     2023-01-11 [1] CRAN (R 4.2.0)
#>  stringr      * 1.5.0      2022-12-02 [1] CRAN (R 4.2.0)
#>  styler         1.10.2     2023-08-29 [1] CRAN (R 4.2.0)
#>  survival       3.5-5      2023-03-12 [1] CRAN (R 4.2.0)
#>  tibble       * 3.2.1      2023-03-20 [1] CRAN (R 4.2.0)
#>  tidymodels   * 1.0.0      2022-07-13 [1] CRAN (R 4.2.0)
#>  tidyr        * 1.3.0      2023-01-24 [1] CRAN (R 4.2.0)
#>  tidyselect     1.2.0      2022-10-10 [1] CRAN (R 4.2.0)
#>  tidyverse    * 2.0.0      2023-02-22 [1] CRAN (R 4.2.0)
#>  timechange     0.2.0      2023-01-11 [1] CRAN (R 4.2.0)
#>  timeDate       4022.108   2023-01-07 [1] CRAN (R 4.2.0)
#>  tune         * 1.1.1      2023-04-11 [1] CRAN (R 4.2.0)
#>  tzdb           0.3.0      2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8           1.2.3      2023-01-31 [1] CRAN (R 4.2.0)
#>  vctrs          0.6.3      2023-06-14 [1] CRAN (R 4.2.0)
#>  withr          2.5.0      2022-03-03 [1] CRAN (R 4.2.0)
#>  workflows    * 1.1.3      2023-02-22 [1] CRAN (R 4.2.0)
#>  workflowsets * 1.0.1      2023-04-06 [1] CRAN (R 4.2.0)
#>  xfun           0.39       2023-04-20 [1] CRAN (R 4.2.0)
#>  yaml           2.3.7      2023-01-23 [1] CRAN (R 4.2.0)
#>  yardstick    * 1.2.0      2023-04-21 [1] CRAN (R 4.2.0)
#>
#>  [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```