tidymodels / recipes

Pipeable steps for feature engineering and data preprocessing to prepare for modeling
https://recipes.tidymodels.org

step_novel won't work after step_unknown: bug or feature #494

Closed: gsimchoni closed this issue 3 years ago

gsimchoni commented 4 years ago

Hi,

(Love this package and all the work you guys are doing)

Using the reproducible example from the docs, it works:

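library(recipes)    # provides recipe(), step_novel(), step_unknown(), prep(), bake()
library(modeldata)  # provides the okc data set
data(okc)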

okc_tr <- okc[1:30000,]
okc_te <- okc[30001:30006,]
okc_te$diet[3] <- "cannibalism"
okc_te$diet[4] <- "vampirism"

rec <- recipe(~ diet + location, data = okc_tr)

rec <- rec %>%
  step_novel(diet, location)
rec <- prep(rec, training = okc_tr)

processed <- bake(rec, okc_te)
tibble(old = okc_te$diet, new = processed$diet)
# A tibble: 6 x 2
  old               new              
  <chr>             <fct>            
1 vegetarian        vegetarian       
2 strictly anything strictly anything
3 cannibalism       new              
4 vampirism         new              
5 NA                NA               
6 NA                NA

Doing step_novel() then step_unknown() also works as expected:

rec <- recipe(~ diet + location, data = okc_tr)

rec <- rec %>%
    step_novel(diet, location) %>% step_unknown(diet, location)
rec <- prep(rec, training = okc_tr)

processed <- bake(rec, okc_te)
tibble(old = okc_te$diet, new = processed$diet)
# A tibble: 6 x 2
  old               new              
  <chr>             <fct>            
1 vegetarian        vegetarian       
2 strictly anything strictly anything
3 cannibalism       new              
4 vampirism         new              
5 NA                unknown          
6 NA                unknown

But step_novel() after step_unknown()...

rec <- recipe(~ diet + location, data = okc_tr)

rec <- rec %>%
    step_unknown(diet, location) %>% step_novel(diet, location)
rec <- prep(rec, training = okc_tr)

processed <- bake(rec, okc_te)
tibble(old = okc_te$diet, new = processed$diet)
# A tibble: 6 x 2
  old               new              
  <chr>             <fct>            
1 vegetarian        vegetarian       
2 strictly anything strictly anything
3 cannibalism       NA               
4 vampirism         NA               
5 NA                unknown          
6 NA                unknown

If this is a bug - OK, if this is a feature and I'm missing something, could you please explain? Thanks.

> sessioninfo::session_info()
- Session info ----------------------------------------------------------------------
 setting  value                       
 version  R version 3.6.1 (2019-07-05)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_Israel.1252         
 ctype    English_Israel.1252         
 tz       Asia/Jerusalem              
 date     2020-04-17                  

- Packages --------------------------------------------------------------------------
 package       * version    date       lib source        
 assertthat      0.2.1      2019-03-21 [1] CRAN (R 3.6.1)
 backports       1.1.6      2020-04-05 [1] CRAN (R 3.6.3)
 base64enc       0.1-3      2015-07-28 [1] CRAN (R 3.6.0)
 bayesplot       1.7.1      2019-12-01 [1] CRAN (R 3.6.1)
 boot            1.3-22     2019-04-02 [2] CRAN (R 3.6.1)
 broom         * 0.5.2      2019-04-07 [1] CRAN (R 3.6.1)
 callr           3.4.3      2020-03-28 [1] CRAN (R 3.6.3)
 cellranger      1.1.0      2016-07-27 [1] CRAN (R 3.6.1)
 class           7.3-15     2019-01-01 [2] CRAN (R 3.6.1)
 cli             2.0.2      2020-02-28 [1] CRAN (R 3.6.3)
 codetools       0.2-16     2018-12-24 [2] CRAN (R 3.6.1)
 colorspace      1.4-1      2019-03-18 [1] CRAN (R 3.6.1)
 colourpicker    1.0        2017-09-27 [1] CRAN (R 3.6.1)
 crayon          1.3.4      2017-09-16 [1] CRAN (R 3.6.1)
 crosstalk       1.0.0      2016-12-21 [1] CRAN (R 3.6.1)
 data.table      1.12.8     2019-12-09 [1] CRAN (R 3.6.3)
 desc            1.2.0      2018-05-01 [1] CRAN (R 3.6.1)
 dials         * 0.0.6      2020-04-03 [1] CRAN (R 3.6.1)
 DiceDesign      1.8-1      2019-07-31 [1] CRAN (R 3.6.1)
 digest          0.6.25     2020-02-23 [1] CRAN (R 3.6.3)
 dplyr         * 0.8.5      2020-03-07 [1] CRAN (R 3.6.3)
 DT              0.7        2019-06-11 [1] CRAN (R 3.6.1)
 dygraphs        1.1.1.6    2018-07-11 [1] CRAN (R 3.6.1)
 ellipsis        0.3.0      2019-09-20 [1] CRAN (R 3.6.1)
 fansi           0.4.1      2020-01-08 [1] CRAN (R 3.6.3)
 farver          2.0.3      2020-01-16 [1] CRAN (R 3.6.3)
 forcats       * 0.4.0      2019-02-17 [1] CRAN (R 3.6.1)
 foreach         1.5.0      2020-03-30 [1] CRAN (R 3.6.3)
 furrr           0.1.0      2018-05-16 [1] CRAN (R 3.6.1)
 future          1.14.0     2019-07-02 [1] CRAN (R 3.6.1)
 generics        0.0.2      2018-11-29 [1] CRAN (R 3.6.1)
 ggmosaic      * 0.2.0      2018-09-12 [1] CRAN (R 3.6.1)
 ggplot2       * 3.3.0      2020-03-05 [1] CRAN (R 3.6.3)
 ggrepel         0.8.1      2019-05-07 [1] CRAN (R 3.6.1)
 ggridges        0.5.1      2018-09-27 [1] CRAN (R 3.6.1)
 glmnet          3.0-1      2019-11-15 [1] CRAN (R 3.6.1)
 globals         0.12.4     2018-10-11 [1] CRAN (R 3.6.0)
 glue          * 1.4.0      2020-04-03 [1] CRAN (R 3.6.3)
 gower           0.2.1      2019-05-14 [1] CRAN (R 3.6.0)
 GPfit           1.0-8      2019-02-08 [1] CRAN (R 3.6.2)
 gridExtra       2.3        2017-09-09 [1] CRAN (R 3.6.1)
 gtable          0.3.0      2019-03-25 [1] CRAN (R 3.6.1)
 gtools          3.8.1      2018-06-26 [1] CRAN (R 3.6.0)
 haven           2.1.1      2019-07-04 [1] CRAN (R 3.6.1)
 hms             0.5.2      2019-10-30 [1] CRAN (R 3.6.1)
 htmltools       0.3.6      2017-04-28 [1] CRAN (R 3.6.1)
 htmlwidgets     1.3        2018-09-30 [1] CRAN (R 3.6.1)
 httpuv          1.5.1      2019-04-05 [1] CRAN (R 3.6.1)
 httr            1.4.1      2019-08-05 [1] CRAN (R 3.6.1)
 igraph          1.2.4.1    2019-04-22 [1] CRAN (R 3.6.1)
 infer         * 0.5.0      2019-09-27 [1] CRAN (R 3.6.1)
 inline          0.3.15     2018-05-18 [1] CRAN (R 3.6.1)
 ipred           0.9-9      2019-04-28 [1] CRAN (R 3.6.1)
 iterators       1.0.12     2019-07-26 [1] CRAN (R 3.6.1)
 janeaustenr     0.1.5      2017-06-10 [1] CRAN (R 3.6.1)
 jsonlite        1.6        2018-12-07 [1] CRAN (R 3.6.1)
 knitr           1.23       2019-05-18 [1] CRAN (R 3.6.1)
 labeling        0.3        2014-08-23 [1] CRAN (R 3.6.0)
 later           1.0.0      2019-10-04 [1] CRAN (R 3.6.1)
 lattice         0.20-38    2018-11-04 [2] CRAN (R 3.6.1)
 lava            1.6.7      2020-03-05 [1] CRAN (R 3.6.3)
 lazyeval        0.2.2      2019-03-15 [1] CRAN (R 3.6.1)
 lhs             1.0.1      2019-02-03 [1] CRAN (R 3.6.2)
 lifecycle       0.2.0      2020-03-06 [1] CRAN (R 3.6.3)
 listenv         0.7.0      2018-01-21 [1] CRAN (R 3.6.1)
 lme4            1.1-21     2019-03-05 [1] CRAN (R 3.6.1)
 loo             2.1.0      2019-03-13 [1] CRAN (R 3.6.1)
 lubridate       1.7.8      2020-04-06 [1] CRAN (R 3.6.3)
 magrittr        1.5        2014-11-22 [1] CRAN (R 3.6.1)
 markdown        1.0        2019-06-07 [1] CRAN (R 3.6.1)
 MASS            7.3-51.4   2019-03-31 [2] CRAN (R 3.6.1)
 Matrix          1.2-17     2019-03-22 [2] CRAN (R 3.6.1)
 matrixStats     0.55.0     2019-09-07 [1] CRAN (R 3.6.1)
 mime            0.7        2019-06-11 [1] CRAN (R 3.6.0)
 miniUI          0.1.1.1    2018-05-18 [1] CRAN (R 3.6.1)
 minqa           1.2.4      2014-10-09 [1] CRAN (R 3.6.1)
 modeldata     * 0.0.1      2019-12-06 [1] CRAN (R 3.6.1)
 modelr          0.1.5      2019-08-08 [1] CRAN (R 3.6.1)
 munsell         0.5.0      2018-06-12 [1] CRAN (R 3.6.1)
 naniar        * 0.4.2      2019-02-15 [1] CRAN (R 3.6.2)
 nlme            3.1-140    2019-05-12 [2] CRAN (R 3.6.1)
 nloptr          1.2.1      2018-10-03 [1] CRAN (R 3.6.1)
 nnet            7.3-12     2016-02-02 [2] CRAN (R 3.6.1)
 packrat         0.5.0      2018-11-14 [1] CRAN (R 3.6.1)
 parsnip       * 0.0.4      2019-11-02 [1] CRAN (R 3.6.1)
 pillar          1.4.3      2019-12-20 [1] CRAN (R 3.6.3)
 pkgbuild        1.0.6      2019-10-09 [1] CRAN (R 3.6.3)
 pkgconfig       2.0.3      2019-09-22 [1] CRAN (R 3.6.1)
 pkgload         1.0.2      2018-10-29 [1] CRAN (R 3.6.1)
 plotly          4.9.0      2019-04-10 [1] CRAN (R 3.6.1)
 plyr            1.8.4      2016-06-08 [1] CRAN (R 3.6.1)
 prettyunits     1.1.1      2020-01-24 [1] CRAN (R 3.6.3)
 pROC            1.15.3     2019-07-21 [1] CRAN (R 3.6.1)
 processx        3.4.2      2020-02-09 [1] CRAN (R 3.6.3)
 prodlim         2019.11.13 2019-11-17 [1] CRAN (R 3.6.3)
 productplots    0.1.1      2016-07-02 [1] CRAN (R 3.6.1)
 promises        1.0.1      2018-04-13 [1] CRAN (R 3.6.1)
 ps              1.3.2      2020-02-13 [1] CRAN (R 3.6.3)
 purrr         * 0.3.3      2019-10-18 [1] CRAN (R 3.6.1)
 R6              2.4.1      2019-11-12 [1] CRAN (R 3.6.1)
 Rcpp            1.0.4.6    2020-04-09 [1] CRAN (R 3.6.3)
 readr         * 1.3.1      2018-12-21 [1] CRAN (R 3.6.1)
 readxl          1.3.1      2019-03-13 [1] CRAN (R 3.6.1)
 recipes       * 0.1.10     2020-03-18 [1] CRAN (R 3.6.3)
 reshape2        1.4.3      2017-12-11 [1] CRAN (R 3.6.1)
 rlang           0.4.5      2020-03-01 [1] CRAN (R 3.6.3)
 rpart           4.1-15     2019-04-12 [2] CRAN (R 3.6.1)
 rprojroot       1.3-2      2018-01-03 [1] CRAN (R 3.6.1)
 rsample       * 0.0.5      2019-07-12 [1] CRAN (R 3.6.1)
 rsconnect       0.8.15     2019-07-22 [1] CRAN (R 3.6.1)
 rstan           2.19.2     2019-07-09 [1] CRAN (R 3.6.1)
 rstanarm        2.19.2     2019-10-03 [1] CRAN (R 3.6.1)
 rstantools      2.0.0      2019-09-15 [1] CRAN (R 3.6.1)
 rstudioapi      0.11       2020-02-07 [1] CRAN (R 3.6.3)
 rvest           0.3.4      2019-05-15 [1] CRAN (R 3.6.1)
 scales        * 1.1.0      2019-11-18 [1] CRAN (R 3.6.3)
 sessioninfo     1.1.1      2018-11-05 [1] CRAN (R 3.6.1)
 shape           1.4.4      2018-02-07 [1] CRAN (R 3.6.0)
 shiny           1.3.2      2019-04-22 [1] CRAN (R 3.6.1)
 shinyjs         1.0        2018-01-08 [1] CRAN (R 3.6.1)
 shinystan       2.5.0      2018-05-01 [1] CRAN (R 3.6.1)
 shinythemes     1.1.2      2018-11-06 [1] CRAN (R 3.6.1)
 SnowballC       0.6.0      2019-01-15 [1] CRAN (R 3.6.0)
 StanHeaders     2.19.0     2019-09-07 [1] CRAN (R 3.6.1)
 stringi         1.4.6      2020-02-17 [1] CRAN (R 3.6.2)
 stringr       * 1.4.0      2019-02-10 [1] CRAN (R 3.6.1)
 survival        2.44-1.1   2019-04-01 [2] CRAN (R 3.6.1)
 testthat        2.3.2      2020-03-02 [1] CRAN (R 3.6.3)
 threejs         0.3.1      2017-08-13 [1] CRAN (R 3.6.1)
 tibble        * 3.0.0      2020-03-30 [1] CRAN (R 3.6.3)
 tidymodels    * 0.0.3      2019-10-04 [1] CRAN (R 3.6.1)
 tidyposterior   0.0.2      2018-11-15 [1] CRAN (R 3.6.1)
 tidypredict     0.4.3      2019-09-03 [1] CRAN (R 3.6.1)
 tidyr         * 1.0.2      2020-01-24 [1] CRAN (R 3.6.3)
 tidyselect      1.0.0      2020-01-27 [1] CRAN (R 3.6.3)
 tidytext        0.2.2      2019-07-29 [1] CRAN (R 3.6.1)
 tidyverse     * 1.2.1      2017-11-14 [1] CRAN (R 3.6.1)
 timeDate        3043.102   2018-02-21 [1] CRAN (R 3.6.0)
 tokenizers      0.2.1      2018-03-29 [1] CRAN (R 3.6.1)
 utf8            1.1.4      2018-05-24 [1] CRAN (R 3.6.1)
 vctrs           0.2.4      2020-03-10 [1] CRAN (R 3.6.3)
 viridisLite     0.3.0      2018-02-01 [1] CRAN (R 3.6.1)
 visdat          0.5.3      2019-02-15 [1] CRAN (R 3.6.2)
 withr           2.1.2      2018-03-15 [1] CRAN (R 3.6.1)
 workflows       0.1.0      2019-12-30 [1] CRAN (R 3.6.2)
 xfun            0.8        2019-06-25 [1] CRAN (R 3.6.1)
 xml2            1.2.2      2019-08-09 [1] CRAN (R 3.6.1)
 xtable          1.8-4      2019-04-21 [1] CRAN (R 3.6.2)
 xts             0.11-2     2018-11-05 [1] CRAN (R 3.6.1)
 yardstick     * 0.0.4      2019-08-26 [1] CRAN (R 3.6.1)
 zoo             1.8-6      2019-05-28 [1] CRAN (R 3.6.1)

topepo commented 4 years ago

I think that it is related to diet being character instead of factor. I'll take a deeper look.

lbenz-mdsol commented 4 years ago

I am seeing something similar even when variables are factor and not character. Thanks for looking into this!

library(recipes)
#> Warning: package 'recipes' was built under R version 3.6.2
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
library(tidyverse)
a <- data.frame(y = 1:4, x = factor(c(letters[1:3], NA)))
b <- data.frame(y = 1:5, x = factor(c(letters[1:4], NA)))

recipe(y ~ ., data = a) %>% 
  step_unknown(all_nominal(), new_level = "missing") %>%
  step_novel(all_nominal()) %>%
  prep() %>% 
  bake(b)
#> # A tibble: 5 x 2
#>   x           y
#>   <fct>   <int>
#> 1 a           1
#> 2 b           2
#> 3 c           3
#> 4 <NA>        4
#> 5 missing     5

recipe(y ~ ., data = a) %>% 
  step_novel(all_nominal()) %>%
  step_unknown(all_nominal(), new_level = "missing") %>%
  prep() %>% 
  bake(b)
#> # A tibble: 5 x 2
#>   x           y
#>   <fct>   <int>
#> 1 a           1
#> 2 b           2
#> 3 c           3
#> 4 new         4
#> 5 missing     5

juliasilge commented 3 years ago

I looked into this a bit today. It happens because step_unknown(), after replacing the NA values, sets the factor levels using the levels already stored in the object:

https://github.com/tidymodels/recipes/blob/abdb5a044fd373648bb99003a8211010a8247d2c/R/unknown.R#L138-L141
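
The effect described here can be reproduced with a small base R sketch. This is only an illustration of the mechanism, not the recipes source code: the variable names (`train_levels`, `x_unknown`, `x_both`) are made up, and the values mirror the factor example earlier in the thread. It assumes the unknown step rebuilds the factor from the training levels plus the new level, and that the novel step only relabels non-missing values.

```r
# Levels seen in the training data (illustrative values, as in the reprex above)
train_levels <- c("a", "b", "c")

# New data: "d" is a novel level, NA is a missing value
x <- c("a", "b", "c", "d", NA)

# Order 1: "unknown" handling first.
# Replace NA with the new level, then rebuild the factor from the
# training levels plus that new level.
x_unknown <- ifelse(is.na(x), "missing", x)
x_unknown <- factor(x_unknown, levels = c(train_levels, "missing"))
# "d" is not among the allowed levels, so factor() silently turns it into NA.
# A later novel-level step that skips missing values has nothing left to relabel.
as.character(x_unknown)
# "a" "b" "c" NA "missing"

# Order 2: "novel" handling first.
# Relabel non-missing values that were not seen in training, keep NA as NA.
x_novel <- ifelse(!is.na(x) & !(x %in% train_levels), "new", x)
# Then replace the remaining NA values and set the levels.
x_both <- ifelse(is.na(x_novel), "missing", x_novel)
x_both <- factor(x_both, levels = c(train_levels, "new", "missing"))
as.character(x_both)
# "a" "b" "c" "new" "missing"
```

In this sketch the novel value is lost in the first ordering because it has already become NA by the time any novel-level check could run, which matches the output reported above.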

In the example above, this means that when step_unknown() comes first, the steps go:

What do we think the best option is? 🤔
