tidymodels / hardhat

Construct Modeling Packages
https://hardhat.tidymodels.org

"Novel levels" warning when using recipes::step_unknown #131

Closed prockenschaub closed 4 years ago

prockenschaub commented 4 years ago

The problem

My data has missing values in a factor variable, which I handle with recipes::step_unknown(., all_nominal()). When I run tune_grid in this setting, I get warnings about novel levels during prediction:

#> "model 1/1 (predictions): Novel levels found in column '[column_name]': NA. 
#> The levels have been removed, and values have been coerced to 'NA'."

I tracked down the warning to the scream function in the hardhat package, and it seems that everything works fine despite the warning:

library(hardhat)
data <- data.frame(x = factor(c("a", "a", "b", NA, "a", NA, "c"), levels = c("a", "b", "c")))
hardhat::scream(data, ptype = data[0, , drop = FALSE], allow_novel_levels = FALSE)
#> Warning: Novel levels found in column 'x': NA. The levels have been removed, and
#> values have been coerced to 'NA'.
#>      x
#> 1    a
#> 2    a
#> 3    b
#> 4 <NA>
#> 5    a
#> 6 <NA>
#> 7    c
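
As a quick sanity check on the example above, the following base R sketch (it does not call hardhat) compares the levels of the data with the levels of the zero-row prototype passed as ptype. It suggests that no level is genuinely new: NA only occurs as a missing value, never as a declared level.

```r
# Same data as in the example above; base R only.
data <- data.frame(
  x = factor(c("a", "a", "b", NA, "a", NA, "c"), levels = c("a", "b", "c"))
)

# Zero-row prototype: keeps the column type and its levels, drops all rows.
ptype <- data[0, , drop = FALSE]

nrow(ptype)                                 # 0
levels(data$x)                              # "a" "b" "c"
identical(levels(data$x), levels(ptype$x))  # TRUE: data and prototype agree
anyNA(levels(data$x))                       # FALSE: NA is not a level
sum(is.na(data$x))                          # 2: NA only appears as a missing value
```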

Does this belong here (because it should provide scream with a different parameter) or in hardhat's issues (because it is an error in how scream works)?

Reproducible example (with tune)


library(magrittr)
library(recipes)
library(rsample)
library(parsnip)
library(tune)

set.seed(1234)
mtcars_tb <- mtcars %>%
  as_tibble() %>%
  mutate(vs = factor(c(sample(vs, 22), rep(NA_integer_, 10))))

set.seed(1234)
cv_fold_mtc <- vfold_cv(mtcars_tb, v = 2)

lasso_mod <-
  linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet") 

rec <- recipe(mpg ~ disp + vs, data = mtcars_tb) %>%
  step_unknown(all_nominal()) %>%
  step_dummy(all_nominal())

tune_grid(
  rec,
  lasso_mod,
  resamples = cv_fold_mtc,
  control = control_resamples(verbose = TRUE)
)
#> i Fold1: recipe
#> v Fold1: recipe
#> i Fold1: model 1/1
#> v Fold1: model 1/1
#> i Fold1: model 1/1 (predictions)
#> ! Fold1: model 1/1 (predictions): Novel levels found in column 'vs': NA. The leve...
#> i Fold2: recipe
#> v Fold2: recipe
#> i Fold2: model 1/1
#> v Fold2: model 1/1
#> i Fold2: model 1/1 (predictions)
#> ! Fold2: model 1/1 (predictions): Novel levels found in column 'vs': NA. The leve...
#> #  2-fold cross-validation 
#> # A tibble: 2 x 4
#>   splits          id    .metrics          .notes          
#> * <list>          <chr> <list>            <list>          
#> 1 <split [16/16]> Fold1 <tibble [20 x 4]> <tibble [1 x 1]>
#> 2 <split [16/16]> Fold2 <tibble [20 x 4]> <tibble [1 x 1]>
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.6.3 (2020-02-29)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United Kingdom.1252 
#>  ctype    English_United Kingdom.1252 
#>  tz       Europe/Berlin               
#>  date     2020-03-24                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version    date       lib source        
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 3.6.1)
#>  class         7.3-15     2019-01-01 [2] CRAN (R 3.6.3)
#>  cli           2.0.2      2020-02-28 [1] CRAN (R 3.6.3)
#>  codetools     0.2-16     2018-12-24 [2] CRAN (R 3.6.3)
#>  colorspace    1.4-1      2019-03-18 [1] CRAN (R 3.6.1)
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 3.6.1)
#>  dials         0.0.4      2019-12-02 [1] CRAN (R 3.6.3)
#>  DiceDesign    1.8-1      2019-07-31 [1] CRAN (R 3.6.3)
#>  digest        0.6.25     2020-02-23 [1] CRAN (R 3.6.3)
#>  dplyr       * 0.8.5      2020-03-07 [1] CRAN (R 3.6.3)
#>  ellipsis      0.3.0      2019-09-20 [1] CRAN (R 3.6.1)
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 3.6.1)
#>  fansi         0.4.1      2020-01-08 [1] CRAN (R 3.6.2)
#>  foreach       1.4.8      2020-02-09 [1] CRAN (R 3.6.3)
#>  furrr         0.1.0      2018-05-16 [1] CRAN (R 3.6.1)
#>  future        1.16.0     2020-01-16 [1] CRAN (R 3.6.3)
#>  generics      0.0.2      2018-11-29 [1] CRAN (R 3.6.1)
#>  ggplot2       3.3.0      2020-03-05 [1] CRAN (R 3.6.3)
#>  glmnet        3.0-2      2019-12-11 [1] CRAN (R 3.6.2)
#>  globals       0.12.5     2019-12-07 [1] CRAN (R 3.6.1)
#>  glue          1.3.2      2020-03-12 [1] CRAN (R 3.6.3)
#>  gower         0.2.1      2019-05-14 [1] CRAN (R 3.6.0)
#>  GPfit         1.0-8      2019-02-08 [1] CRAN (R 3.6.3)
#>  gtable        0.3.0      2019-03-25 [1] CRAN (R 3.6.1)
#>  hardhat       0.1.2      2020-02-28 [1] CRAN (R 3.6.3)
#>  highr         0.8        2019-03-20 [1] CRAN (R 3.6.1)
#>  htmltools     0.4.0      2019-10-04 [1] CRAN (R 3.6.3)
#>  ipred         0.9-9      2019-04-28 [1] CRAN (R 3.6.1)
#>  iterators     1.0.12     2019-07-26 [1] CRAN (R 3.6.1)
#>  knitr         1.28       2020-02-06 [1] CRAN (R 3.6.3)
#>  lattice       0.20-38    2018-11-04 [2] CRAN (R 3.6.3)
#>  lava          1.6.7      2020-03-05 [1] CRAN (R 3.6.3)
#>  lhs           1.0.1      2019-02-03 [1] CRAN (R 3.6.3)
#>  lifecycle     0.2.0      2020-03-06 [1] CRAN (R 3.6.3)
#>  listenv       0.8.0      2019-12-05 [1] CRAN (R 3.6.3)
#>  lubridate     1.7.4      2018-04-11 [1] CRAN (R 3.6.1)
#>  magrittr    * 1.5        2014-11-22 [1] CRAN (R 3.6.1)
#>  MASS          7.3-51.5   2019-12-20 [2] CRAN (R 3.6.3)
#>  Matrix        1.2-18     2019-11-27 [2] CRAN (R 3.6.3)
#>  munsell       0.5.0      2018-06-12 [1] CRAN (R 3.6.1)
#>  nnet          7.3-12     2016-02-02 [2] CRAN (R 3.6.3)
#>  parsnip     * 0.0.5      2020-01-07 [1] CRAN (R 3.6.3)
#>  pillar        1.4.3      2019-12-20 [1] CRAN (R 3.6.2)
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 3.6.2)
#>  plyr          1.8.6      2020-03-03 [1] CRAN (R 3.6.3)
#>  pROC          1.16.1     2020-01-14 [1] CRAN (R 3.6.3)
#>  prodlim       2019.11.13 2019-11-17 [1] CRAN (R 3.6.3)
#>  purrr         0.3.3      2019-10-18 [1] CRAN (R 3.6.2)
#>  R6            2.4.1      2019-11-12 [1] CRAN (R 3.6.2)
#>  Rcpp          1.0.3      2019-11-08 [1] CRAN (R 3.6.2)
#>  recipes     * 0.1.9      2020-01-07 [1] CRAN (R 3.6.3)
#>  rlang         0.4.5      2020-03-01 [1] CRAN (R 3.6.3)
#>  rmarkdown     2.1        2020-01-20 [1] CRAN (R 3.6.3)
#>  rpart         4.1-15     2019-04-12 [2] CRAN (R 3.6.3)
#>  rsample     * 0.0.5      2019-07-12 [1] CRAN (R 3.6.1)
#>  scales        1.1.0      2019-11-18 [1] CRAN (R 3.6.3)
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.6.1)
#>  shape         1.4.4      2018-02-07 [1] CRAN (R 3.6.0)
#>  stringi       1.4.6      2020-02-17 [1] CRAN (R 3.6.2)
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 3.6.1)
#>  survival      3.1-8      2019-12-03 [2] CRAN (R 3.6.3)
#>  tibble        2.1.3      2019-06-06 [1] CRAN (R 3.6.1)
#>  tidyr       * 1.0.2      2020-01-24 [1] CRAN (R 3.6.3)
#>  tidyselect    1.0.0      2020-01-27 [1] CRAN (R 3.6.2)
#>  timeDate      3043.102   2018-02-21 [1] CRAN (R 3.6.0)
#>  tune        * 0.0.1      2020-02-11 [1] CRAN (R 3.6.3)
#>  utf8          1.1.4      2018-05-24 [1] CRAN (R 3.6.1)
#>  vctrs         0.2.4      2020-03-10 [1] CRAN (R 3.6.3)
#>  withr         2.1.2      2018-03-15 [1] CRAN (R 3.6.1)
#>  workflows     0.1.0      2019-12-30 [1] CRAN (R 3.6.3)
#>  xfun          0.12       2020-01-13 [1] CRAN (R 3.6.3)
#>  yaml          2.2.1      2020-02-01 [1] CRAN (R 3.6.2)
#>  yardstick     0.0.5      2020-01-23 [1] CRAN (R 3.6.3)
#> 
#> [1] C:/Users/rocke/Documents/R/win-library/3.6
#> [2] C:/Program Files/R/R-3.6.3/library

gsimchoni commented 4 years ago

Same here. I would also like to point out that switching to logistic_reg() above (e.g. by predicting factor(am)) or using step_modeimpute() does not help.

Also tried:

svm_mod <-
  svm_rbf(mode = "regression", cost = tune(), rbf_sigma = tune()) %>%
  set_engine("kernlab")

instead of lasso_mod above, getting the same warning.

> sessioninfo::session_info()
- Session info ----------------------------------------------------------------------
 setting  value                       
 version  R version 3.6.1 (2019-07-05)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_Israel.1252         
 ctype    English_Israel.1252         
 tz       Asia/Jerusalem              
 date     2020-05-01                  

- Packages --------------------------------------------------------------------------
 package       * version      date       lib source                             
 assertthat      0.2.1        2019-03-21 [1] CRAN (R 3.6.1)                     
 backports       1.1.6        2020-04-05 [1] CRAN (R 3.6.3)                     
 base64enc       0.1-3        2015-07-28 [1] CRAN (R 3.6.0)                     
 bayesplot       1.7.1        2019-12-01 [1] CRAN (R 3.6.1)                     
 BBmisc          1.11         2017-03-10 [1] CRAN (R 3.6.3)                     
 boot            1.3-22       2019-04-02 [2] CRAN (R 3.6.1)                     
 broom         * 0.5.2        2019-04-07 [1] CRAN (R 3.6.1)                     
 callr           3.4.3        2020-03-28 [1] CRAN (R 3.6.3)                     
 cellranger      1.1.0        2016-07-27 [1] CRAN (R 3.6.1)                     
 checkmate       2.0.0        2020-02-06 [1] CRAN (R 3.6.3)                     
 class           7.3-15       2019-01-01 [2] CRAN (R 3.6.1)                     
 cli             2.0.2        2020-02-28 [1] CRAN (R 3.6.3)                     
 codetools       0.2-16       2018-12-24 [2] CRAN (R 3.6.1)                     
 colorspace      1.4-1        2019-03-18 [1] CRAN (R 3.6.1)                     
 colourpicker    1.0          2017-09-27 [1] CRAN (R 3.6.1)                     
 crayon          1.3.4        2017-09-16 [1] CRAN (R 3.6.1)                     
 crosstalk       1.0.0        2016-12-21 [1] CRAN (R 3.6.1)                     
 data.table      1.12.8       2019-12-09 [1] CRAN (R 3.6.3)                     
 desc            1.2.0        2018-05-01 [1] CRAN (R 3.6.1)                     
 dials         * 0.0.6        2020-04-03 [1] CRAN (R 3.6.1)                     
 DiceDesign      1.8-1        2019-07-31 [1] CRAN (R 3.6.1)                     
 digest          0.6.25       2020-02-23 [1] CRAN (R 3.6.3)                     
 doParallel      1.0.15       2019-08-02 [1] CRAN (R 3.6.3)                     
 dplyr         * 0.8.5        2020-03-07 [1] CRAN (R 3.6.3)                     
 DT              0.7          2019-06-11 [1] CRAN (R 3.6.1)                     
 dygraphs        1.1.1.6      2018-07-11 [1] CRAN (R 3.6.1)                     
 ellipsis        0.3.0        2019-09-20 [1] CRAN (R 3.6.1)                     
 embed         * 0.0.6        2020-03-17 [1] CRAN (R 3.6.3)                     
 emo             0.0.0.9000   2019-11-04 [1] Github (hadley/emo@02a5206)        
 evaluate        0.14         2019-05-28 [1] CRAN (R 3.6.1)                     
 fansi           0.4.1        2020-01-08 [1] CRAN (R 3.6.3)                     
 farver          2.0.3        2020-01-16 [1] CRAN (R 3.6.3)                     
 fastmatch       1.1-0        2017-01-28 [1] CRAN (R 3.6.0)                     
 float           0.2-3        2019-05-31 [1] CRAN (R 3.6.0)                     
 FNN             1.1.3        2019-02-15 [1] CRAN (R 3.6.3)                     
 forcats       * 0.4.0        2019-02-17 [1] CRAN (R 3.6.1)                     
 foreach         1.5.0        2020-03-30 [1] CRAN (R 3.6.3)                     
 furrr           0.1.0        2018-05-16 [1] CRAN (R 3.6.1)                     
 future          1.14.0       2019-07-02 [1] CRAN (R 3.6.1)                     
 generics        0.0.2        2018-11-29 [1] CRAN (R 3.6.1)                     
 ggmosaic      * 0.2.0        2018-09-12 [1] CRAN (R 3.6.1)                     
 ggplot2       * 3.3.0        2020-03-05 [1] CRAN (R 3.6.3)                     
 ggrepel         0.8.1        2019-05-07 [1] CRAN (R 3.6.1)                     
 ggridges        0.5.1        2018-09-27 [1] CRAN (R 3.6.1)                     
 glmnet          3.0-1        2019-11-15 [1] CRAN (R 3.6.1)                     
 globals         0.12.4       2018-10-11 [1] CRAN (R 3.6.0)                     
 glue          * 1.4.0        2020-04-03 [1] CRAN (R 3.6.3)                     
 gower           0.2.1        2019-05-14 [1] CRAN (R 3.6.0)                     
 GPfit           1.0-8        2019-02-08 [1] CRAN (R 3.6.2)                     
 gridExtra       2.3          2017-09-09 [1] CRAN (R 3.6.1)                     
 gtable          0.3.0        2019-03-25 [1] CRAN (R 3.6.1)                     
 gtools          3.8.1        2018-06-26 [1] CRAN (R 3.6.0)                     
 hardhat         0.1.1        2020-01-08 [1] CRAN (R 3.6.2)                     
 haven           2.1.1        2019-07-04 [1] CRAN (R 3.6.1)                     
 hms             0.5.2        2019-10-30 [1] CRAN (R 3.6.1)                     
 htmltools       0.3.6        2017-04-28 [1] CRAN (R 3.6.1)                     
 htmlwidgets     1.3          2018-09-30 [1] CRAN (R 3.6.1)                     
 httpuv          1.5.1        2019-04-05 [1] CRAN (R 3.6.1)                     
 httr            1.4.1        2019-08-05 [1] CRAN (R 3.6.1)                     
 igraph          1.2.4.1      2019-04-22 [1] CRAN (R 3.6.1)                     
 infer         * 0.5.0        2019-09-27 [1] CRAN (R 3.6.1)                     
 inline          0.3.15       2018-05-18 [1] CRAN (R 3.6.1)                     
 ipred           0.9-9        2019-04-28 [1] CRAN (R 3.6.1)                     
 iterators       1.0.12       2019-07-26 [1] CRAN (R 3.6.1)                     
 janeaustenr     0.1.5        2017-06-10 [1] CRAN (R 3.6.1)                     
 jsonlite        1.6          2018-12-07 [1] CRAN (R 3.6.1)                     
 keras           2.2.4.1.9001 2019-09-10 [1] Github (rstudio/keras@95ea0b5)     
 kernlab         0.9-27       2018-08-10 [1] CRAN (R 3.6.0)                     
 knitr           1.23         2019-05-18 [1] CRAN (R 3.6.1)                     
 labeling        0.3          2014-08-23 [1] CRAN (R 3.6.0)                     
 later           1.0.0        2019-10-04 [1] CRAN (R 3.6.1)                     
 lattice         0.20-38      2018-11-04 [2] CRAN (R 3.6.1)                     
 lava            1.6.7        2020-03-05 [1] CRAN (R 3.6.3)                     
 lazyeval        0.2.2        2019-03-15 [1] CRAN (R 3.6.1)                     
 lgr             0.3.4        2020-03-20 [1] CRAN (R 3.6.3)                     
 lhs             1.0.1        2019-02-03 [1] CRAN (R 3.6.2)                     
 lifecycle       0.2.0        2020-03-06 [1] CRAN (R 3.6.3)                     
 listenv         0.7.0        2018-01-21 [1] CRAN (R 3.6.1)                     
 lme4            1.1-21       2019-03-05 [1] CRAN (R 3.6.1)                     
 loo             2.1.0        2019-03-13 [1] CRAN (R 3.6.1)                     
 lubridate       1.7.8        2020-04-06 [1] CRAN (R 3.6.3)                     
 magrittr        1.5          2014-11-22 [1] CRAN (R 3.6.1)                     
 markdown        1.0          2019-06-07 [1] CRAN (R 3.6.1)                     
 MASS            7.3-51.4     2019-03-31 [2] CRAN (R 3.6.1)                     
 Matrix          1.2-17       2019-03-22 [2] CRAN (R 3.6.1)                     
 matrixStats     0.55.0       2019-09-07 [1] CRAN (R 3.6.1)                     
 mgcv            1.8-28       2019-03-21 [2] CRAN (R 3.6.1)                     
 mime            0.7          2019-06-11 [1] CRAN (R 3.6.0)                     
 miniUI          0.1.1.1      2018-05-18 [1] CRAN (R 3.6.1)                     
 minqa           1.2.4        2014-10-09 [1] CRAN (R 3.6.1)                     
 mlapi           0.1.0        2017-12-17 [1] CRAN (R 3.6.3)                     
 mlr             2.17.1       2020-03-24 [1] CRAN (R 3.6.3)                     
 modelr          0.1.5        2019-08-08 [1] CRAN (R 3.6.1)                     
 munsell         0.5.0        2018-06-12 [1] CRAN (R 3.6.1)                     
 naniar        * 0.4.2        2019-02-15 [1] CRAN (R 3.6.2)                     
 nlme            3.1-140      2019-05-12 [2] CRAN (R 3.6.1)                     
 nloptr          1.2.1        2018-10-03 [1] CRAN (R 3.6.1)                     
 nnet            7.3-12       2016-02-02 [2] CRAN (R 3.6.1)                     
 packrat         0.5.0        2018-11-14 [1] CRAN (R 3.6.1)                     
 parallelMap     1.5.0        2020-03-26 [1] CRAN (R 3.6.3)                     
 ParamHelpers    1.14         2020-03-24 [1] CRAN (R 3.6.3)                     
 parsnip       * 0.0.4        2019-11-02 [1] CRAN (R 3.6.1)                     
 pillar          1.4.3        2019-12-20 [1] CRAN (R 3.6.3)                     
 pkgbuild        1.0.6        2019-10-09 [1] CRAN (R 3.6.3)                     
 pkgconfig       2.0.3        2019-09-22 [1] CRAN (R 3.6.1)                     
 pkgload         1.0.2        2018-10-29 [1] CRAN (R 3.6.1)                     
 plotly          4.9.0        2019-04-10 [1] CRAN (R 3.6.1)                     
 plyr            1.8.4        2016-06-08 [1] CRAN (R 3.6.1)                     
 prettyunits     1.1.1        2020-01-24 [1] CRAN (R 3.6.3)                     
 pROC            1.15.3       2019-07-21 [1] CRAN (R 3.6.1)                     
 processx        3.4.2        2020-02-09 [1] CRAN (R 3.6.3)                     
 prodlim         2019.11.13   2019-11-17 [1] CRAN (R 3.6.3)                     
 productplots    0.1.1        2016-07-02 [1] CRAN (R 3.6.1)                     
 promises        1.0.1        2018-04-13 [1] CRAN (R 3.6.1)                     
 ps              1.3.2        2020-02-13 [1] CRAN (R 3.6.3)                     
 purrr         * 0.3.3        2019-10-18 [1] CRAN (R 3.6.1)                     
 R6              2.4.1        2019-11-12 [1] CRAN (R 3.6.1)                     
 RANN            2.6.1        2019-01-08 [1] CRAN (R 3.6.3)                     
 Rcpp            1.0.4.6      2020-04-09 [1] CRAN (R 3.6.3)                     
 readr         * 1.3.1        2018-12-21 [1] CRAN (R 3.6.1)                     
 readxl          1.3.1        2019-03-13 [1] CRAN (R 3.6.1)                     
 recipes       * 0.1.10       2020-03-18 [1] CRAN (R 3.6.3)                     
 reshape2        1.4.3        2017-12-11 [1] CRAN (R 3.6.1)                     
 reticulate      1.13.0-9000  2019-09-10 [1] Github (rstudio/reticulate@f17091b)
 RhpcBLASctl     0.20-17      2020-01-17 [1] CRAN (R 3.6.2)                     
 rlang           0.4.5        2020-03-01 [1] CRAN (R 3.6.3)                     
 rmarkdown       1.14         2019-07-12 [1] CRAN (R 3.6.1)                     
 ROSE            0.0-3        2014-07-15 [1] CRAN (R 3.6.3)                     
 rpart           4.1-15       2019-04-12 [2] CRAN (R 3.6.1)                     
 rprojroot       1.3-2        2018-01-03 [1] CRAN (R 3.6.1)                     
 rsample       * 0.0.5        2019-07-12 [1] CRAN (R 3.6.1)                     
 rsconnect       0.8.15       2019-07-22 [1] CRAN (R 3.6.1)                     
 rsparse         0.4.0        2020-04-01 [1] CRAN (R 3.6.3)                     
 rstan           2.19.2       2019-07-09 [1] CRAN (R 3.6.1)                     
 rstanarm        2.19.2       2019-10-03 [1] CRAN (R 3.6.1)                     
 rstantools      2.0.0        2019-09-15 [1] CRAN (R 3.6.1)                     
 rstudioapi      0.11         2020-02-07 [1] CRAN (R 3.6.3)                     
 rvest           0.3.4        2019-05-15 [1] CRAN (R 3.6.1)                     
 scales        * 1.1.0        2019-11-18 [1] CRAN (R 3.6.3)                     
 sessioninfo     1.1.1        2018-11-05 [1] CRAN (R 3.6.1)                     
 shape           1.4.4        2018-02-07 [1] CRAN (R 3.6.0)                     
 shiny           1.3.2        2019-04-22 [1] CRAN (R 3.6.1)                     
 shinyjs         1.0          2018-01-08 [1] CRAN (R 3.6.1)                     
 shinystan       2.5.0        2018-05-01 [1] CRAN (R 3.6.1)                     
 shinythemes     1.1.2        2018-11-06 [1] CRAN (R 3.6.1)                     
 SnowballC       0.6.0        2019-01-15 [1] CRAN (R 3.6.0)                     
 StanHeaders     2.19.0       2019-09-07 [1] CRAN (R 3.6.1)                     
 stopwords       1.0          2019-07-24 [1] CRAN (R 3.6.1)                     
 stringi         1.4.6        2020-02-17 [1] CRAN (R 3.6.2)                     
 stringr       * 1.4.0        2019-02-10 [1] CRAN (R 3.6.1)                     
 survival        2.44-1.1     2019-04-01 [2] CRAN (R 3.6.1)                     
 tensorflow      1.14.0.9000  2019-09-10 [1] Github (rstudio/tensorflow@5185c97)
 testthat        2.3.2        2020-03-02 [1] CRAN (R 3.6.3)                     
 text2vec        0.6          2020-02-18 [1] CRAN (R 3.6.3)                     
 textfeatures    0.3.3        2019-09-03 [1] CRAN (R 3.6.3)                     
 textrecipes   * 0.2.0        2020-04-14 [1] CRAN (R 3.6.3)                     
 tfruns          1.4          2018-08-25 [1] CRAN (R 3.6.1)                     
 themis        * 0.1.0        2020-01-13 [1] CRAN (R 3.6.3)                     
 threejs         0.3.1        2017-08-13 [1] CRAN (R 3.6.1)                     
 tibble        * 3.0.0        2020-03-30 [1] CRAN (R 3.6.3)                     
 tidymodels    * 0.0.3        2019-10-04 [1] CRAN (R 3.6.1)                     
 tidyposterior   0.0.2        2018-11-15 [1] CRAN (R 3.6.1)                     
 tidypredict     0.4.3        2019-09-03 [1] CRAN (R 3.6.1)                     
 tidyr         * 1.0.2        2020-01-24 [1] CRAN (R 3.6.3)                     
 tidyselect      1.0.0        2020-01-27 [1] CRAN (R 3.6.3)                     
 tidytext      * 0.2.2        2019-07-29 [1] CRAN (R 3.6.1)                     
 tidyverse     * 1.2.1        2017-11-14 [1] CRAN (R 3.6.1)                     
 timeDate        3043.102     2018-02-21 [1] CRAN (R 3.6.0)                     
 tokenizers      0.2.1        2018-03-29 [1] CRAN (R 3.6.1)                     
 tune          * 0.0.1        2020-02-11 [1] CRAN (R 3.6.1)                     
 unbalanced      2.0          2015-06-26 [1] CRAN (R 3.6.3)                     
 utf8            1.1.4        2018-05-24 [1] CRAN (R 3.6.1)                     
 uwot            0.1.8        2020-03-16 [1] CRAN (R 3.6.3)                     
 vctrs           0.2.4        2020-03-10 [1] CRAN (R 3.6.3)                     
 viridisLite     0.3.0        2018-02-01 [1] CRAN (R 3.6.1)                     
 visdat          0.5.3        2019-02-15 [1] CRAN (R 3.6.2)                     
 whisker         0.4          2019-08-28 [1] CRAN (R 3.6.1)                     
 withr           2.1.2        2018-03-15 [1] CRAN (R 3.6.1)                     
 workflows       0.1.0        2019-12-30 [1] CRAN (R 3.6.2)                     
 xaringan        0.13         2019-10-30 [1] CRAN (R 3.6.1)                     
 xfun            0.8          2019-06-25 [1] CRAN (R 3.6.1)                     
 xml2            1.2.2        2019-08-09 [1] CRAN (R 3.6.1)                     
 xtable          1.8-4        2019-04-21 [1] CRAN (R 3.6.2)                     
 xts             0.11-2       2018-11-05 [1] CRAN (R 3.6.1)                     
 yaml            2.2.0        2018-07-25 [1] CRAN (R 3.6.0)                     
 yardstick     * 0.0.4        2019-08-26 [1] CRAN (R 3.6.1)                     
 zeallot         0.1.0        2018-01-28 [1] CRAN (R 3.6.1)                     
 zoo             1.8-6        2019-05-28 [1] CRAN (R 3.6.1)                     

[1] C:/Users/gsimc/Documents/R/win-library/3.6
[2] C:/Program Files/R/R-3.6.1/library

topepo commented 4 years ago

This does seem to be either a workflows or hardhat issue:

library(tidymodels)
#> ── Attaching packages ───────────────────────────────────────────────────────────── tidymodels 0.1.0 ──
#> ✓ broom     0.5.4      ✓ recipes   0.1.12
#> ✓ dials     0.0.6      ✓ rsample   0.0.6 
#> ✓ dplyr     0.8.5      ✓ tibble    3.0.1 
#> ✓ ggplot2   3.3.0      ✓ tune      0.1.0 
#> ✓ infer     0.5.1      ✓ workflows 0.1.0 
#> ✓ parsnip   0.1.0      ✓ yardstick 0.0.5 
#> ✓ purrr     0.3.4
#> Warning: package 'parsnip' was built under R version 3.6.2
#> Warning: package 'rsample' was built under R version 3.6.2
#> Warning: package 'tibble' was built under R version 3.6.2
#> ── Conflicts ──────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard()  masks scales::discard()
#> x dplyr::filter()   masks stats::filter()
#> x dplyr::lag()      masks stats::lag()
#> x ggplot2::margin() masks dials::margin()
#> x recipes::step()   masks stats::step()

set.seed(1234)
mtcars_tb <- mtcars %>%
  as_tibble() %>%
  mutate(vs = factor(c(sample(vs, 22), rep(NA_integer_, 10))))

set.seed(1234)
cv_fold_mtc <- vfold_cv(mtcars_tb, v = 2)

# as an example
split <- cv_fold_mtc$splits[[1]]

lasso_mod <-
  linear_reg(penalty = .01, mixture = 1) %>%
  set_engine("glmnet") 

rec <- recipe(mpg ~ disp + vs, data = analysis(split)) %>%
  step_unknown(all_nominal()) %>%
  step_dummy(all_nominal())

rec_fit <- rec %>% prep()
model_fit <- lasso_mod %>% fit(mpg ~ ., data = juice(rec_fit))
model_pred <- predict(model_fit, bake(rec_fit, assessment(split)))

wflow <- 
  workflow() %>% 
  add_model(lasso_mod) %>% 
  add_recipe(rec)

wflow_fit <- wflow %>% fit(data = analysis(split))
wflow_pred <- predict(wflow_fit, assessment(split))
#> Warning: Novel levels found in column 'vs': NA. The levels have been removed,
#> and values have been coerced to 'NA'.

Created on 2020-05-01 by the reprex package (v0.3.0)

DavisVaughan commented 4 years ago

I think this is more likely a hardhat issue than a workflows issue. Most likely I'm not accounting for NA being ok somewhere in scream().

DavisVaughan commented 4 years ago

Minimal reprex

library(hardhat)
library(vctrs)

df <- data.frame(x = factor(c("x", NA)))
ptype <- vec_ptype(df)

scream(df, ptype = ptype)
#> Warning: Novel levels found in column 'x': NA. The levels have been removed, and
#> values have been coerced to 'NA'.
#>      x
#> 1    x
#> 2 <NA>

Created on 2020-05-01 by the reprex package (v0.3.0)

The main problem is that check_novel_levels.factor() is using unique(x) to get the levels, when it should be using levels(x). unique(x) will pull in NA as a level. Making this change will also require changing check_novel_levels.character(), which currently uses the same code path as factors. The merged code path is the reason I tried to use unique() in the first place.
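
The difference between unique(x) and levels(x) can be seen in base R alone. The sketch below is only an illustration, not hardhat's actual implementation: novel_levels_factor() and novel_levels_character() are hypothetical helper names, and dropping NA in the character path is an assumption about how that path might be handled.

```r
x <- factor(c("x", NA))
ptype_levels <- "x"  # levels recorded in the prototype

# unique() works on the values, so the missing value becomes a candidate level.
seen_by_unique <- as.character(unique(x))  # "x" NA
# levels() only returns the declared levels.
seen_by_levels <- levels(x)                # "x"

setdiff(seen_by_unique, ptype_levels)  # NA: would be reported as a novel level
setdiff(seen_by_levels, ptype_levels)  # character(0): nothing novel

# Hypothetical helpers (not hardhat code) with separate paths per type.
novel_levels_factor <- function(x, known) {
  setdiff(levels(x), known)
}
novel_levels_character <- function(x, known) {
  # Assumption: missing values are not treated as levels for character input.
  setdiff(unique(x[!is.na(x)]), known)
}

novel_levels_factor(x, ptype_levels)                   # character(0)
novel_levels_character(c("x", "y", NA), ptype_levels)  # "y"
```

Keeping the two paths separate means the factor path never looks at the values at all, so a missing value cannot be mistaken for a level there.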

github-actions[bot] commented 3 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.