tidymodels / recipes

Pipeable steps for feature engineering and data preprocessing to prepare for modeling
https://recipes.tidymodels.org

"model 1/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop(\"'charvec' has non-NA entries of different number of characters\"): missing value where TRUE/FALSE needed" #474

Closed konradsemsch closed 4 years ago

konradsemsch commented 4 years ago

I came across this issue when trying to tune a glmnet model for classification, and I'm unable to track down the cause. Prepping the dataset shows that there are no NAs before the model is fitted. The issue seems to occur during tuning, when the function tries to make the CV predictions.

Minimal, reproducible example:

library(modeldata)
library(tidymodels)
library(tidyverse)
library(magrittr)

data("okc")
colnames(okc) <- tolower(names(okc))

okc <- sample_n(okc, 1000)

strata <- "class"
split <- initial_split(okc, prop = 0.8, strata = strata)

train <- training(split)
test <- testing(split)

cv <- vfold_cv(
  train,
  v = 5,
  repeats = 1,
  strata = strata
)

recipe_info <- train %>%
  recipe() %>%

  ### Roles assignment
  update_role(strata, new_role = "outcome") %>% 
  update_role(-one_of(strata), new_role = "predictors")

recipe_var_date <- summary(recipe_info) %>% 
  filter(type == "date") %>% 
  pull(variable)

recipe_var_numeric <- summary(recipe_info) %>% 
  filter(type == "numeric") %>% 
  pull(variable)

recipe <- recipe_info %>% 
  ### Imputation
  # Numeric predictors
  step_medianimpute(all_numeric(), all_predictors()) %>%

  # Categorical predictors
  step_modeimpute(all_nominal(), all_predictors()) %>%

  # Time predictors
  step_mutate_at(one_of(recipe_var_date), fn = as.factor) %>% 
  step_unknown(date) %>% 
  step_mutate_at(one_of(recipe_var_date), fn = as.Date) %>% 

  ### Handling time predictors
  step_date(one_of(recipe_var_date)) %>%
  step_holiday(one_of(recipe_var_date)) %>% 

  ### Individual transformations (optional)

  ### Lumping infrequent categories (optional)
  step_other(all_nominal(), all_predictors(), other = "infrequent") %>% 

  ### Dummyfying (optional features)
  step_dummy(all_nominal(), all_predictors(), -all_outcomes()) %>% 

  ### Interactions (optional)

  ### Normalization (optional)
  step_normalize(one_of(recipe_var_numeric)) %>% 

  ### Multivariate transformations (optional)

  ### Removing unnecessary predictors
  step_rm(has_type("date"))

  ### Checks

# Check the structure of the input file
prep(recipe) %>% juice() %>% glimpse()

model <- logistic_reg(
  penalty = tune(),
  mixture = tune()
  ) %>%
  set_mode("classification") %>%
  set_engine("glmnet")

grid <- grid_max_entropy(
  penalty(),
  mixture(),
  size = 10,
  variogram_range = 1
)

workflow <- workflow() %>%
  add_recipe(recipe) %>%
  add_model(model)

tuning <- tune_grid(
  workflow,
  resamples = cv,
  grid = grid,
  metrics = metric_set(roc_auc),
  control = control_grid(verbose = TRUE)
)
i Fold1: recipe
! Fold1: recipe: The following variables are not factor vectors and will be ignored: `date_year`, `date_LaborDay...
✓ Fold1: recipe
i Fold1: model  1/10
✓ Fold1: model  1/10
i Fold1: model  1/10 (predictions)
x Fold1: model  1/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold1: model  2/10
✓ Fold1: model  2/10
i Fold1: model  2/10 (predictions)
x Fold1: model  2/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold1: model  3/10
✓ Fold1: model  3/10
i Fold1: model  3/10 (predictions)
x Fold1: model  3/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold1: model  4/10
✓ Fold1: model  4/10
i Fold1: model  4/10 (predictions)
x Fold1: model  4/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold1: model  5/10
✓ Fold1: model  5/10
i Fold1: model  5/10 (predictions)
x Fold1: model  5/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold1: model  6/10
✓ Fold1: model  6/10
i Fold1: model  6/10 (predictions)
x Fold1: model  6/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold1: model  7/10
✓ Fold1: model  7/10
i Fold1: model  7/10 (predictions)
x Fold1: model  7/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold1: model  8/10
✓ Fold1: model  8/10
i Fold1: model  8/10 (predictions)
x Fold1: model  8/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold1: model  9/10
✓ Fold1: model  9/10
i Fold1: model  9/10 (predictions)
x Fold1: model  9/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold1: model 10/10
✓ Fold1: model 10/10
i Fold1: model 10/10 (predictions)
x Fold1: model 10/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold2: recipe
! Fold2: recipe: The following variables are not factor vectors and will be ignored: `date_year`, `date_LaborDay...
✓ Fold2: recipe
i Fold2: model  1/10
✓ Fold2: model  1/10
i Fold2: model  1/10 (predictions)
x Fold2: model  1/10 (predictions): Error in strptime(charvec, format, tz = "GMT"): invalid 'x' argument
i Fold2: model  2/10
✓ Fold2: model  2/10
i Fold2: model  2/10 (predictions)
x Fold2: model  2/10 (predictions): Error in strptime(charvec, format, tz = "GMT"): invalid 'x' argument
i Fold2: model  3/10
✓ Fold2: model  3/10
i Fold2: model  3/10 (predictions)
x Fold2: model  3/10 (predictions): Error in strptime(charvec, format, tz = "GMT"): invalid 'x' argument
i Fold2: model  4/10
✓ Fold2: model  4/10
i Fold2: model  4/10 (predictions)
x Fold2: model  4/10 (predictions): Error in strptime(charvec, format, tz = "GMT"): invalid 'x' argument
i Fold2: model  5/10
✓ Fold2: model  5/10
i Fold2: model  5/10 (predictions)
x Fold2: model  5/10 (predictions): Error in strptime(charvec, format, tz = "GMT"): invalid 'x' argument
i Fold2: model  6/10
✓ Fold2: model  6/10
i Fold2: model  6/10 (predictions)
x Fold2: model  6/10 (predictions): Error in strptime(charvec, format, tz = "GMT"): invalid 'x' argument
i Fold2: model  7/10
✓ Fold2: model  7/10
i Fold2: model  7/10 (predictions)
x Fold2: model  7/10 (predictions): Error in strptime(charvec, format, tz = "GMT"): invalid 'x' argument
i Fold2: model  8/10
✓ Fold2: model  8/10
i Fold2: model  8/10 (predictions)
x Fold2: model  8/10 (predictions): Error in strptime(charvec, format, tz = "GMT"): invalid 'x' argument
i Fold2: model  9/10
✓ Fold2: model  9/10
i Fold2: model  9/10 (predictions)
x Fold2: model  9/10 (predictions): Error in strptime(charvec, format, tz = "GMT"): invalid 'x' argument
i Fold2: model 10/10
✓ Fold2: model 10/10
i Fold2: model 10/10 (predictions)
x Fold2: model 10/10 (predictions): Error in strptime(charvec, format, tz = "GMT"): invalid 'x' argument
i Fold3: recipe
! Fold3: recipe: The following variables are not factor vectors and will be ignored: `date_year`, `date_LaborDay...
✓ Fold3: recipe
i Fold3: model  1/10
✓ Fold3: model  1/10
i Fold3: model  1/10 (predictions)
x Fold3: model  1/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold3: model  2/10
✓ Fold3: model  2/10
i Fold3: model  2/10 (predictions)
x Fold3: model  2/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold3: model  3/10
✓ Fold3: model  3/10
i Fold3: model  3/10 (predictions)
x Fold3: model  3/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold3: model  4/10
✓ Fold3: model  4/10
i Fold3: model  4/10 (predictions)
x Fold3: model  4/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold3: model  5/10
✓ Fold3: model  5/10
i Fold3: model  5/10 (predictions)
x Fold3: model  5/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold3: model  6/10
✓ Fold3: model  6/10
i Fold3: model  6/10 (predictions)
x Fold3: model  6/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold3: model  7/10
✓ Fold3: model  7/10
i Fold3: model  7/10 (predictions)
x Fold3: model  7/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold3: model  8/10
✓ Fold3: model  8/10
i Fold3: model  8/10 (predictions)
x Fold3: model  8/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold3: model  9/10
✓ Fold3: model  9/10
i Fold3: model  9/10 (predictions)
x Fold3: model  9/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold3: model 10/10
✓ Fold3: model 10/10
i Fold3: model 10/10 (predictions)
x Fold3: model 10/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold4: recipe
! Fold4: recipe: The following variables are not factor vectors and will be ignored: `date_year`, `date_LaborDay...
✓ Fold4: recipe
i Fold4: model  1/10
✓ Fold4: model  1/10
i Fold4: model  1/10 (predictions)
x Fold4: model  1/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold4: model  2/10
✓ Fold4: model  2/10
i Fold4: model  2/10 (predictions)
x Fold4: model  2/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold4: model  3/10
✓ Fold4: model  3/10
i Fold4: model  3/10 (predictions)
x Fold4: model  3/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold4: model  4/10
✓ Fold4: model  4/10
i Fold4: model  4/10 (predictions)
x Fold4: model  4/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold4: model  5/10
✓ Fold4: model  5/10
i Fold4: model  5/10 (predictions)
x Fold4: model  5/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold4: model  6/10
✓ Fold4: model  6/10
i Fold4: model  6/10 (predictions)
x Fold4: model  6/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold4: model  7/10
✓ Fold4: model  7/10
i Fold4: model  7/10 (predictions)
x Fold4: model  7/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold4: model  8/10
✓ Fold4: model  8/10
i Fold4: model  8/10 (predictions)
x Fold4: model  8/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold4: model  9/10
✓ Fold4: model  9/10
i Fold4: model  9/10 (predictions)
x Fold4: model  9/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold4: model 10/10
✓ Fold4: model 10/10
i Fold4: model 10/10 (predictions)
x Fold4: model 10/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold5: recipe
! Fold5: recipe: The following variables are not factor vectors and will be ignored: `date_year`, `date_LaborDay...
✓ Fold5: recipe
i Fold5: model  1/10
✓ Fold5: model  1/10
i Fold5: model  1/10 (predictions)
x Fold5: model  1/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold5: model  2/10
✓ Fold5: model  2/10
i Fold5: model  2/10 (predictions)
x Fold5: model  2/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold5: model  3/10
✓ Fold5: model  3/10
i Fold5: model  3/10 (predictions)
x Fold5: model  3/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold5: model  4/10
✓ Fold5: model  4/10
i Fold5: model  4/10 (predictions)
x Fold5: model  4/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold5: model  5/10
✓ Fold5: model  5/10
i Fold5: model  5/10 (predictions)
x Fold5: model  5/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold5: model  6/10
✓ Fold5: model  6/10
i Fold5: model  6/10 (predictions)
x Fold5: model  6/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold5: model  7/10
✓ Fold5: model  7/10
i Fold5: model  7/10 (predictions)
x Fold5: model  7/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold5: model  8/10
✓ Fold5: model  8/10
i Fold5: model  8/10 (predictions)
x Fold5: model  8/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold5: model  9/10
✓ Fold5: model  9/10
i Fold5: model  9/10 (predictions)
x Fold5: model  9/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
i Fold5: model 10/10
✓ Fold5: model 10/10
i Fold5: model 10/10 (predictions)
x Fold5: model 10/10 (predictions): Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of...
Warning message:
All models failed in tune_grid(). See the `.notes` column.

Session Info:

R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] magrittr_1.5     forcats_0.4.0    stringr_1.4.0    readr_1.3.1      tidyverse_1.3.0  yardstick_0.0.4 
 [7] workflows_0.1.0  tune_0.0.1       tibble_2.1.3     rsample_0.0.5    tidyr_1.0.0      recipes_0.1.9   
[13] purrr_0.3.3      parsnip_0.0.5    infer_0.5.1      ggplot2_3.2.1    dplyr_0.8.4      dials_0.0.4     
[19] scales_1.1.0     broom_0.5.4      tidymodels_0.1.0 modeldata_0.0.1 

loaded via a namespace (and not attached):
  [1] readxl_1.3.1        backports_1.1.5     tidytext_0.2.2      plyr_1.8.5          igraph_1.2.4.2     
  [6] lazyeval_0.2.2      splines_3.6.1       crosstalk_1.0.0     listenv_0.8.0       SnowballC_0.6.0    
 [11] rstantools_2.0.0    inline_0.3.15       digest_0.6.23       foreach_1.4.7       htmltools_0.4.0    
 [16] rsconnect_0.8.15    fansi_0.4.1         globals_0.12.5      modelr_0.1.5        gower_0.2.1        
 [21] matrixStats_0.55.0  xts_0.11-2          hardhat_0.1.0.9000  prettyunits_1.1.1   colorspace_1.4-1   
 [26] rvest_0.3.5         haven_2.2.0         xfun_0.11           jsonlite_1.6        callr_3.4.0        
 [31] crayon_1.3.4        lme4_1.1-21         zeallot_0.1.0       survival_2.44-1.1   zoo_1.8-6          
 [36] iterators_1.0.12    glue_1.3.1          gtable_0.3.0        ipred_0.9-9         pkgbuild_1.0.6     
 [41] rstan_2.19.2        shape_1.4.4         DBI_1.1.0           miniUI_0.1.1.1      Rcpp_1.0.3         
 [46] xtable_1.8-4        GPfit_1.0-8         stats4_3.6.1        lava_1.6.6          StanHeaders_2.19.0 
 [51] prodlim_2019.11.13  DT_0.10             glmnet_3.0-2        httr_1.4.1          htmlwidgets_1.5.1  
 [56] threejs_0.3.1       ellipsis_0.3.0      pkgconfig_2.0.3     loo_2.1.0           nnet_7.3-12        
 [61] dbplyr_1.4.2        utf8_1.1.4          tidyselect_0.2.5    rlang_0.4.4         DiceDesign_1.8-1   
 [66] reshape2_1.4.3      later_1.0.0         cellranger_1.1.0    munsell_0.5.0       tools_3.6.1        
 [71] cli_2.0.1.9000      generics_0.0.2      ggridges_0.5.1      fastmap_1.0.1       fs_1.3.1           
 [76] processx_3.4.1      knitr_1.26          packrat_0.5.0       future_1.16.0       nlme_3.1-140       
 [81] mime_0.7            rstanarm_2.19.2     xml2_1.2.2          tokenizers_0.2.1    compiler_3.6.1     
 [86] bayesplot_1.7.1     shinythemes_1.1.2   rstudioapi_0.11     reprex_0.3.0        tidyposterior_0.0.2
 [91] lhs_1.0.1           stringi_1.4.5       ps_1.3.0            lattice_0.20-38     Matrix_1.2-17      
 [96] nloptr_1.2.1        markdown_1.1        shinyjs_1.0         vctrs_0.2.1         pillar_1.4.3       
[101] lifecycle_0.1.0     furrr_0.1.0         httpuv_1.5.2        R6_2.4.1            promises_1.1.0     
[106] gridExtra_2.3       janeaustenr_0.1.5   codetools_0.2-16    boot_1.3-22         colourpicker_1.0   
[111] MASS_7.3-51.4       gtools_3.8.1        assertthat_0.2.1    withr_2.1.2         shinystan_2.5.0    
[116] parallel_3.6.1      hms_0.5.2           grid_3.6.1          rpart_4.1-15        timeDate_3043.102  
[121] class_7.3-15        minqa_1.2.4         pROC_1.16.1         tidypredict_0.4.3   shiny_1.4.0        
[126] lubridate_1.7.4     base64enc_0.1-3     dygraphs_1.1.1.6 

topepo commented 4 years ago

One issue is that you are getting variable lists (e.g., recipe_var_numeric) before the recipe has been prepped, so they do not capture the variables that I think you want. For example, recipe_var_date is always going to be just date, since that is the only date variable contained in the recipe before prep.

You should use all_numeric() instead of one_of(recipe_var_numeric) and starts_with("date") instead of recipe_var_date.
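To illustrate why a variable list taken before prep misses the derived columns, here is a minimal sketch. The toy data frame and its column names are made up for illustration and are not the okc data; juice() is used only to match the rest of this thread.

```r
library(recipes)

# Hypothetical toy data (not the okc data from the issue)
toy <- data.frame(
  y    = factor(c("a", "b", "a", "b")),
  x    = c(1, 2, 3, 4),
  date = as.Date("2020-01-01") + 0:3
)

rec <- recipe(y ~ ., data = toy)
rec <- step_date(rec, date)

# Variables the recipe knows about before prep: only the original columns
before_prep <- summary(rec)$variable

# Columns that exist once the recipe has been prepped and applied
after_prep <- names(juice(prep(rec)))

# Columns created by step_date(); none of them can appear in a
# name vector that was built from summary(rec) before prep
setdiff(after_prep, before_prep)
```

A selector such as starts_with("date") is evaluated when the step is prepped, so it should see the derived columns that a stored name vector cannot.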

I wasn't sure what the purpose of these lines was:

  # Time predictors
  step_mutate_at(one_of(recipe_var_date), fn = as.factor) %>%
  step_unknown(date) %>%
  step_mutate_at(one_of(recipe_var_date), fn = as.Date) %>%

I don't know that a factor for each date is feasible for these data (and you make date features later). Plus, the conversion from factor (with infrequent categories) back to a date encoding would fail for many samples.
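As a rough base R illustration of that last point (a sketch with made-up values, assuming a level such as "unknown", the default level name in step_unknown(), ends up among the date values), any level that is not a date string cannot survive the conversion back:

```r
# Dates stored as a factor, with one value replaced by a non-date level
# (hypothetical values, not taken from the okc data)
dates <- factor(c("2012-06-28", "2012-06-29", "unknown"))

# Converting back to Date: the non-date level cannot be parsed
converted <- as.Date(as.character(dates))
converted

# Flags which entries were lost in the round trip
is.na(converted)
```

Missing dates introduced this way may be what the date-handling steps later trip over, although the thread does not confirm the exact mechanism.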

I also added a step_zv(). Without it you'll get errors, since there are no date values on Christmas day. Because that holiday indicator is all zeros, step_normalize() converts every value in it to NA as a result of dividing by zero.
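The divide-by-zero effect can be reproduced in base R. This is only a sketch of the arithmetic with a hypothetical indicator column, not the internals of step_normalize():

```r
# A holiday indicator that is 0 for every row (hypothetical values)
christmas <- rep(0, 5)

# Centering and scaling by hand: the standard deviation is 0,
# so every entry becomes 0 / 0
normalized <- (christmas - mean(christmas)) / sd(christmas)
normalized

# All entries are missing after scaling
is.na(normalized)
```

Removing constant columns before normalizing avoids passing such values on to the model.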

Finally, I changed selectors that I think are intended to capture factors that are not the outcome from all_nominal(), all_predictors() to all_nominal(), -all_outcomes(). The former includes numeric variables too.
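The difference can be pictured with plain set operations. This is an analogy only, assuming that several selectors in one step are combined as a union, which is the behavior described above; the variable sets are hypothetical and loosely modeled on the okc columns.

```r
# Hypothetical variable sets, loosely modeled on the okc columns
nominal    <- c("class", "diet", "location")
predictors <- c("age", "height", "diet", "location", "date")
outcomes   <- "class"

# Analogy for all_nominal(), all_predictors(): everything that is
# nominal OR a predictor, which pulls in the numeric predictors
union_selection <- union(nominal, predictors)
union_selection

# Analogy for all_nominal(), -all_outcomes(): nominal columns
# with the outcome taken out
nominal_predictors <- setdiff(nominal, outcomes)
nominal_predictors
```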

Here is where I got:

library(modeldata)
library(tidymodels)
#> ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────── tidymodels 0.0.4 ──
#> ✓ broom     0.5.4          ✓ recipes   0.1.9     
#> ✓ dials     0.0.4.9000     ✓ rsample   0.0.5     
#> ✓ dplyr     0.8.4          ✓ tibble    2.1.3     
#> ✓ ggplot2   3.2.1          ✓ tune      0.0.1     
#> ✓ infer     0.5.1          ✓ workflows 0.1.0     
#> ✓ parsnip   0.0.5          ✓ yardstick 0.0.5     
#> ✓ purrr     0.3.3
#> ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard()    masks scales::discard()
#> x dplyr::filter()     masks stats::filter()
#> x dplyr::lag()        masks stats::lag()
#> x ggplot2::margin()   masks dials::margin()
#> x recipes::step()     masks stats::step()
#> x recipes::yj_trans() masks scales::yj_trans()
library(tidyverse)
library(magrittr)
#> 
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:tidyr':
#> 
#>     extract
#> The following object is masked from 'package:purrr':
#> 
#>     set_names

data("okc")
colnames(okc) <- tolower(names(okc))

okc <- sample_n(okc, 1000)

strata <- "class"
split <- initial_split(okc, prop = 0.8, strata = strata)
#> Note: Using an external vector in selections is ambiguous.
#> ℹ Use `all_of(strata)` instead of `strata` to silence this message.
#> ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> This message is displayed once per session.

train <- training(split)
test <- testing(split)

cv <- vfold_cv(
  train,
  v = 5,
  repeats = 1,
  strata = strata
)

recipe_info <- train %>%
  recipe() %>%

  ### Roles assignment
  update_role(strata, new_role = "outcome") %>% 
  update_role(-one_of(strata), new_role = "predictors")

recipe <- recipe_info %>% 
  ### Imputation
  # Numeric predictors
  step_medianimpute(all_numeric(), all_predictors()) %>%

  # Categorical predictors
  step_modeimpute(all_nominal(), -all_outcomes()) %>%

  # Time predictors
  # step_mutate_at(one_of(recipe_var_date), fn = as.factor) %>% 
  # step_unknown(date) %>% 
  # step_mutate_at(one_of(recipe_var_date), fn = as.Date) %>% 

  ### Handling time predictors
  step_date(date) %>%
  step_holiday(date) %>% 

  ### Individual transformations (optional)

  ### Lumping infrequent categories (optional)
  step_other(all_nominal(), -all_outcomes(), other = "infrequent") %>% 

  ### Dummyfying (optional features)
  step_dummy(all_nominal(), -all_outcomes()) %>% 

  ### Interactions (optional)

  ### Remove zero-variance predictors since they can't be normalized
  step_zv(all_predictors()) %>% 

  ### Normalization (optional)
  step_normalize(all_numeric()) %>% 

  ### Multivariate transformations (optional)

  ### Removing unnecessary predictors
  step_rm(date)

### Checks

# Check the structure of the input file
prep(recipe) %>% juice() %>% glimpse()
#> Observations: 801
#> Variables: 21
#> $ age                    <dbl> 3.01970931, 0.11936498, 1.01947184, -0.5807181…
#> $ height                 <dbl> -0.1134916, 0.1469863, 0.4074641, -1.4158809, …
#> $ class                  <fct> other, other, other, other, other, other, othe…
#> $ date_year              <dbl> 0.2818077, 0.2818077, 0.2818077, 0.2818077, 0.…
#> $ date_LaborDay          <dbl> -0.03533326, -0.03533326, -0.03533326, -0.0353…
#> $ diet_mostly.anything   <dbl> 0.6770844, 0.6770844, 0.6770844, -1.4750768, -…
#> $ diet_mostly.vegetarian <dbl> -0.2818077, -0.2818077, -0.2818077, -0.2818077…
#> $ diet_strictly.anything <dbl> -0.2969932, -0.2969932, -0.2969932, 3.3628769,…
#> $ diet_infrequent        <dbl> -0.266021, -0.266021, -0.266021, -0.266021, 3.…
#> $ location_oakland       <dbl> -0.3796124, -0.3796124, -0.3796124, 2.6309771,…
#> $ location_san.francisco <dbl> -1.0259291, 0.9735094, -1.0259291, -1.0259291,…
#> $ location_infrequent    <dbl> 1.5234010, -0.6556065, 1.5234010, -0.6556065, …
#> $ date_dow_Mon           <dbl> -0.2994726, -0.2994726, -0.2994726, -0.2994726…
#> $ date_dow_Tue           <dbl> -0.3397861, 2.9393541, -0.3397861, -0.3397861,…
#> $ date_dow_Wed           <dbl> -0.3212115, -0.3212115, -0.3212115, -0.3212115…
#> $ date_dow_Thu           <dbl> -0.3817598, -0.3817598, -0.3817598, -0.3817598…
#> $ date_dow_Fri           <dbl> -0.4953952, -0.4953952, -0.4953952, -0.4953952…
#> $ date_dow_Sat           <dbl> 1.6446739, -0.6072642, 1.6446739, -0.6072642, …
#> $ date_month_Jun         <dbl> 0.6420149, 0.6420149, 0.6420149, -1.5556516, -…
#> $ date_month_Jul         <dbl> -0.2466801, -0.2466801, -0.2466801, 4.0487718,…
#> $ date_month_infrequent  <dbl> -0.449944, -0.449944, -0.449944, -0.449944, -0…

model <- logistic_reg(
  penalty = tune(),
  mixture = tune()
) %>%
  set_mode("classification") %>%
  set_engine("glmnet")

grid <- grid_max_entropy(
  penalty(),
  mixture(),
  size = 10,
  variogram_range = 1
)

workflow <- workflow() %>%
  add_recipe(recipe) %>%
  add_model(model)

tuning <- tune_grid(
  workflow,
  resamples = cv,
  grid = grid,
  metrics = metric_set(roc_auc),
  control = control_grid(verbose = TRUE)
)
#> i Fold1: recipe
#> ✓ Fold1: recipe
#> i Fold1: model  1/10
#> ✓ Fold1: model  1/10
#> i Fold1: model  1/10 (predictions)
#> i Fold1: model  2/10
#> ✓ Fold1: model  2/10
#> i Fold1: model  2/10 (predictions)
#> i Fold1: model  3/10
#> ✓ Fold1: model  3/10
#> i Fold1: model  3/10 (predictions)
#> i Fold1: model  4/10
#> ✓ Fold1: model  4/10
#> i Fold1: model  4/10 (predictions)
#> i Fold1: model  5/10
#> ✓ Fold1: model  5/10
#> i Fold1: model  5/10 (predictions)
#> i Fold1: model  6/10
#> ✓ Fold1: model  6/10
#> i Fold1: model  6/10 (predictions)
#> i Fold1: model  7/10
#> ✓ Fold1: model  7/10
#> i Fold1: model  7/10 (predictions)
#> i Fold1: model  8/10
#> ✓ Fold1: model  8/10
#> i Fold1: model  8/10 (predictions)
#> i Fold1: model  9/10
#> ✓ Fold1: model  9/10
#> i Fold1: model  9/10 (predictions)
#> i Fold1: model 10/10
#> ✓ Fold1: model 10/10
#> i Fold1: model 10/10 (predictions)
#> i Fold2: recipe
#> ✓ Fold2: recipe
#> i Fold2: model  1/10
#> ✓ Fold2: model  1/10
#> i Fold2: model  1/10 (predictions)
#> i Fold2: model  2/10
#> ✓ Fold2: model  2/10
#> i Fold2: model  2/10 (predictions)
#> i Fold2: model  3/10
#> ✓ Fold2: model  3/10
#> i Fold2: model  3/10 (predictions)
#> i Fold2: model  4/10
#> ✓ Fold2: model  4/10
#> i Fold2: model  4/10 (predictions)
#> i Fold2: model  5/10
#> ✓ Fold2: model  5/10
#> i Fold2: model  5/10 (predictions)
#> i Fold2: model  6/10
#> ✓ Fold2: model  6/10
#> i Fold2: model  6/10 (predictions)
#> i Fold2: model  7/10
#> ✓ Fold2: model  7/10
#> i Fold2: model  7/10 (predictions)
#> i Fold2: model  8/10
#> ✓ Fold2: model  8/10
#> i Fold2: model  8/10 (predictions)
#> i Fold2: model  9/10
#> ✓ Fold2: model  9/10
#> i Fold2: model  9/10 (predictions)
#> i Fold2: model 10/10
#> ✓ Fold2: model 10/10
#> i Fold2: model 10/10 (predictions)
#> i Fold3: recipe
#> ✓ Fold3: recipe
#> i Fold3: model  1/10
#> ✓ Fold3: model  1/10
#> i Fold3: model  1/10 (predictions)
#> i Fold3: model  2/10
#> ✓ Fold3: model  2/10
#> i Fold3: model  2/10 (predictions)
#> i Fold3: model  3/10
#> ✓ Fold3: model  3/10
#> i Fold3: model  3/10 (predictions)
#> i Fold3: model  4/10
#> ✓ Fold3: model  4/10
#> i Fold3: model  4/10 (predictions)
#> i Fold3: model  5/10
#> ✓ Fold3: model  5/10
#> i Fold3: model  5/10 (predictions)
#> i Fold3: model  6/10
#> ✓ Fold3: model  6/10
#> i Fold3: model  6/10 (predictions)
#> i Fold3: model  7/10
#> ✓ Fold3: model  7/10
#> i Fold3: model  7/10 (predictions)
#> i Fold3: model  8/10
#> ✓ Fold3: model  8/10
#> i Fold3: model  8/10 (predictions)
#> i Fold3: model  9/10
#> ✓ Fold3: model  9/10
#> i Fold3: model  9/10 (predictions)
#> i Fold3: model 10/10
#> ✓ Fold3: model 10/10
#> i Fold3: model 10/10 (predictions)
#> i Fold4: recipe
#> ✓ Fold4: recipe
#> i Fold4: model  1/10
#> ✓ Fold4: model  1/10
#> i Fold4: model  1/10 (predictions)
#> i Fold4: model  2/10
#> ✓ Fold4: model  2/10
#> i Fold4: model  2/10 (predictions)
#> i Fold4: model  3/10
#> ✓ Fold4: model  3/10
#> i Fold4: model  3/10 (predictions)
#> i Fold4: model  4/10
#> ✓ Fold4: model  4/10
#> i Fold4: model  4/10 (predictions)
#> i Fold4: model  5/10
#> ✓ Fold4: model  5/10
#> i Fold4: model  5/10 (predictions)
#> i Fold4: model  6/10
#> ✓ Fold4: model  6/10
#> i Fold4: model  6/10 (predictions)
#> i Fold4: model  7/10
#> ✓ Fold4: model  7/10
#> i Fold4: model  7/10 (predictions)
#> i Fold4: model  8/10
#> ✓ Fold4: model  8/10
#> i Fold4: model  8/10 (predictions)
#> i Fold4: model  9/10
#> ✓ Fold4: model  9/10
#> i Fold4: model  9/10 (predictions)
#> i Fold4: model 10/10
#> ✓ Fold4: model 10/10
#> i Fold4: model 10/10 (predictions)
#> i Fold5: recipe
#> ✓ Fold5: recipe
#> i Fold5: model  1/10
#> ✓ Fold5: model  1/10
#> i Fold5: model  1/10 (predictions)
#> i Fold5: model  2/10
#> ✓ Fold5: model  2/10
#> i Fold5: model  2/10 (predictions)
#> i Fold5: model  3/10
#> ✓ Fold5: model  3/10
#> i Fold5: model  3/10 (predictions)
#> i Fold5: model  4/10
#> ✓ Fold5: model  4/10
#> i Fold5: model  4/10 (predictions)
#> i Fold5: model  5/10
#> ✓ Fold5: model  5/10
#> i Fold5: model  5/10 (predictions)
#> i Fold5: model  6/10
#> ✓ Fold5: model  6/10
#> i Fold5: model  6/10 (predictions)
#> i Fold5: model  7/10
#> ✓ Fold5: model  7/10
#> i Fold5: model  7/10 (predictions)
#> i Fold5: model  8/10
#> ✓ Fold5: model  8/10
#> i Fold5: model  8/10 (predictions)
#> i Fold5: model  9/10
#> ✓ Fold5: model  9/10
#> i Fold5: model  9/10 (predictions)
#> i Fold5: model 10/10
#> ✓ Fold5: model 10/10
#> i Fold5: model 10/10 (predictions)

Created on 2020-02-27 by the reprex package (v0.3.0)

konradsemsch commented 4 years ago

@topepo, let me explain a bit further what I'm working on:

I wasn't trying to come up with the best recipe for this particular problem. Rather, I'm trying to work with recipes more programmatically, so that a recipe can be abstracted from the underlying data and variable names (hence the vectors coming from summary(recipe)) and applied to different datasets with a single function, without specifying all the details. I used this particular dataset only because it contains all the different data types I need to start writing 'universal' recipe blueprints.

Btw: my original piece of code worked when I removed tuning, switched to glm as the engine and simply ran fit() on the workflow. Therefore, I still think that the problem is with glmnet/tune and not my recipe.

  1. Usage of recipe_var_ - I want to distinguish between the original variables and the new ones created through the recipe. That gives me better control over what I would like to do with them in the course of the recipe. For instance, I do not want to apply step_dummy to all variables (incl. newly created binary variables that were factors before), but only to the very original numeric ones. The bake then looks much more elegant and intuitive.

  2. Using step_zv() - that wasn't necessary in my version of the recipe since, as I mentioned in 1., I was applying normalization only to the very original numeric variables and not to the ones created in the course of the recipe. But generally speaking you're obviously right, and I would also include that in my 'programmatic blueprint' to handle such cases.
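One possible way to keep original and derived columns apart without stored name vectors is step ordering. The sketch below is not something proposed in this thread: blueprint() is a hypothetical helper, the toy data are made up, and it assumes that a step only sees the columns that exist at the point where it runs.

```r
library(recipes)

# Hypothetical helper: builds the same preprocessing for any data frame,
# given only the name of the outcome column
blueprint <- function(data, outcome) {
  rec <- recipe(reformulate(".", response = outcome), data = data)
  # Normalize first, so only the original numeric predictors are scaled
  rec <- step_normalize(rec, all_numeric(), -all_outcomes())
  # Dummy columns are created afterwards and keep their 0/1 coding
  rec <- step_dummy(rec, all_nominal(), -all_outcomes())
  rec
}

# Hypothetical toy data (not the okc data from the issue)
toy <- data.frame(
  y = factor(c("a", "b", "a", "b")),
  x = c(1, 2, 3, 4),
  g = factor(c("u", "v", "u", "v"))
)

baked <- juice(prep(blueprint(toy, "y")))
baked
```

Here x is centered and scaled while the dummy column derived from g stays as 0/1, because it did not exist yet when step_normalize() was applied.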

Finally, I changed selectors that I think are intended to capture factors that are not the outcome from all_nominal(), all_predictors() to all_nominal(), -all_outcomes(). The former includes numeric variables too.

I was wondering about this behaviour and actually opened another issue about it: https://github.com/tidymodels/recipes/issues/471. Isn't that a bit counter-intuitive? To me it looks as if the behaviour of combining selectors is not stable, since only predictors should be selected when all_predictors() comes into play, right?

konradsemsch commented 4 years ago

Ok, I modified my original recipe and now it works with glmnet. What makes sense:

  1. Handling any time predictors at the very beginning, so that they can be handled by the numeric/nominal imputation straight away.

  2. I had to use -all_outcomes() in step_medianimpute(), which I find a little bit weird. I had used all_predictors() before, and it seems as if the recipe was picking up the target variable there.

# Building the actual recipe
recipe <- recipe_info %>% 

  ### Handling time predictors
  step_date(one_of(recipe_var_date)) %>%
  step_holiday(one_of(recipe_var_date)) %>% 
  step_rm(has_type("date")) %>% 

  ### Imputation
  # Numeric predictors
  step_medianimpute(all_numeric(), -all_outcomes()) %>%

  # Categorical predictors
  step_modeimpute(all_nominal(), -all_outcomes()) %>%

  ### Lumping infrequent categories (optional)
  step_other(all_nominal(), -all_outcomes(), other = "infrequent") %>% 

  ### Removing zero-variance predictors (optional)

  ### Individual transformations (optional)

  ### Dummyfying (optional features)
  step_dummy(all_nominal(), -all_outcomes()) %>% 

  ### Interactions (optional)

  ### Normalization (optional)
  step_normalize(one_of(recipe_var_numeric)) 

  ### Multivariate transformations (optional)

  ### Removing unnecessary predictors

  ### Checks

github-actions[bot] commented 3 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.