tidymodels / recipes

Pipeable steps for feature engineering and data preprocessing to prepare for modeling
https://recipes.tidymodels.org

recipe steps fail when running in parallel #847

Closed markjrieke closed 3 years ago

markjrieke commented 3 years ago

Hi! I ran into an issue when fitting resamples with a recipe that uses bestNormalize::step_best_normalize(). I get the error: Error in UseMethod("prep"): no applicable method for 'prep' applied to an object of class "c('step_best_normalize', 'step')". I came across an issue in tune suggesting that this error is caused by running in parallel, and when I ran sequentially, it worked! That issue (& the recipes changelog) mention that the problem should be resolved in recipes ver 0.1.15. I have ver 0.1.16 installed, so I thought I'd point out a potential remaining bug!

Here's a manual recreation of the error (it's pretty late as I'm writing this, so I didn't want to wait for reprex to eval):

library(tidymodels)

# splits & whatnot
diamonds_split <- initial_split(diamonds)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)

diamonds_folds <- vfold_cv(diamonds_train)

# runs!
linear_mod <- 
  linear_reg() %>%
  set_engine("lm")

linear_rec <-
  recipe(price ~ ., data = diamonds_train) %>%
  bestNormalize::step_best_normalize(carat) %>%
  step_dummy(all_nominal_predictors())

linear_wf <-
  workflow() %>%
  add_model(linear_mod) %>%
  add_recipe(linear_rec)

linear_rs <-
  fit_resamples(
    linear_wf,
    diamonds_folds
  )

# fails!
doParallel::registerDoParallel()

linear_rs <-
  fit_resamples(
    linear_wf,
    diamonds_folds
  )

juliasilge commented 3 years ago

I don't see an error on this, running in parallel:

library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip

diamonds_split <- initial_split(diamonds)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)

diamonds_folds <- vfold_cv(diamonds_train, v = 3)

linear_mod <- 
  linear_reg() %>%
  set_engine("lm")

linear_rec <-
  recipe(price ~ ., data = diamonds_train) %>%
  bestNormalize::step_best_normalize(carat) %>%
  step_dummy(all_nominal_predictors())

linear_wf <- workflow(linear_rec, linear_mod)
doParallel::registerDoParallel()
fit_resamples(linear_wf, diamonds_folds)
#> # Resampling results
#> # 3-fold cross-validation 
#> # A tibble: 3 × 4
#>   splits                id    .metrics         .notes          
#>   <list>                <chr> <list>           <list>          
#> 1 <split [26970/13485]> Fold1 <tibble [2 × 4]> <tibble [0 × 1]>
#> 2 <split [26970/13485]> Fold2 <tibble [2 × 4]> <tibble [0 × 1]>
#> 3 <split [26970/13485]> Fold3 <tibble [2 × 4]> <tibble [0 × 1]>

Created on 2021-11-04 by the reprex package (v2.0.1)

Can you share some more information on your setup? Can you run the reprex above and include session_info = TRUE so we can check out what package versions you are using?
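For anyone following along, here is a minimal sketch of how the requested output can be produced. It assumes the reprex package is installed. The code inside the braces is only a placeholder for the failing example, and `venue` and `html_preview` are set here just to keep the call non-interactive.

```r
library(reprex)

# Render a reprex and append the session info to it.
# The body in braces is a placeholder; put the failing code there.
out <- reprex(
  {
    x <- 1:3
    mean(x)
  },
  session_info = TRUE,   # adds a session info section to the output
  venue = "gh",          # GitHub-flavored markdown
  html_preview = FALSE   # do not open a preview window
)

# `out` is a character vector holding the rendered markdown lines
```

The session info section lists the operating system and package versions, which is what makes platform-specific problems like this one easier to spot.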

markjrieke commented 3 years ago

Hi Julia, thanks for checking in - here's the reprex:

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip

diamonds_split <- initial_split(diamonds)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)

diamonds_folds <- vfold_cv(diamonds_train, v = 3)

linear_mod <-
  linear_reg() %>%
  set_engine("lm")

linear_rec <-
  recipe(price ~ ., data = diamonds_train) %>%
  bestNormalize::step_best_normalize(carat) %>%
  step_dummy(all_nominal_predictors())

linear_wf <-
  workflow() %>%
  add_model(linear_mod) %>%
  add_recipe(linear_rec)

doParallel::registerDoParallel()

linear_rs <-
  fit_resamples(
    linear_wf,
    diamonds_folds
  )
#> Warning: All models failed. See the `.notes` column.

linear_rs
#> Warning: This tuning result has notes. Example notes on model fitting include:
#> preprocessor 1/1: Error in UseMethod("prep"): no applicable method for 'prep' applied to an object of class "c('step_best_normalize', 'step')"
#> preprocessor 1/1: Error in UseMethod("prep"): no applicable method for 'prep' applied to an object of class "c('step_best_normalize', 'step')"
#> preprocessor 1/1: Error in UseMethod("prep"): no applicable method for 'prep' applied to an object of class "c('step_best_normalize', 'step')"
#> # Resampling results
#> # 3-fold cross-validation 
#> # A tibble: 3 x 4
#>   splits                id    .metrics .notes          
#>   <list>                <chr> <list>   <list>          
#> 1 <split [26970/13485]> Fold1 <NULL>   <tibble [1 x 1]>
#> 2 <split [26970/13485]> Fold2 <NULL>   <tibble [1 x 1]>
#> 3 <split [26970/13485]> Fold3 <NULL>   <tibble [1 x 1]>

Created on 2021-11-04 by the reprex package (v2.0.1)
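The "no applicable method" message in the notes is what R reports when S3 dispatch cannot find a method for an object's class in the session doing the work. One plausible reading (an assumption, not something confirmed in this thread) is that the parallel workers are fresh R sessions in which the package providing the prep() method has not been loaded. The sketch below reproduces that situation with base R's parallel package only. The generic `prep_toy`, its method, and the class `my_step` are hypothetical stand-ins, not part of recipes or bestNormalize, and the code is meant to be run at the top level of a script.

```r
library(parallel)

# Hypothetical generic and method, defined only in the main session
prep_toy <- function(x, ...) UseMethod("prep_toy")
prep_toy.my_step <- function(x, ...) "prepped my_step"

obj <- structure(list(), class = c("my_step", "step"))

# Two PSOCK workers; each one is a fresh R session
cl <- makePSOCKcluster(2)

# Ship only the generic: the workers know nothing about the method yet
clusterExport(cl, "prep_toy")
res_missing <- tryCatch(
  clusterCall(cl, function(x) prep_toy(x), obj),
  error = function(e) conditionMessage(e)
)

# Once the method also exists on the workers, dispatch succeeds
clusterExport(cl, "prep_toy.my_step")
res_ok <- clusterCall(cl, function(x) prep_toy(x), obj)

stopCluster(cl)

# res_missing holds the error message raised on the workers;
# res_ok is a list with one result per worker
```

With a real package, the equivalent of the second `clusterExport()` call would be loading that package on each worker so its methods are registered there.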

Session info

``` r
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value
#>  version  R version 4.1.0 (2021-05-18)
#>  os       Windows 10 x64
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.1252
#>  ctype    English_United States.1252
#>  tz       America/Chicago
#>  date     2021-11-04
#>
#> - Packages -------------------------------------------------------------------
#>  package       * version    date       lib source
#>  assertthat      0.2.1      2019-03-21 [1] CRAN (R 4.1.0)
#>  backports       1.2.1      2020-12-09 [1] CRAN (R 4.1.0)
#>  BBmisc          1.11       2017-03-10 [1] CRAN (R 4.1.0)
#>  bestNormalize   1.8.2      2021-09-16 [1] CRAN (R 4.1.1)
#>  broom         * 0.7.7      2021-06-13 [1] CRAN (R 4.1.0)
#>  butcher         0.1.5      2021-06-28 [1] CRAN (R 4.1.1)
#>  checkmate       2.0.0      2020-02-06 [1] CRAN (R 4.1.0)
#>  class           7.3-19     2021-05-03 [1] CRAN (R 4.1.0)
#>  cli             3.0.1      2021-07-17 [1] CRAN (R 4.1.0)
#>  codetools       0.2-18     2020-11-04 [1] CRAN (R 4.1.0)
#>  colorspace      2.0-1      2021-05-04 [1] CRAN (R 4.1.0)
#>  crayon          1.4.1      2021-02-08 [1] CRAN (R 4.1.0)
#>  data.table      1.14.0     2021-02-21 [1] CRAN (R 4.1.0)
#>  DBI             1.1.1      2021-01-15 [1] CRAN (R 4.1.0)
#>  dials         * 0.0.9      2020-09-16 [1] CRAN (R 4.1.0)
#>  DiceDesign      1.9        2021-02-13 [1] CRAN (R 4.1.0)
#>  digest          0.6.27     2020-10-24 [1] CRAN (R 4.1.0)
#>  doParallel      1.0.16     2020-10-16 [1] CRAN (R 4.1.0)
#>  doRNG           1.8.2      2020-01-27 [1] CRAN (R 4.1.1)
#>  dplyr         * 1.0.7      2021-06-18 [1] CRAN (R 4.1.0)
#>  ellipsis        0.3.2      2021-04-29 [1] CRAN (R 4.1.0)
#>  evaluate        0.14       2019-05-28 [1] CRAN (R 4.1.0)
#>  fansi           0.5.0      2021-05-25 [1] CRAN (R 4.1.0)
#>  fastmatch       1.1-3      2021-07-23 [1] CRAN (R 4.1.0)
#>  FNN             1.1.3      2019-02-15 [1] CRAN (R 4.1.0)
#>  foreach         1.5.1      2020-10-15 [1] CRAN (R 4.1.0)
#>  fs              1.5.0      2020-07-31 [1] CRAN (R 4.1.0)
#>  furrr           0.2.3      2021-06-25 [1] CRAN (R 4.1.0)
#>  future          1.21.0     2020-12-10 [1] CRAN (R 4.1.0)
#>  generics        0.1.0      2020-10-31 [1] CRAN (R 4.1.0)
#>  ggplot2       * 3.3.4      2021-06-16 [1] CRAN (R 4.1.0)
#>  globals         0.14.0     2020-11-22 [1] CRAN (R 4.1.0)
#>  glue            1.4.2      2020-08-27 [1] CRAN (R 4.1.0)
#>  gower           0.2.2      2020-06-23 [1] CRAN (R 4.1.0)
#>  GPfit           1.0-8      2019-02-08 [1] CRAN (R 4.1.0)
#>  gtable          0.3.0      2019-03-25 [1] CRAN (R 4.1.0)
#>  highr           0.9        2021-04-16 [1] CRAN (R 4.1.0)
#>  htmltools       0.5.1.1    2021-01-22 [1] CRAN (R 4.1.0)
#>  infer         * 0.5.4      2021-01-13 [1] CRAN (R 4.1.0)
#>  ipred           0.9-11     2021-03-12 [1] CRAN (R 4.1.0)
#>  iterators       1.0.13     2020-10-15 [1] CRAN (R 4.1.0)
#>  knitr           1.33       2021-04-24 [1] CRAN (R 4.1.0)
#>  lattice         0.20-44    2021-05-02 [1] CRAN (R 4.1.0)
#>  lava            1.6.9      2021-03-11 [1] CRAN (R 4.1.0)
#>  lhs             1.1.1      2020-10-05 [1] CRAN (R 4.1.0)
#>  lifecycle       1.0.0      2021-02-15 [1] CRAN (R 4.1.0)
#>  listenv         0.8.0      2019-12-05 [1] CRAN (R 4.1.0)
#>  lubridate       1.7.10     2021-02-26 [1] CRAN (R 4.1.0)
#>  magrittr        2.0.1      2020-11-17 [1] CRAN (R 4.1.0)
#>  MASS            7.3-54     2021-05-03 [1] CRAN (R 4.1.0)
#>  Matrix          1.3-3      2021-05-04 [1] CRAN (R 4.1.0)
#>  mlr             2.19.0     2021-02-22 [1] CRAN (R 4.1.0)
#>  modeldata     * 0.1.0      2020-10-22 [1] CRAN (R 4.1.0)
#>  munsell         0.5.0      2018-06-12 [1] CRAN (R 4.1.0)
#>  nnet            7.3-16     2021-05-03 [1] CRAN (R 4.1.0)
#>  parallelly      1.27.0     2021-07-19 [1] CRAN (R 4.1.0)
#>  parallelMap     1.5.1      2021-06-28 [1] CRAN (R 4.1.0)
#>  ParamHelpers    1.14       2020-03-24 [1] CRAN (R 4.1.0)
#>  parsnip       * 0.1.6      2021-05-27 [1] CRAN (R 4.1.0)
#>  pillar          1.6.2      2021-07-29 [1] CRAN (R 4.1.0)
#>  pkgconfig       2.0.3      2019-09-22 [1] CRAN (R 4.1.0)
#>  plyr            1.8.6      2020-03-03 [1] CRAN (R 4.1.0)
#>  pROC            1.17.0.1   2021-01-13 [1] CRAN (R 4.1.0)
#>  prodlim         2019.11.13 2019-11-17 [1] CRAN (R 4.1.0)
#>  purrr         * 0.3.4      2020-04-17 [1] CRAN (R 4.1.0)
#>  R6              2.5.0      2020-10-28 [1] CRAN (R 4.1.0)
#>  RANN            2.6.1      2019-01-08 [1] CRAN (R 4.1.0)
#>  Rcpp            1.0.7      2021-07-07 [1] CRAN (R 4.1.0)
#>  recipes       * 0.1.16     2021-04-16 [1] CRAN (R 4.1.0)
#>  reprex          2.0.1      2021-08-05 [1] CRAN (R 4.1.1)
#>  rlang           0.4.11     2021-04-30 [1] CRAN (R 4.1.0)
#>  rmarkdown       2.10       2021-08-06 [1] CRAN (R 4.1.0)
#>  rngtools        1.5.2      2021-09-20 [1] CRAN (R 4.1.1)
#>  ROSE            0.0-4      2021-06-14 [1] CRAN (R 4.1.0)
#>  rpart           4.1-15     2019-04-12 [1] CRAN (R 4.1.0)
#>  rsample       * 0.1.0      2021-05-08 [1] CRAN (R 4.1.0)
#>  rstudioapi      0.13       2020-11-12 [1] CRAN (R 4.1.0)
#>  scales        * 1.1.1      2020-05-11 [1] CRAN (R 4.1.0)
#>  sessioninfo     1.1.1      2018-11-05 [1] CRAN (R 4.1.0)
#>  stringi         1.6.1      2021-05-10 [1] CRAN (R 4.1.0)
#>  stringr         1.4.0      2019-02-10 [1] CRAN (R 4.1.0)
#>  survival        3.2-11     2021-04-26 [1] CRAN (R 4.1.0)
#>  themis        * 0.1.4      2021-06-12 [1] CRAN (R 4.1.0)
#>  tibble        * 3.1.2      2021-05-16 [1] CRAN (R 4.1.0)
#>  tidymodels    * 0.1.3      2021-04-19 [1] CRAN (R 4.1.0)
#>  tidyr         * 1.1.3      2021-03-03 [1] CRAN (R 4.1.0)
#>  tidyselect      1.1.1      2021-04-30 [1] CRAN (R 4.1.0)
#>  timeDate        3043.102   2018-02-21 [1] CRAN (R 4.1.0)
#>  tune          * 0.1.5      2021-04-23 [1] CRAN (R 4.1.0)
#>  unbalanced      2.0        2015-06-26 [1] CRAN (R 4.1.0)
#>  usethis         2.0.1      2021-02-10 [1] CRAN (R 4.1.0)
#>  utf8            1.2.1      2021-03-12 [1] CRAN (R 4.1.0)
#>  vctrs           0.3.8      2021-04-29 [1] CRAN (R 4.1.0)
#>  withr           2.4.2      2021-04-18 [1] CRAN (R 4.1.0)
#>  workflows     * 0.2.2      2021-03-10 [1] CRAN (R 4.1.0)
#>  workflowsets  * 0.0.2      2021-04-16 [1] CRAN (R 4.1.0)
#>  xfun            0.25       2021-08-06 [1] CRAN (R 4.1.0)
#>  yaml            2.2.1      2020-02-01 [1] CRAN (R 4.1.0)
#>  yardstick     * 0.0.8      2021-03-28 [1] CRAN (R 4.1.0)
#>
#> [1] C:/Users/E1735399/Documents/R/R-4.1.0/library
```

I haven't run a blanket package update, but I started working with a new laptop in July, so everything should be fairly up-to-date.

juliasilge commented 3 years ago

I see that you are on Windows, and I don't believe that doParallel::registerDoParallel() will work on Windows like that. Have you tried something like PSOCK clusters, as shown here?

markjrieke commented 3 years ago

Hey Julia - I haven’t, but I’ll give it a shot when I get a chance! Do you mind elaborating on what you mean by doParallel::registerDoParallel() not being able to work on Windows like that? I’ve used doParallel::registerDoParallel() w/o issue before & the documentation does mention Windows use. Just want to make sure I can avoid any similar issues in the future! Thanks again for all your help!!

juliasilge commented 3 years ago

I am extremely not an expert on parallel processing on Windows, I'm afraid, so I am not the best person to answer in detail. Some things to try:

library(doParallel)
cl <- makePSOCKcluster(2)  ## or how many you want
registerDoParallel(cl)

I believe most people who use tidymodels on Windows have more success with PSOCK clusters for parallelization.
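Put together with the example from this thread, the suggestion might look like the sketch below. It rests on a few assumptions: that tidymodels, doParallel, and bestNormalize are installed; that the tune version in use still runs resamples through foreach (as the versions listed in this thread do); and that loading bestNormalize on each worker with `clusterEvalQ()` is what makes its prep() method available there. That last step is a guess at a workaround, not something confirmed above. As far as I understand the doParallel documentation, calling registerDoParallel() with no arguments uses forked workers on Unix-like systems but an implicitly created cluster of fresh R sessions on Windows, which would be consistent with the example running on one machine and failing on another.

```r
library(tidymodels)
library(doParallel)   # also attaches foreach and parallel

# Start two PSOCK workers; each one is a fresh R session
cl <- makePSOCKcluster(2)

# Assumption: loading bestNormalize on every worker registers its S3
# methods (including the one for prep()) before any resample is processed
invisible(clusterEvalQ(cl, library(bestNormalize)))

registerDoParallel(cl)

diamonds_folds <- vfold_cv(diamonds, v = 3)

linear_rec <-
  recipe(price ~ ., data = diamonds) %>%
  bestNormalize::step_best_normalize(carat) %>%
  step_dummy(all_nominal_predictors())

linear_wf <-
  workflow() %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  add_recipe(linear_rec)

linear_rs <- fit_resamples(linear_wf, diamonds_folds)

# Release the workers and return foreach to sequential execution
stopCluster(cl)
registerDoSEQ()
```

Creating the cluster explicitly also makes it possible to stop it when the work is done, which the bare registerDoParallel() call does not offer in the same way.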

markjrieke commented 3 years ago

Hey Julia - no worries; definitely appreciate all the advice! I'll look into PSOCK clusters in the future for running in parallel. I'll go ahead & close since it looks like this is unrelated to recipes. Thanks again!!

github-actions[bot] commented 2 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.