topepo / caret

caret (Classification And Regression Training) R package that contains misc functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

something went wrong with recipes object as the input for train() #804

Closed: shinhongwu closed this issue 6 years ago

shinhongwu commented 6 years ago

Dear Max,

I am very happy to learn that Hadley and you have developed the new recipes package to deal with the hideous preprocessing and feature engineering needed for modeling. I noticed in the recipes GitHub repository that your next step is to integrate recipes more closely with caret. Shortly afterwards, I found that caret 6.0-73 had been released with an example that uses a recipe object as the input to train(). I tried it this morning and got the following error:

Error in train.default(cox2_recipe, data = cox2, method = "lm", trControl = trainControl(method = "cv")) : argument "y" is missing, with no default

I don't know whether this is a problem with my computer or a real bug. My code and session info follow.

> library(recipes)
> data(cox2)
> cox2 <- cox2Descr
> cox2$potency <- cox2IC50
> 
> cox2_recipe <- recipe(potency ~ ., data = cox2) %>%
+   ## Log the outcome
+   step_log(potency, base = 10) %>%
+   ## Remove sparse and unbalanced predictors
+   step_nzv(all_predictors()) %>%
+   ## Surface area predictors are highly correlated so
+   ## conduct PCA just on these.
+   step_pca(contains("VSA"), prefix = "surf_area_",
+            threshold = .95) %>%
+   ## Remove other highly correlated predictors
+   step_corr(all_predictors(), -starts_with("surf_area_"),
+             threshold = .90) %>%
+   ## Center and scale all of the non-PCA predictors
+   step_center(all_predictors(), -starts_with("surf_area_")) %>%
+   step_scale(all_predictors(), -starts_with("surf_area_"))
> cox2_lm <- train(cox2_recipe,
+                  data = cox2,
+                  method = "lm",
+                  trControl = trainControl(method = "cv"))
Error in train.default(cox2_recipe, data = cox2, method = "lm", trControl = trainControl(method = "cv")) : 
  argument "y" is missing, with no default

### Session Info:
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Chinese (Traditional)_Taiwan.950  LC_CTYPE=Chinese (Traditional)_Taiwan.950   
[3] LC_MONETARY=Chinese (Traditional)_Taiwan.950 LC_NUMERIC=C                                
[5] LC_TIME=Chinese (Traditional)_Taiwan.950    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] recipes_0.1.1.9000   broom_0.4.2          dplyr_0.5.0          caret_6.0-73         ggplot2_2.2.1       
[6] lattice_0.20-34      RevoUtilsMath_10.0.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9           lubridate_1.6.0       tidyr_0.6.1           class_7.3-14         
 [5] assertthat_0.1        ipred_0.9-6           psych_1.6.12          foreach_1.4.3        
 [9] R6_2.2.0              plyr_1.8.4            MatrixModels_0.4-1    stats4_3.3.3         
[13] rlang_0.1.4.9000      lazyeval_0.2.0        minqa_1.2.4           SparseM_1.76         
[17] car_2.1-4             nloptr_1.0.4          kernlab_0.9-25        rpart_4.1-10         
[21] Matrix_1.2-8          splines_3.3.3         RevoUtils_10.0.3      lme4_1.1-12          
[25] CVST_0.2-1            ddalpha_1.2.1         gower_0.1.2           stringr_1.2.0        
[29] foreign_0.8-67        munsell_0.4.3         mnormt_1.5-5          dimRed_0.1.0.9001    
[33] mgcv_1.8-17           nnet_7.3-12           tidyselect_0.2.3.9000 tibble_1.2           
[37] prodlim_1.6.1         DRR_0.0.2             codetools_0.2-15      RcppRoll_0.2.2       
[41] MASS_7.3-45           ModelMetrics_1.1.0    grid_3.3.3            nlme_3.1-131         
[45] gtable_0.2.0          DBI_0.6               magrittr_1.5          pROC_1.9.1           
[49] scales_0.4.1          stringi_1.1.2         reshape2_1.4.2        timeDate_3012.100    
[53] robustbase_0.92-7     lava_1.4.7            iterators_1.0.8       tools_3.3.3          
[57] glue_1.2.0.9000       DEoptimR_1.0-8        purrr_0.2.2           parallel_3.3.3       
[61] pbkrtest_0.4-7        survival_2.40-1       yaml_2.1.14           colorspace_1.3-2     
[65] knitr_1.15.1          quantreg_5.29    

maurice

topepo commented 6 years ago

I'm not sure. I tried it with the current recipes and things worked:

> library(caret)
Loading required package: lattice
Loading required package: ggplot2
> library(recipes)
Loading required package: dplyr

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Loading required package: broom

Attaching package: ‘recipes’

The following object is masked from ‘package:stats’:

    step

> data(cox2)
> cox2 <- cox2Descr
> cox2$potency <- cox2IC50
> 
> cox2_recipe <- recipe(potency ~ ., data = cox2) %>%
+   ## Log the outcome
+   step_log(potency, base = 10) %>%
+   ## Remove sparse and unbalanced predictors
+   step_nzv(all_predictors()) %>%
+   ## Surface area predictors are highly correlated so
+   ## conduct PCA just on these.
+   step_pca(contains("VSA"), prefix = "surf_area_",
+            threshold = .95) %>%
+   ## Remove other highly correlated predictors
+   step_corr(all_predictors(), -starts_with("surf_area_"),
+             threshold = .90) %>%
+   ## Center and scale all of the non-PCA predictors
+   step_center(all_predictors(), -starts_with("surf_area_")) %>%
+   step_scale(all_predictors(), -starts_with("surf_area_"))
> 
> cox2_lm <- train(cox2_recipe,
+                  data = cox2,
+                  method = "lm",
+                  trControl = trainControl(method = "cv"))
> cox2_lm
Linear Regression 

462 samples
255 predictors

Recipe steps: log, nzv, pca, corr, center, scale 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 417, 416, 416, 415, 414, 416, ... 
Resampling results:

  RMSE      Rsquared   MAE      
  1.161915  0.3889096  0.8710393

Tuning parameter 'intercept' was held constant at a value of TRUE
> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] recipes_0.1.2   broom_0.4.3     dplyr_0.7.4     caret_6.0-78    ggplot2_2.2.1   lattice_0.20-35

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.3   purrr_0.2.4        reshape2_1.4.3     kernlab_0.9-25     splines_3.4.3     
 [6] colorspace_1.3-2   stats4_3.4.3       yaml_2.1.16        survival_2.41-3    prodlim_1.6.1     
[11] rlang_0.1.6.9003   ModelMetrics_1.1.0 pillar_1.1.0       withr_2.1.1        foreign_0.8-69    
[16] glue_1.2.0         bindrcpp_0.2       foreach_1.4.4      bindr_0.1          plyr_1.8.4        
[21] dimRed_0.1.0       lava_1.6           robustbase_0.92-8  stringr_1.2.0      timeDate_3042.101 
[26] munsell_0.4.3      gtable_0.2.0       codetools_0.2-15   psych_1.7.8        parallel_3.4.3    
[31] class_7.3-14       DEoptimR_1.0-8     Rcpp_0.12.15       scales_0.5.0       ipred_0.9-6       
[36] CVST_0.2-1         mnormt_1.5-5       stringi_1.1.6      RcppRoll_0.2.2     ddalpha_1.3.1     
[41] grid_3.4.3         tools_3.4.3        magrittr_1.5       lazyeval_0.2.1     tibble_1.4.2      
[46] tidyr_0.7.2        DRR_0.0.3          pkgconfig_2.0.1    MASS_7.3-47        Matrix_1.2-12     
[51] lubridate_1.7.1    gower_0.1.2        assertthat_0.2.0   iterators_1.0.9    R6_2.2.2          
[56] rpart_4.1-12       sfsmisc_1.1-1      nnet_7.3-12        nlme_3.1-131       compiler_3.4.3
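
The error message in the original report names train.default, which suggests that in that session train() did not dispatch to a recipe-specific method and fell back to the default method, which expects a separate y argument. The sketch below illustrates that S3 fallback with a toy generic in base R. The names toy_train, toy_train.default, toy_train.recipe and rec are hypothetical stand-ins, not caret or recipes code, and the optional check at the end assumes caret is installed.

```r
# Toy generic standing in for caret::train(); all names here are hypothetical.
toy_train <- function(x, ...) UseMethod("toy_train")

# Default method expects predictors x and an outcome y (an x/y interface).
toy_train.default <- function(x, y, ...) {
  paste("default method with", length(y), "outcomes")
}

# An object carrying the class "recipe" (a stand-in, not a real recipes object).
rec <- structure(list(), class = "recipe")

# With no toy_train.recipe() defined, dispatch falls through to the default
# method, which fails as soon as it touches the missing y.
err <- tryCatch(toy_train(rec, data = mtcars),
                error = function(e) conditionMessage(e))
print(err)

# Once a recipe method exists, the same call dispatches to it instead.
toy_train.recipe <- function(x, data, ...) {
  paste("recipe method with", nrow(data), "rows")
}
res <- toy_train(rec, data = mtcars)
print(res)

# Optional check against the installed caret, if any: report its version and
# whether an S3 method of train() for class "recipe" can be found.
if (requireNamespace("caret", quietly = TRUE)) {
  print(utils::packageVersion("caret"))
  m <- utils::getS3method("train", "recipe", optional = TRUE,
                          envir = asNamespace("caret"))
  print(!is.null(m))
}
```

If the last check prints FALSE in a given session, a call such as train(cox2_recipe, ...) would be expected to land in train.default, as in the report above. Comparing the caret and recipes versions in the two session infos (6.0-73 with 0.1.1.9000 versus 6.0-78 with 0.1.2) is then a reasonable next step.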