topepo / caret

caret (Classification And Regression Training) R package that contains misc functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

caret::train() function taking 50+ seconds to execute #1284

Closed sumongithub closed 2 years ago

sumongithub commented 2 years ago

Hello,

I'm using the caret::train() function with K-fold cross-validation (example given below). The dataset has 1470 rows and 40 columns, with one target variable and 38 X variables.

The call takes more than 50 seconds to execute:

train_control <- trainControl(method = "cv", number = 5)
system.time(
  model_lm <- train(YearsAtCompany~. -EmployeeNumber, 
                  data = hrdatanew, 
                  methods = "lm",
                  trControl = train_control)
)

#   user  system elapsed 
# 50.107   0.319  50.406 

The model output shows Random Forest, even though the method I provided is "lm":

model_lm

# Random Forest 

# 1470 samples
#  39 predictor

# No pre-processing
# Resampling: Cross-Validated (5 fold) 
# Summary of sample sizes: 1175, 1177, 1175, 1176, 1177 
# Resampling results across tuning parameters:

#  mtry  RMSE      Rsquared   MAE     
#   2    3.383678  0.7809184  2.190747
#  20    2.246496  0.8668412  1.178069
#  38    2.237824  0.8671793  1.192514

# RMSE was used to select the optimal model using the smallest value.
# The final value used for the model was mtry = 38.
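A possible explanation for both symptoms, judging only from the call above: the argument is spelled `methods`, not `method`. `train()` documents `method = "rf"` as its default, and a misspelled name that is not a prefix of any formal argument falls into `...` rather than raising an error. That would mean a random forest is tuned over three `mtry` values with 5-fold resampling, which is far more work than a single linear model. The sketch below illustrates the mechanism with a toy stand-in function (`toy_train` is hypothetical and is not caret's source), then shows a correctly spelled call on the built-in `mtcars` data, which runs only if caret is installed.

```r
# Toy stand-in (hypothetical, not caret's code): a defaulted `method`
# argument followed by `...`, which silently collects unknown names.
toy_train <- function(formula, data, method = "rf", ...) {
  extra <- list(...)
  list(method = method, unused = names(extra))
}

# Misspelled argument: `methods` is not matched to `method`,
# so the default is kept and "methods" ends up in `...`.
res <- toy_train(y ~ x, data = NULL, methods = "lm")
res$method
res$unused

# Correctly spelled argument: the requested method is used.
res_ok <- toy_train(y ~ x, data = NULL, method = "lm")
res_ok$method

# The same spelling fix applied to a real caret call (mtcars is an
# assumption used here only to keep the example self-contained).
if (requireNamespace("caret", quietly = TRUE)) {
  ctrl <- caret::trainControl(method = "cv", number = 5)
  fit <- caret::train(mpg ~ ., data = mtcars,
                      method = "lm", trControl = ctrl)
  print(fit$method)
}
```

If this is the cause, changing `methods = "lm"` to `method = "lm"` in the original call should make the printed model header read "Linear Regression" and reduce the run time substantially.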

Almost the same execution time was observed without K-fold cross-validation as well.

I'm using a MacBook Pro Max (64 GB, 32-core GPU).

Session information is below:

Session Info:

>sessionInfo()

R version 4.1.3 (2022-03-10)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.3.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] caret_6.0-92 lattice_0.20-45 Metrics_0.1.4 car_3.0-12 carData_3.0-5 rsample_0.1.1
[7] fastDummies_1.6.3 gridExtra_2.3 ggplot2_3.3.5 readxl_1.4.0

loaded via a namespace (and not attached):
[1] Rcpp_1.0.8.3 lubridate_1.8.0 tidyr_1.2.0 listenv_0.8.0 class_7.3-20
[6] assertthat_0.2.1 digest_0.6.29 ipred_0.9-12 foreach_1.5.2 utf8_1.2.2
[11] parallelly_1.31.0 R6_2.5.1 cellranger_1.1.0 plyr_1.8.7 hardhat_0.2.0
[16] stats4_4.1.3 evaluate_0.15 pillar_1.7.0 rlang_1.0.2 rstudioapi_0.13
[21] data.table_1.14.2 furrr_0.2.3 rpart_4.1.16 Matrix_1.4-1 rmarkdown_2.13
[26] labeling_0.4.2 splines_4.1.3 gower_1.0.0 stringr_1.4.0 munsell_0.5.0
[31] compiler_4.1.3 xfun_0.30 pkgconfig_2.0.3 globals_0.14.0 htmltools_0.5.2
[36] nnet_7.3-17 tidyselect_1.1.2 tibble_3.1.6 prodlim_2019.11.13 codetools_0.2-18
[41] randomForest_4.7-1 fansi_1.0.3 future_1.24.0 crayon_1.5.1 dplyr_1.0.8
[46] withr_2.5.0 MASS_7.3-56 recipes_0.2.0 ModelMetrics_1.2.2.2 grid_4.1.3
[51] nlme_3.1-157 gtable_0.3.0 lifecycle_1.0.1 DBI_1.1.2 magrittr_2.0.3
[56] pROC_1.18.0 scales_1.2.0 future.apply_1.8.1 cli_3.2.0 stringi_1.7.6
[61] farver_2.1.0 reshape2_1.4.4 timeDate_3043.102 ellipsis_0.3.2 generics_0.1.2
[66] vctrs_0.4.1 lava_1.6.10 iterators_1.0.14 tools_4.1.3 glue_1.6.2
[71] purrr_0.3.4 abind_1.4-5 parallel_4.1.3 fastmap_1.1.0 survival_3.3-1
[76] yaml_2.3.5 colorspace_2.0-3 knitr_1.38