topepo / caret

caret (Classification And Regression Training) R package that contains misc functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html
1.62k stars 632 forks source link

XGBoost parallelization fails when twoClassSummary is the first summaryFunc used #1106

Open khughitt opened 4 years ago

khughitt commented 4 years ago

Overview

When using caret to train an XGBoost model with parallelization at the resample level, execution continues indefinitely (or at least, for more than several hours in my tests), when the first summaryFunc used is of type twoClassSummary.

For the same code, if a call is first made to train() on any arbitrary data, however, the executions proceeds as expected, and in the case of the example below, finishes in seconds.

This effect does not appear to depend on the parallelization back-end used; tested with:

It also occurs if the twoClassSummary is used along-side of other summary functions as part of a yardstick metric_set().

Based on this behavior, I'm guessing there is some kind of initialization of the interface with the parallelization back-end that normally occurs during the first invocation to train() which is not taking place as-expected, in the case below.

Minimal, reproducible example:

To see the behavior in action, first run the code below as-is, and (at least in the linux environment I'm testing it in), it will likely continue running for a long time.

Next, uncomment the commented first call to train() and re-run the script. It will execute in parallel and finish quickly as expected.

library(caret)
set.seed(1)

# create fake data
test_dat <- twoClassSim(20)

library("doFuture")
registerDoFuture()
plan(multiprocess, workers = parallel::detectCores() - 1)

# fast, when the lines below are uncommented...
# mod <- train(Class ~ ., data = test_dat, method = "xgbTree",
#              tuneLength = 1, nthread = 1,
#              trControl = trainControl(search = "random"))
# message("warm-up done...")

train_control <- trainControl(search = 'random',
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary)

t1 <- Sys.time()
mod <- train(Class ~ ., data = test_dat, method = "xgbTree", tuneLength = 10,
             nthreads = 1, trControl = train_control,
             metric = 'ROC')
t2 <- Sys.time()
t2 - t1

# with warm-up: ~2 seconds
# without warm-up: > 3hrs

Session Info:

Tested for two different versions of R/caret the same Linux machine:

R 3.6.1

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS:   /usr/lib/libopenblasp-r0.3.7.so
LAPACK: /usr/lib/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] doFuture_0.8.2   iterators_1.0.12 foreach_1.4.7    future_1.15.1    globals_0.12.5  
[6] caret_6.0-84     ggplot2_3.2.1    lattice_0.20-38 

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3         pillar_1.4.2       compiler_3.6.1     gower_0.2.1        plyr_1.8.5        
 [6] tools_3.6.1        class_7.3-15       digest_0.6.23      rpart_4.1-15       ipred_0.9-9       
[11] lubridate_1.7.4    lifecycle_0.1.0    tibble_2.1.3       nlme_3.1-143       gtable_0.3.0      
[16] pkgconfig_2.0.3    rlang_0.4.2        Matrix_1.2-17      prodlim_2019.11.13 e1071_1.7-3       
[21] stringr_1.4.0      withr_2.1.2        dplyr_0.8.3        generics_0.0.2     recipes_0.1.7     
[26] xgboost_0.90.0.2   stats4_3.6.1       grid_3.6.1         nnet_7.3-12        tidyselect_0.2.5  
[31] data.table_1.12.8  glue_1.3.1         listenv_0.8.0      R6_2.4.1           survival_3.1-8    
[36] lava_1.6.6         reshape2_1.4.3     purrr_0.3.3        magrittr_1.5       ModelMetrics_1.2.2
[41] scales_1.1.0       codetools_0.2-16   MASS_7.3-51.4      splines_3.6.1      assertthat_0.2.1  
[46] timeDate_3043.102  colorspace_1.4-1   stringi_1.4.3      lazyeval_0.2.2     munsell_0.5.0     
[51] crayon_1.3.4      

R nightly (2019-12-10)

R Under development (unstable) (2019-12-10 r77549)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS:   /usr/lib/libopenblasp-r0.3.7.so
LAPACK: /usr/lib/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] doFuture_0.8.2   iterators_1.0.12 foreach_1.4.7    future_1.15.1    globals_0.12.5  
[6] caret_6.0-84     ggplot2_3.2.1    lattice_0.20-38 

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3         pillar_1.4.2       compiler_4.0.0     gower_0.2.1        plyr_1.8.5        
 [6] tools_4.0.0        class_7.3-15       digest_0.6.23      rpart_4.1-15       ipred_0.9-9       
[11] lubridate_1.7.4    lifecycle_0.1.0    tibble_2.1.3       nlme_3.1-143       gtable_0.3.0      
[16] pkgconfig_2.0.3    rlang_0.4.2        Matrix_1.2-18      prodlim_2019.11.13 e1071_1.7-3       
[21] stringr_1.4.0      withr_2.1.2        dplyr_0.8.3        generics_0.0.2     recipes_0.1.7     
[26] xgboost_0.90.0.2   stats4_4.0.0       grid_4.0.0         nnet_7.3-12        tidyselect_0.2.5  
[31] data.table_1.12.8  glue_1.3.1         listenv_0.8.0      R6_2.4.1           survival_3.1-8    
[36] lava_1.6.6         reshape2_1.4.3     purrr_0.3.3        magrittr_1.5       ModelMetrics_1.2.2
[41] scales_1.1.0       codetools_0.2-16   MASS_7.3-51.4      splines_4.0.0      assertthat_0.2.1  
[46] timeDate_3043.102  colorspace_1.4-1   stringi_1.4.3      lazyeval_0.2.2     munsell_0.5.0     
[51] crayon_1.3.4      
topepo commented 4 years ago

I also see a difference but my times aren't nearly so bad (at least on my setup, see below). I would stay away form benchmarking with r-devel unless it is very close to release.

Perhaps @HenrikBengtsson might have an idea.

library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
set.seed(1)

# create fake data
test_dat <- twoClassSim(20)

library("doFuture")
#> Loading required package: globals
#> Loading required package: future
#> 
#> Attaching package: 'future'
#> The following object is masked from 'package:caret':
#> 
#>     cluster
#> Loading required package: foreach
#> Loading required package: iterators
#> Loading required package: parallel
registerDoFuture()
plan(multiprocess, workers = parallel::detectCores() - 1)
#> Warning: [ONE-TIME WARNING] Forked processing ('multicore') is disabled
#> in future (>= 1.13.0) when running R from RStudio, because it is
#> considered unstable. Because of this, plan("multicore") will fall
#> back to plan("sequential"), and plan("multiprocess") will fall back to
#> plan("multisession") - not plan("multicore") as in the past. For more details,
#> how to control forked processing or not, and how to silence this warning in
#> future R sessions, see ?future::supportsMulticore

system.time({
  set.seed(252)
  mod <- train(
    Class ~ .,
    data = test_dat,
    method = "xgbTree",
    tuneLength = 1,
    nthread = 1,
    trControl = trainControl(search = "random")
  )
})
#>    user  system elapsed 
#>   0.739   0.050  27.367

train_control <- trainControl(search = 'random',
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary)

system.time({
  set.seed(252)
  mod <-
    train(
      Class ~ .,
      data = test_dat,
      method = "xgbTree",
      tuneLength = 10,
      nthreads = 1,
      trControl = train_control,
      metric = 'ROC'
    )
})
#> Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
#> There were missing values in resampled performance measures.
#>    user  system elapsed 
#>   0.905   0.040   2.196

Created on 2020-01-01 by the reprex package (v0.3.0)

Session info ``` r devtools::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 3.6.1 (2019-07-05) #> os macOS Mojave 10.14.6 #> system x86_64, darwin15.6.0 #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz America/New_York #> date 2020-01-01 #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date lib source #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0) #> backports 1.1.5 2019-10-02 [1] CRAN (R 3.6.0) #> callr 3.4.0 2019-12-09 [1] CRAN (R 3.6.0) #> caret * 6.0-84.9000 2020-01-02 [1] Github (topepo/caret@1a2fcb2) #> class 7.3-15 2019-01-01 [1] CRAN (R 3.6.0) #> cli 2.0.0 2019-12-09 [1] CRAN (R 3.6.0) #> codetools 0.2-16 2018-12-24 [1] CRAN (R 3.6.1) #> colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.6.0) #> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0) #> data.table 1.12.6 2019-10-18 [1] CRAN (R 3.6.0) #> desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.0) #> devtools 2.2.1 2019-09-24 [1] CRAN (R 3.6.0) #> dials 0.0.4 2019-12-02 [1] CRAN (R 3.6.1) #> DiceDesign 1.8-1 2019-07-31 [1] CRAN (R 3.6.0) #> digest 0.6.23 2019-11-23 [1] CRAN (R 3.6.1) #> doFuture * 0.8.1 2019-07-20 [1] CRAN (R 3.6.0) #> dplyr 0.8.3 2019-07-04 [1] CRAN (R 3.6.0) #> e1071 1.7-3 2019-11-26 [1] CRAN (R 3.6.0) #> ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.0) #> evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0) #> fansi 0.4.0 2018-10-05 [1] CRAN (R 3.6.0) #> foreach * 1.4.7 2019-07-27 [1] CRAN (R 3.6.0) #> fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.0) #> future * 1.15.1 2019-11-25 [1] CRAN (R 3.6.0) #> generics 0.0.2 2018-11-29 [1] CRAN (R 3.6.0) #> ggplot2 * 3.2.1 2019-08-10 [1] CRAN (R 3.6.0) #> globals * 0.12.5 2019-12-07 [1] CRAN (R 3.6.0) #> glue 1.3.1 2019-03-12 [1] CRAN (R 3.6.0) #> gower 0.2.1 2019-05-14 [1] CRAN (R 3.6.0) #> GPfit 1.0-8 2019-02-08 [1] CRAN (R 3.6.0) #> gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.0) #> highr 0.8 2019-03-20 [1] CRAN (R 3.6.0) #> htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.0) #> ipred 0.9-9 2019-04-28 [1] CRAN (R 3.6.0) #> iterators * 1.0.12 2019-07-26 [1] CRAN (R 3.6.0) #> knitr 1.26 2019-11-12 [1] CRAN (R 3.6.0) #> lattice * 0.20-38 2018-11-04 [1] CRAN (R 3.6.1) #> lava 1.6.6 2019-08-01 [1] CRAN (R 3.6.0) #> lazyeval 0.2.2 2019-03-15 [1] CRAN (R 3.6.0) #> lhs 1.0.1 2019-02-03 [1] CRAN (R 3.6.0) #> lifecycle 0.1.0 2019-08-01 [1] CRAN (R 3.6.0) #> listenv 0.8.0 2019-12-05 [1] CRAN (R 3.6.1) #> lubridate 1.7.4 2018-04-11 [1] CRAN (R 3.6.0) #> magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0) #> MASS 7.3-51.5 2019-12-20 [1] CRAN (R 3.6.0) #> Matrix 1.2-18 2019-11-27 [1] CRAN (R 3.6.0) #> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.0) #> ModelMetrics 1.2.2 2018-11-03 [1] CRAN (R 3.6.0) #> munsell 0.5.0 2018-06-12 [1] CRAN (R 3.6.0) #> nlme 3.1-140 2019-05-12 [1] CRAN (R 3.6.1) #> nnet 7.3-12 2016-02-02 [1] CRAN (R 3.6.0) #> parsnip 0.0.4.9000 2019-12-14 [1] local #> pillar 1.4.3 2019-12-20 [1] Github (r-lib/pillar@e2e7926) #> pkgbuild 1.0.6 2019-10-09 [1] CRAN (R 3.6.0) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.0) #> pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.0) #> plyr 1.8.5 2019-12-10 [1] CRAN (R 3.6.0) #> prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.6.0) #> pROC 1.15.3 2019-07-21 [1] CRAN (R 3.6.0) #> processx 3.4.1 2019-07-18 [1] CRAN (R 3.6.0) #> prodlim 2019.11.13 2019-11-17 [1] CRAN (R 3.6.0) #> ps 1.3.0 2018-12-21 [1] CRAN (R 3.6.0) #> purrr 0.3.3 2019-10-18 [1] CRAN (R 3.6.0) #> R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.0) #> Rcpp 1.0.3 2019-11-08 [1] CRAN (R 3.6.0) #> recipes 0.1.8 2019-12-18 [1] local #> remotes 2.1.0 2019-06-24 [1] CRAN (R 3.6.0) #> reshape2 1.4.3 2017-12-11 [1] CRAN (R 3.6.0) #> rlang 0.4.2 2019-11-23 [1] CRAN (R 3.6.0) #> rmarkdown 2.0 2019-12-12 [1] CRAN (R 3.6.0) #> rpart 4.1-15 2019-04-12 [1] CRAN (R 3.6.0) #> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.0) #> scales 1.1.0 2019-11-18 [1] CRAN (R 3.6.0) #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0) #> stringi 1.4.3 2019-03-12 [1] CRAN (R 3.6.0) #> stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.0) #> survival 2.40-1 2016-10-30 [1] CRAN (R 3.6.1) #> testthat 2.3.1 2019-12-01 [1] CRAN (R 3.6.0) #> tibble 2.1.3 2019-06-06 [1] CRAN (R 3.6.0) #> tidyr 1.0.0 2019-09-11 [1] CRAN (R 3.6.0) #> tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.6.0) #> timeDate 3043.102 2018-02-21 [1] CRAN (R 3.6.0) #> usethis 1.5.1.9000 2019-12-18 [1] Github (r-lib/usethis@b2e894e) #> vctrs 0.2.1 2019-12-17 [1] CRAN (R 3.6.1) #> withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.0) #> workflows 0.1.0 2019-12-30 [1] CRAN (R 3.6.1) #> xfun 0.11 2019-11-12 [1] CRAN (R 3.6.0) #> xgboost 0.90.0.2 2019-08-01 [1] CRAN (R 3.6.0) #> yaml 2.2.0 2018-07-25 [1] CRAN (R 3.6.0) #> zeallot 0.1.0 2018-01-28 [1] CRAN (R 3.6.0) #> #> [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library ```

and

library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
set.seed(1)

# create fake data
test_dat <- twoClassSim(20)

library("doFuture")
#> Loading required package: globals
#> Loading required package: future
#> 
#> Attaching package: 'future'
#> The following object is masked from 'package:caret':
#> 
#>     cluster
#> Loading required package: foreach
#> Loading required package: iterators
#> Loading required package: parallel
registerDoFuture()
plan(multiprocess, workers = parallel::detectCores() - 1)
#> Warning: [ONE-TIME WARNING] Forked processing ('multicore') is disabled
#> in future (>= 1.13.0) when running R from RStudio, because it is
#> considered unstable. Because of this, plan("multicore") will fall
#> back to plan("sequential"), and plan("multiprocess") will fall back to
#> plan("multisession") - not plan("multicore") as in the past. For more details,
#> how to control forked processing or not, and how to silence this warning in
#> future R sessions, see ?future::supportsMulticore

# system.time({
#   set.seed(252)
#   mod <- train(
#     Class ~ .,
#     data = test_dat,
#     method = "xgbTree",
#     tuneLength = 1,
#     nthread = 1,
#     trControl = trainControl(search = "random")
#   )
# })

train_control <- trainControl(search = 'random',
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary)

system.time({
  set.seed(252)
  mod <-
    train(
      Class ~ .,
      data = test_dat,
      method = "xgbTree",
      tuneLength = 10,
      nthreads = 1,
      trControl = train_control,
      metric = 'ROC'
    )
})
#> Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
#> There were missing values in resampled performance measures.
#>    user  system elapsed 
#>   0.782   0.047  27.792

Created on 2020-01-01 by the reprex package (v0.3.0)

Session info ``` r devtools::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 3.6.1 (2019-07-05) #> os macOS Mojave 10.14.6 #> system x86_64, darwin15.6.0 #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz America/New_York #> date 2020-01-01 #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date lib source #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0) #> backports 1.1.5 2019-10-02 [1] CRAN (R 3.6.0) #> callr 3.4.0 2019-12-09 [1] CRAN (R 3.6.0) #> caret * 6.0-84.9000 2020-01-02 [1] Github (topepo/caret@1a2fcb2) #> class 7.3-15 2019-01-01 [1] CRAN (R 3.6.0) #> cli 2.0.0 2019-12-09 [1] CRAN (R 3.6.0) #> codetools 0.2-16 2018-12-24 [1] CRAN (R 3.6.1) #> colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.6.0) #> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0) #> data.table 1.12.6 2019-10-18 [1] CRAN (R 3.6.0) #> desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.0) #> devtools 2.2.1 2019-09-24 [1] CRAN (R 3.6.0) #> dials 0.0.4 2019-12-02 [1] CRAN (R 3.6.1) #> DiceDesign 1.8-1 2019-07-31 [1] CRAN (R 3.6.0) #> digest 0.6.23 2019-11-23 [1] CRAN (R 3.6.1) #> doFuture * 0.8.1 2019-07-20 [1] CRAN (R 3.6.0) #> dplyr 0.8.3 2019-07-04 [1] CRAN (R 3.6.0) #> ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.0) #> evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0) #> fansi 0.4.0 2018-10-05 [1] CRAN (R 3.6.0) #> foreach * 1.4.7 2019-07-27 [1] CRAN (R 3.6.0) #> fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.0) #> future * 1.15.1 2019-11-25 [1] CRAN (R 3.6.0) #> generics 0.0.2 2018-11-29 [1] CRAN (R 3.6.0) #> ggplot2 * 3.2.1 2019-08-10 [1] CRAN (R 3.6.0) #> globals * 0.12.5 2019-12-07 [1] CRAN (R 3.6.0) #> glue 1.3.1 2019-03-12 [1] CRAN (R 3.6.0) #> gower 0.2.1 2019-05-14 [1] CRAN (R 3.6.0) #> GPfit 1.0-8 2019-02-08 [1] CRAN (R 3.6.0) #> gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.0) #> highr 0.8 2019-03-20 [1] CRAN (R 3.6.0) #> htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.0) #> ipred 0.9-9 2019-04-28 [1] CRAN (R 3.6.0) #> iterators * 1.0.12 2019-07-26 [1] CRAN (R 3.6.0) #> knitr 1.26 2019-11-12 [1] CRAN (R 3.6.0) #> lattice * 0.20-38 2018-11-04 [1] CRAN (R 3.6.1) #> lava 1.6.6 2019-08-01 [1] CRAN (R 3.6.0) #> lazyeval 0.2.2 2019-03-15 [1] CRAN (R 3.6.0) #> lhs 1.0.1 2019-02-03 [1] CRAN (R 3.6.0) #> lifecycle 0.1.0 2019-08-01 [1] CRAN (R 3.6.0) #> listenv 0.8.0 2019-12-05 [1] CRAN (R 3.6.1) #> lubridate 1.7.4 2018-04-11 [1] CRAN (R 3.6.0) #> magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0) #> MASS 7.3-51.5 2019-12-20 [1] CRAN (R 3.6.0) #> Matrix 1.2-18 2019-11-27 [1] CRAN (R 3.6.0) #> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.0) #> ModelMetrics 1.2.2 2018-11-03 [1] CRAN (R 3.6.0) #> munsell 0.5.0 2018-06-12 [1] CRAN (R 3.6.0) #> nlme 3.1-140 2019-05-12 [1] CRAN (R 3.6.1) #> nnet 7.3-12 2016-02-02 [1] CRAN (R 3.6.0) #> parsnip 0.0.4.9000 2019-12-14 [1] local #> pillar 1.4.3 2019-12-20 [1] Github (r-lib/pillar@e2e7926) #> pkgbuild 1.0.6 2019-10-09 [1] CRAN (R 3.6.0) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.0) #> pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.0) #> plyr 1.8.5 2019-12-10 [1] CRAN (R 3.6.0) #> prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.6.0) #> pROC 1.15.3 2019-07-21 [1] CRAN (R 3.6.0) #> processx 3.4.1 2019-07-18 [1] CRAN (R 3.6.0) #> prodlim 2019.11.13 2019-11-17 [1] CRAN (R 3.6.0) #> ps 1.3.0 2018-12-21 [1] CRAN (R 3.6.0) #> purrr 0.3.3 2019-10-18 [1] CRAN (R 3.6.0) #> R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.0) #> Rcpp 1.0.3 2019-11-08 [1] CRAN (R 3.6.0) #> recipes 0.1.8 2019-12-18 [1] local #> remotes 2.1.0 2019-06-24 [1] CRAN (R 3.6.0) #> reshape2 1.4.3 2017-12-11 [1] CRAN (R 3.6.0) #> rlang 0.4.2 2019-11-23 [1] CRAN (R 3.6.0) #> rmarkdown 2.0 2019-12-12 [1] CRAN (R 3.6.0) #> rpart 4.1-15 2019-04-12 [1] CRAN (R 3.6.0) #> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.0) #> scales 1.1.0 2019-11-18 [1] CRAN (R 3.6.0) #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0) #> stringi 1.4.3 2019-03-12 [1] CRAN (R 3.6.0) #> stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.0) #> survival 2.40-1 2016-10-30 [1] CRAN (R 3.6.1) #> testthat 2.3.1 2019-12-01 [1] CRAN (R 3.6.0) #> tibble 2.1.3 2019-06-06 [1] CRAN (R 3.6.0) #> tidyr 1.0.0 2019-09-11 [1] CRAN (R 3.6.0) #> tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.6.0) #> timeDate 3043.102 2018-02-21 [1] CRAN (R 3.6.0) #> usethis 1.5.1.9000 2019-12-18 [1] Github (r-lib/usethis@b2e894e) #> vctrs 0.2.1 2019-12-17 [1] CRAN (R 3.6.1) #> withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.0) #> workflows 0.1.0 2019-12-30 [1] CRAN (R 3.6.1) #> xfun 0.11 2019-11-12 [1] CRAN (R 3.6.0) #> xgboost 0.90.0.2 2019-08-01 [1] CRAN (R 3.6.0) #> yaml 2.2.0 2018-07-25 [1] CRAN (R 3.6.0) #> zeallot 0.1.0 2018-01-28 [1] CRAN (R 3.6.0) #> #> [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library ```
topepo commented 4 years ago

Here they are when run in the terminal instead of the RStudio IDE (where I see no difference)

> library(caret)
Loading required package: lattice
Loading required package: ggplot2
set.seed(1)

# create fake data
test_dat <- twoClassSim(20)

library("doFuture")
registerDoFuture()
plan(multiprocess, workers = parallel::detectCores() - 1)

system.time({
  set.seed(252)
  mod <- train(
    Class ~ .,
    data = test_dat,
    method = "xgbTree",
    tuneLength = 1,
    nthread = 1,
    trControl = trainControl(search = "random")
  )
})

train_control <- trainControl(search = 'random',
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary)

system.time({
  set.seed(252)
  mod <-
    train(
      Class ~ .,
      data = test_dat,
      method = "xgbTree",
      tuneLength = 10,
      nthreads = 1,
      trControl = train_control,
      metric = 'ROC'
    )
})
> set.seed(1)
> 
> # create fake data
> test_dat <- twoClassSim(20)
> 
> library("doFuture")
Loading required package: globals
Loading required package: future

Attaching package: ‘future’

The following object is masked from ‘package:caret’:

    cluster

Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
> registerDoFuture()
> plan(multiprocess, workers = parallel::detectCores() - 1)
> 
> system.time({
+   set.seed(252)
+   mod <- train(
+     Class ~ .,
+     data = test_dat,
+     method = "xgbTree",
+     tuneLength = 1,
+     nthread = 1,
+     trControl = trainControl(search = "random")
+   )
+ })
   user  system elapsed 
  3.595   1.820   1.202 
> 
> train_control <- trainControl(search = 'random',
+                               classProbs = TRUE,
+                               summaryFunction = twoClassSummary)
> 
> system.time({
+   set.seed(252)
+   mod <-
+     train(
+       Class ~ .,
+       data = test_dat,
+       method = "xgbTree",
+       tuneLength = 10,
+       nthreads = 1,
+       trControl = train_control,
+       metric = 'ROC'
+     )
+ })
   user  system elapsed 
 17.026   5.149   2.237 
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.

and

> library(caret)
Loading required package: lattice
Loading required package: ggplot2
set.seed(1)

# create fake data
test_dat <- twoClassSim(20)

library("doFuture")
registerDoFuture()
plan(multiprocess, workers = parallel::detectCores() - 1)

# system.time({
#   set.seed(252)
#   mod <- train(
#     Class ~ .,
#     data = test_dat,
#     method = "xgbTree",
#     tuneLength = 1,
#     nthread = 1,
#     trControl = trainControl(search = "random")
#   )
# })

train_control <- trainControl(search = 'random',
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary)

system.time({
  set.seed(252)
  mod <-
    train(
      Class ~ .,
      data = test_dat,
      method = "xgbTree",
      tuneLength = 10,
      nthreads = 1,
      trControl = train_control,
      metric = 'ROC'
    )
})
> set.seed(1)
> 
> # create fake data
> test_dat <- twoClassSim(20)
> 
> library("doFuture")
Loading required package: globals
Loading required package: future

Attaching package: ‘future’

The following object is masked from ‘package:caret’:

    cluster

Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
> registerDoFuture()
> plan(multiprocess, workers = parallel::detectCores() - 1)
> 
> # system.time({
> #   set.seed(252)
> #   mod <- train(
> #     Class ~ .,
> #     data = test_dat,
> #     method = "xgbTree",
> #     tuneLength = 1,
> #     nthread = 1,
> #     trControl = trainControl(search = "random")
> #   )
> # })
> 
> train_control <- trainControl(search = 'random',
+                               classProbs = TRUE,
+                               summaryFunction = twoClassSummary)
> 
> system.time({
+   set.seed(252)
+   mod <-
+     train(
+       Class ~ .,
+       data = test_dat,
+       method = "xgbTree",
+       tuneLength = 10,
+       nthreads = 1,
+       trControl = train_control,
+       metric = 'ROC'
+     )
+ })
   user  system elapsed 
 17.392   5.128   2.261 
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.
HenrikBengtsson commented 4 years ago

Hi, there could be multiple problems going on here. First, because 'xgboost' uses multi-threading internally, I suspect it is not safe to use forked parallel processing. Second, it looks like there are some non-exportable object/globals needed in the parallel processing that corrupts the R state.

For troubleshooting, it's useful to understand which foreach backends end up using forked processing (aka "multicore" processing). All of the following end up use the same forked processes framework of the 'parallel' package, which is also what parallel::mclapply() uses:

Comment: Note that plan(multiprocess) equals plan(multicore) on setups where multicore processing is supported (Linux and macOS but not MS Windows and not in the RStudio Console), otherwise it equals plan(multisession) where it uses PSOCK clusters. So, for troubleshooting it always better to use explicit plan(multicore) or plan(multisession).

For simplicity, let's use:

doFuture::registerDoFuture()
future::plan("multicore", workers = 2L)

Using this, I can reproduce the original problem - calling train() takes forever on R 3.6.1 on Linux (Ubuntu 18.04.3). It almost appears to stall completely, but using strace Rscript caret_issue_1106.R reveals that it is not completely stalled but instead it appears to stall for some time, then move very slightly forward, the stall again, and so on. I haven't waited for it to complete but, yes, I can imagine that will take hours and hours to finish.

So, forked processing and multi-threading should be avoided in R, a first attempt is to disable (OpenMPI) multi-threading used by xgboost. This can be done by setting:

OMP_NUM_THREADS=1

before R starts, e.g. by setting it in an .Renviron file. An alternative way to disable multi-threading is to call:

RhpcBLASctl::omp_set_num_threads(threads = 1L)

in R before calling train().

The above solves the problem; the script no longer stalls.

At this point, I thought the combination of forked processing and multi-threading was the sole reason. However, it looks like there's more to it because it also stalls when using PSOCK clusters;

doFuture::registerDoFuture()
future::plan("multisession", workers = 2L)

Since the workers in a PSOCK cluster, contrary to forked workers, are R processes independent of the main R session, I would not expect multi-threading to be an issue. In other words, there's something else going on too. BTW, note that all of the following creates PSOCK clusters:

What surprises me is that disabling multi-threading (as above) solves the problem.

Thus, there is something that requires xgboost to run in single-threaded mode when used in parallel code. This could be specific to caret or could apply for all parallel setups - I don't know.

Finally, by setting:

options(future.globals.onReference = "error")

before calling train() we tell the future framework to look for external references/pointers among the global objects exported to the parallel workers and produce an error if one is detected. Sure enough, doing so when using doFuture::doRegisterFuture() will report:

Error: Detected a non-exportable reference ('externalptr') in one of the globals (<unknown>) used in the future expression

which suggests that a non-exportable object is used by the workers and this could cause a race condition. The traceback for the above error is:

 1: source("caret_issue_1106.R")
 2: withVisible(eval(ei, envir))
 3: eval(ei, envir)
 4: eval(ei, envir)
 5: caret_issue_1106.R#81: system.time({
    mod <- train(Class ~ ., data = test_dat, method = "xgbTree", tuneLength = 10, nthreads = 1, 
 6: caret_issue_1106.R#82: train(Class ~ ., data = test_dat, method = "xgbTree", tuneLength = 10, nthreads = 1, trControl = train_control
 7: train.formula(Class ~ ., data = test_dat, method = "xgbTree", tuneLength = 10, nthreads = 1, trControl = train_control, metric = "ROC
 8: train(x, y, weights = w, ...)
 9: train.default(x, y, weights = w, ...)
10: nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, method = models, ppOpts = preProcess, ctrl = trControl, lev = cla
11: foreach(iter = seq(along = resampleIndex), .combine = "c", .verbose = FALSE, .export = export, .packages = "caret") %:% foreach(parm 
12: e$fun(obj, substitute(ex), parent.frame(), e$data)
13: getGlobalsAndPackages_doFuture(expr, envir = envir, export = obj$export, noexport = c(obj$noexport, argnames), packages = obj$package
14: getGlobalsAndPackages(expr, envir = globals_envir, globals = TRUE)
15: system.time({
    assert_no_references(globals, action = action)
}, gcFirst = FALSE)
16: assert_no_references(globals, action = action)

Details

If one run with options(error = utils::recover) and signals SIGINT (Ctrl-C) when train() stalls, we'll see that "multicore" seems to be stuck at:

11: foreach(iter = seq(along = resampleIndex), .combine = "c", .verbose = FALSE, .export = export, .packages = "car
12: e$fun(obj, substitute(ex), parent.frame(), e$data)
13: mclapply(argsList, FUN, mc.preschedule = preschedule, mc.set.seed = set.seed, mc.silent = silent, mc.cores = co
14: selectChildren(ac[!fin], -1)

and if we use a PSOCK cluster, it is stuck at:

11: foreach(iter = seq(along = resampleIndex), .combine = "c", .verbose = FALSE, .export = export, .packages = "car
12: e$fun(obj, substitute(ex), parent.frame(), e$data)
13: clusterApplyLB(cl, argsList, evalWrapper)
14: dynamicClusterApply(cl, fun, length(x), argfun)
15: recvOneResult(cl)
16: recvOneData(cl)
17: recvOneData.SOCKcluster(cl)
18: socketSelect(socklist)

Using strace show that both of these are stalled with kernel system call:

select(8, [6 7], [], NULL, {tv_sec=60, tv_usec=0}

I'm sure someone can dig into all the code and show there's some low-level multi-threaded race conditions going on here.

HenrikBengtsson commented 4 years ago

FWIW, it's been on my todo list for quite a while to automagically protect against using multi-threading and forked processing at the same time. A few days ago I decided to go ahead an implement this in the future framework. Most likely it'll automatically set RhpcBLASctl::omp_set_num_threads(1L) for "multicore" futures, cf. https://github.com/HenrikBengtsson/future/issues/355. But again, that would only solve half of the problem going on here.

topepo commented 4 years ago

Thanks @HenrikBengtsson ; this is extremely helpful and I would have never figured this out.

I suspect that this would also apply to ranger (as well as a bunch of others). I'm surprised that this has not come up previously.

HenrikBengtsson commented 4 years ago

You're welcome - I'm happy to help. Yeah, I think this happens for other packages too, but I don't know enough to tell exactly when multi-threading works and doesn't work with forked processing. It's on my way-too-long wishlist to read up on this and dive into the details - so this would be less of a "guessing game" (at least for me). I'm sure there are folks in the R community who have a better sense of what's going on and would be able to put together a list of do's and don't's.