mlr-org / mlr3

mlr3: Machine Learning in R - next generation
https://mlr3.mlr-org.com
GNU Lesser General Public License v3.0

Benchmark function crashing on external servers #501

Closed RaphaelS1 closed 3 years ago

RaphaelS1 commented 4 years ago

Sorry for this very vague bug report, but if anyone could help at all that would be great. Sebastian (@vollmersj) and I have been running some benchmarks both locally and on an external server. Locally, with the exact same set-up, the benchmark runs within 20 minutes; on the server it takes 90 minutes just to run a single tuning configuration. Using futures on the server caused many R sessions to run in the background and all of them to crash. We've also noticed that, depending on which models are included in the benchmark, printing of the time-stamped updates is suppressed for some reason and they all appear in one go at the very end. This does not appear to be caused by a single model or measure, and swapping them around does not change anything.

I appreciate there may not be a lot you can do with this report, but I've attached screenshots of the benchmark output and the session info, so if anyone has seen this problem before or has any suggestions, please let me know!

Thanks

INFO  [06:28:23.958] 6 configurations evaluated 
INFO  [06:28:26.671] Evaluating 1 configurations 
INFO  [06:28:26.674]      alpha    gamma       eta subsample max_depth nrounds 
INFO  [06:28:26.674]  0.4444444 7.777778 0.2666667       0.3        10     133 
INFO  [06:28:28.679] Benchmark with 3 resampling iterations 
INFO  [06:28:28.681] Applying learner 'surv.xgboost' on task 'tunedBenchmark' (iter 1/3) 
INFO  [06:59:40.155] Applying learner 'surv.xgboost' on task 'tunedBenchmark' (iter 2/3) 
INFO  [07:35:04.909] Applying learner 'surv.xgboost' on task 'tunedBenchmark' (iter 3/3) 
INFO  [08:03:06.835] Finished benchmark 
INFO  [08:03:08.555] Result of batch 7: 
INFO  [08:03:08.559]      alpha    gamma       eta subsample max_depth nrounds surv.harrellC 
INFO  [08:03:08.559]  0.4444444 7.777778 0.2666667       0.3        10     133      0.612152 
INFO  [08:03:08.561] 7 configurations evaluated 
INFO  [08:03:11.257] Evaluating 1 configurations 
INFO  [08:03:11.260]      alpha    gamma       eta subsample max_depth nrounds 
INFO  [08:03:11.260]  0.5555556 1.111111 0.2666667 0.9222222         9     133 
INFO  [08:03:13.117] Benchmark with 3 resampling iterations 
INFO  [08:03:13.118] Applying learner 'surv.xgboost' on task 'tunedBenchmark' (iter 1/3) 
INFO  [08:29:35.588] Applying learner 'surv.xgboost' on task 'tunedBenchmark' (iter 2/3) 
> sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS
Matrix products: default
BLAS:   /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] C
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
 [1] SHAPforxgboost_0.0.4 paradox_0.2.0        mlr3pipelines_0.1.3  forcats_0.5.0       
 [5] stringr_1.4.0        dplyr_0.8.5          purrr_0.3.4          readr_1.3.1         
 [9] tidyr_1.1.0          tibble_3.0.1         ggplot2_3.3.0        tidyverse_1.3.0     
[13] mlr3learners_0.2.0   mlr3tuning_0.1.2     mlr3proba_0.1.5      mlr3_0.2.0          
loaded via a namespace (and not attached):
 [1] nlme_3.1-143          fs_1.4.1              lubridate_1.7.8      
 [4] RColorBrewer_1.1-2    httr_1.4.1            tools_3.6.2          
 [7] backports_1.1.7       R6_2.4.1              DBI_1.1.0            
[10] colorspace_1.4-1      withr_2.2.0           mlr3misc_0.2.0       
[13] tidyselect_1.1.0      curl_4.3              compiler_3.6.2       
[16] glmnet_4.0            cli_2.0.2             rvest_0.3.5          
[19] lgr_0.3.4             xml2_1.3.2            scales_1.1.1         
[22] checkmate_2.0.0       digest_0.6.25         foreign_0.8-72       
[25] rio_0.5.16            set6_0.1.4            pkgconfig_2.0.3      
[28] dbplyr_1.4.3          rlang_0.4.6           readxl_1.3.1         
[31] rstudioapi_0.11       BBmisc_1.11           shape_1.4.4          
[34] farver_2.0.3          generics_0.0.2        jsonlite_1.6.1       
[37] zip_2.0.4             car_3.0-7             magrittr_1.5         
[40] Matrix_1.2-18         Rcpp_1.0.4.6          munsell_0.5.0        
[43] fansi_0.4.1           abind_1.4-5           lifecycle_0.2.0      
[46] stringi_1.4.6         carData_3.0-3         MASS_7.3-51.4        
[49] grid_3.6.2            listenv_0.8.0         parallel_3.6.2       
[52] crayon_1.3.4          lattice_0.20-38       haven_2.2.0          
[55] splines_3.6.2         hms_0.5.3             pillar_1.4.4         
[58] ggpubr_0.3.0          uuid_0.1-4            ggsignif_0.6.0       
[61] xgboost_1.0.0.2       future.apply_1.5.0    codetools_0.2-16     
[64] reprex_0.3.0          glue_1.4.1            data.table_1.12.8    
[67] modelr_0.1.8          foreach_1.5.0         vctrs_0.3.0          
[70] tweenr_1.0.1          distr6_1.3.7          cellranger_1.1.0     
[73] gtable_0.3.0          polyclip_1.10-0       future_1.17.0        
[76] assertthat_0.2.1      ggforce_0.3.1         openxlsx_4.1.5       
[79] broom_0.5.6           pracma_2.2.9          rstatix_0.5.0        
[82] survival_3.1-8        randomForestSRC_2.9.3 iterators_1.0.12     
[85] globals_0.12.5        ellipsis_0.3.0        R62S3_1.4.1
jakob-r commented 4 years ago

Can you post a minimal working example? How do you run the experiments on the server? What future backend are you using? I can imagine this is also caused by serialization of the R6 objects not always working as expected / as it should when they are saved as RDS.

RaphaelS1 commented 4 years ago

I can imagine this is also caused by serialization of the R6 objects not always working as expected / as it should when they are saved as RDS.

This is interesting, because at one point we got the error Error in serialize(data, node$con) : error writing to connection when using future.

Strategy was 'multiprocess'; here's a screenshot of what it did to his computer:

[screenshot]

Experiments were tried by running through both RStudio and the console; both had the same problems, although again they switched randomly, so at times the console ran instantly and RStudio lagged.

The code we ran was very simple:

  # data, event, time and cv_folds are defined earlier in our script
  task = TaskSurv$new("untunedBenchmark", data, event = event, time = time)
  task = po("encode", method = "treatment")$train(list(task))$output
  learns = lrns(paste0("surv.", c("coxph", "cvglmnet", "randomForestSRC", "xgboost")))
  resampling = rsmp("cv", folds = cv_folds)
  benchmark(benchmark_grid(task, learns, resampling))
mllg commented 4 years ago

Can you try to disable the parallelization of the individual learners, e.g. nthreads in xgboost?

RaphaelS1 commented 4 years ago

Can you try to disable the parallelization of the individual learners, e.g. nthreads in xgboost?

These are disabled by default (nthreads = 1, for xgboost). Also the same crashing happened on Cox PH, which doesn't have any individual parallelization.

berndbischl commented 4 years ago

pls dont tell me that @jakob-r is correct and the serialization now also screws us here, when objects are serialized to be copied over to other processes

berndbischl commented 4 years ago

locally and on an external server; locally and with the exact same set-up the benchmark runs within 20mins.

what do you mean with "locally" here. on that local machine, was parallel / futures also switched on? and that did work? or was parallelization switched off?

vollmersj commented 4 years ago

Hi Bernd, on the server I tried both with and without futures and both were terribly slow. Single-threaded on my laptop was orders of magnitude faster and I cannot explain why...

RaphaelS1 commented 4 years ago

pls dont tell me that @jakob-r is correct and the serialization now also screws us here, when objects are serialized to be copied over to other processes

I don't know anything about serialisation, but the error above, Error in serialize(data, node$con) : error writing to connection, definitely indicates something going wrong in this process.

what do you mean with "locally" here. on that local machine, was parallel / futures also switched on? and that did work? or was parallelization switched off?

By locally I just mean on his personal computer as opposed to a remote server. Both had an identical R set-up (same R version, same package versions).

berndbischl commented 4 years ago

Error in serialize(data, node$con) : error writing to connection, definitely indicates something going wrong in this process.

OTOH I am pretty sure I have seen this message in contexts where R6 wasn't used (?)

By locally I just mean on his personal computer as opposed to a remote server. Both had an identical R set-up (same R version, same package versions)

sure, but did you use futures / multicore locally or not?

and have you tested futures on the server? without mlr3? does this really work without problems?

mllg commented 4 years ago

These are disabled by default (nthreads = 1, for xgboost). Also the same crashing happened on Cox PH, which doesn't have any individual parallelization.

Are you sure? Where is the current implementation? classif.xgboost/regr.xgboost/surv.xgboost (in mlr3proba) don't have this set, so it defaults to OpenMP with the number of available cores.
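(A hedged sketch of pinning it explicitly, in case it helps others hitting this: in the xgboost learners the thread-count parameter is called nthread and can be set through the learner's param_set. The parameter name and accessor are assumptions against the mlr3/mlr3learners versions listed above.)

```r
library(mlr3)
library(mlr3learners)

# Sketch: pin xgboost's internal OpenMP parallelization to a single thread.
# If nthread is left unset, xgboost's OpenMP runtime typically grabs all
# available cores, which clashes with future-based parallelization.
learner = lrn("classif.xgboost")
learner$param_set$values$nthread = 1
```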

mllg commented 4 years ago

Can you try to switch to the future plan "multisession" so that we can rule out nested threading/forking?

Also, this could be related to #482; do you have a fast file system locally and something comparatively much slower remote (e.g., SSD vs HDD, or SSD vs network storage)?
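(As a sketch, switching the backend would look like this; plan names are from the {future} API, where "multisession" launches separate background R sessions instead of forking.)

```r
library(future)

# "multisession" workers are fresh background R sessions, so no fork() is
# involved; this rules out clashes between forking and OpenMP threading
plan("multisession")

# ...then run the tuning/benchmark code as before
```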

RaphaelS1 commented 4 years ago

Are you sure? Where is the current implementation? classif.xgboost/regr.xgboost/surv.xgboost (in mlr3proba) don't have this set, so it defaults to OpenMP with the number of available cores.

You're right, I read the value off nrounds by mistake, so yes it would have been running.

Can you try to switch to the future plan "multisession" so that we can rule out nested threading/forking?

@vollmersj when we next run on the server can we add nthreads = 1 to xgboost and set future to 'multisession' as suggested by Michel.

sure, but did you use futures / multicore locally or not? and have you tested futures on the server? without mlr3? does this really work without problems?

Due to time constraints I doubt we will be able to test futures separately on that server

Also, this could be related to #482; do you have a fast file system locally and something comparably much slower remote (e.g., SSD vs HDD or SSD vs network storage?).

This is interesting, but then would we not expect the difference to be constant? That is, if we run the same experiment twice, then even with this problem we'd still expect the same run-time (parallel or not), whereas we were finding situations where it lagged seemingly at random.

berndbischl commented 4 years ago

Due to time constraints I doubt we will be able to test futures separately on that server

huh? isn't that just running 3 lines of code?

RaphaelS1 commented 4 years ago

huh? isn't that just running 3 lines of code?

Maybe I'm misunderstanding, how would you test future without mlr3?

RaphaelS1 commented 4 years ago

Sorry, I see: literally just run any code with futures enabled. Sure, we could do this. So @vollmersj, if possible, just run their example code

library(future)
plan(multisession)

# toy data so the example is self-contained
x <- rnorm(100)
y <- 2 * x + rnorm(100)
w <- runif(100)

fA <- future( lm(y ~ x, weights = w) )
fB <- future( lm(y ~ x - 1, weights = w) )
fC <- future({
  w <- 1 + abs(x)
  lm(y ~ x, weights = w)
})
fitA <- value(fA)
fitB <- value(fB)
fitC <- value(fC)
print(fitA)
print(fitB)
print(fitC)

and see if that crashes the system...

pat-s commented 4 years ago

{future} has been on the market for a long time and is well tested. Under the hood it only uses the common R parallelization backends and does not do anything fancy.

I'd also bet on the xgboost openMP parallelization interfering. Test that first and then we can narrow it down further. If that holds true, we should disable it by default in the xgboost learner.

RaphaelS1 commented 4 years ago

{future} has been on the market for a long time and is well tested. Under the hood it only uses the common R parallelization backends and does not do anything fancy. I'd also bet on the xgboost openMP parallelization interfering. Test that first and then we can narrow it down further. If that holds true, we should disable it by default in the xgboost learner.

This all makes sense, will let you know once this is tested. But to confirm: could xgboost's use of openMP cause other learners to crash if future is also being used?

pat-s commented 4 years ago

If two parallel backends want to use all cores of the machine at the same time without accounting for each other - the ship might sink.

Hard to say what happens in practice or who takes precedence. I've never tried it and I'm not keen on doing so :)

Having openMP enabled by default is dangerous because users might not be aware and instead start other parallelization backends. This is also why it was disabled in learners in mlr2 already.

One could now argue that users should know if they use parallelization - but again, if everyone (devs) would turn it off by default, then we could avoid such clashes. If one does it differently, issues will occur.
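(One blunt way to enforce this from outside R, hedged on the library actually using a standard OpenMP runtime: the OMP_NUM_THREADS environment variable, set before launching R, caps every OpenMP-enabled library in the session.)

```shell
# OMP_NUM_THREADS is defined by the OpenMP spec and read by the runtime
# at startup; exporting it before starting R caps xgboost, OpenBLAS, etc.
export OMP_NUM_THREADS=1
echo "$OMP_NUM_THREADS"
```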

RaphaelS1 commented 4 years ago

One could now argue that users should know if they use parallelization - but again, if everyone (devs) would turn it off by default, then we could avoid such clashes.

I don't necessarily think we should expect people to know this. I didn't know it, because it never occurred to me that any package would turn it on by default. I personally find parallelisation quite a "personal" setting, in that my computer is not powerful and the last thing I want is someone starting to use up my cores without me knowing about it!

Anyway I'll let you know how the tests go (tomorrow hopefully).

pat-s commented 4 years ago

One could now argue that users should know if they use parallelization

This is not my view but how some people look at it. There are more R packages out there that do this - now you know and it will always be in the back of your mind 😛

I personally find parallelisation quite a "personal" setting in that my computer is not powerful and the last thing I want is someone to start using up my cores without me knowing about it

This is also the dogma of the {future} package :)

RaphaelS1 commented 4 years ago

I've now had the chance to use a remote server again and found that my benchmark experiments were crashing in the same manner as the example at the top. For example, the experiment runs very quickly, then hits a particular model and never progresses (I waited 48 hours). Whilst I have not been able to test this for all models, I found that one of these crashes comes from a learner in randomForestSRC, which by default has openMP enabled and can only be turned off by setting a global option. I have not yet been able to confirm whether this is the case for all my crashed models, but I suspect that @pat-s's original assumption about this was correct. Whilst we can turn off parallelisation for "nice" learners that give us a choice, I don't know if it is "fair" to set global options for learners.
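(For reference, and hedged since option names may change between versions: the randomForestSRC documentation describes controlling its core usage through global options rather than a per-call argument.)

```r
# Per the randomForestSRC docs, the package reads its OpenMP core count
# from the "rf.cores" option (and "mc.cores" for fork-based parallelism);
# setting both to 1 forces single-threaded execution.
options(rf.cores = 1, mc.cores = 1)
```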

mb706 commented 4 years ago

Sorry for not reading the whole thread, but openMP doesn't work with (unix-)fork()-based parallelization (like parallel::mclapply, and whatever future backends are based on that); see here and my message after that. Either don't use fork, or disable parallelization (with env vars, parameter settings, options, or whatever is necessary). I do think that mlr3 should initialize objects in a way that makes them single-threaded by default.

RaphaelS1 commented 4 years ago

Thanks @mb706! This confirms what Patrick was saying, and this issue further supports your argument to initialize objects with a single-threaded default.

pat-s commented 4 years ago

I'd argue better safe than sorry here. We should turn off openMP if we can, including a message to the user. Otherwise such issues cost hours. Re-enabling is quick.

@RaphaelS1 If you have time, feel free to make the required changes in the randomForestSRC learner.

RaphaelS1 commented 4 years ago

Okay, will do. For some, like randomForestSRC, it took me a while to find out how they actually use openMP and where the user can set this, so I do agree a clear and informative message is required.

pat-s commented 4 years ago

Yes, unfortunately everyone does it differently - nothing new for us. This is why mlr was born in the first place, wasn't it? 😄

Thanks!

mllg commented 4 years ago

Can you try again with leanify and new internal data structure of RR/BMR?

RaphaelS1 commented 3 years ago

Closing this as the error is confirmed to be due to parallelisation problems and not external servers; therefore referring to #546.