mlr-org / mlr3

mlr3: Machine Learning in R - next generation
https://mlr3.mlr-org.com
GNU Lesser General Public License v3.0

Question about parallelization and CPU usage #828

Closed bblodfon closed 2 years ago

bblodfon commented 2 years ago

Hi,

I have tested this benchmark script on two servers, one with 32 CPUs and one with 256 CPUs. I never get all CPUs utilized: only around 10 out of 32 and fewer than 100 out of 256 run at 100%, respectively. I thought that all CPUs would be used with such a multisession configuration? I had the same expectation for nested-CV using this script, but again fewer CPUs were fully utilized during execution. Any thoughts on why this is happening, i.e. is it expected/normal behaviour?
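For reference, a minimal sketch of such a multisession benchmark setup (the tasks and learners below are illustrative placeholders, not the ones from the linked scripts):

library(mlr3)
library(mlr3learners)
library(future)

# start one background R session per available core
plan("multisession")

# small benchmark design; tasks and learners are placeholders
design = benchmark_grid(
  tasks = tsks(c("sonar", "spam")),
  learners = lrns(c("classif.rpart", "classif.ranger")),
  resamplings = rsmp("cv", folds = 10)
)
bm = benchmark(design)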

be-marc commented 2 years ago
glmnet_lrn  = lrn('surv.glmnet', id = 'CoxLasso', standardize = FALSE, lambda = 0.01, alpha = 1)
xgboost_lrn = lrn('surv.xgboost', id = 'XGBoost Survival Learner')
rpart_lrn   = lrn('surv.rpart', id = 'Survival Tree')
ranger_lrn  = lrn('surv.ranger', id = 'Survival Forest', verbose = FALSE)

These learners will probably fit a model very fast, i.e. a single CPU will be at 100% only for a very short time. Try setting a higher value for nrounds of surv.xgboost or num.trees of surv.ranger and watch your CPUs again.
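For example, a minimal sketch (the concrete values are illustrative, not recommendations):

# more boosting rounds / more trees per fit, so each worker stays busy longer
xgboost_lrn = lrn('surv.xgboost', id = 'XGBoost Survival Learner', nrounds = 500)
ranger_lrn  = lrn('surv.ranger', id = 'Survival Forest', num.trees = 2000, verbose = FALSE)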

bblodfon commented 2 years ago

Thanks, I will try that! I really thought that the implicit parallelization of e.g. surv.ranger would interfere with future's parallelization, as the documentation says.

Also, the batch_size in tuning seems to really affect total CPU utilization - I think its importance may not be stressed enough in the parallelization part of the mlr3book!

mllg commented 2 years ago

We have implicit parallelization turned off in ranger to avoid any interference. You can enable it with set_threads() (but I would not recommend this for all parallelization backends).
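For example, a minimal sketch (assuming surv.ranger is available via mlr3proba / mlr3extralearners; the thread count is illustrative):

library(mlr3)
library(mlr3proba)

ranger_lrn = lrn('surv.ranger', id = 'Survival Forest')
set_threads(ranger_lrn, n = 4)  # re-enable ranger's internal threading with 4 threads per fit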

I'll include a warning in the book about the batch size.

mllg commented 2 years ago

https://github.com/mlr-org/mlr3book/commit/3173ed79201a3470c5329e12e522649d972fbb58

bblodfon commented 2 years ago

@mllg I also found the following while doing nested-CV with a learner that doesn't support implicit parallelization (or has it turned off by default, like ranger) and that takes at least a few seconds per training iteration on a given dataset: combining future::plan(list("sequential", "multisession")) with a larger batch_size for the inner tuner was (across my benchmarks) much faster and showed better CPU utilization (more cores were used) than the same batch_size with future::plan(list("multisession", "sequential")). I don't know if that's how it is supposed to work in general, but you always have to test and see :)
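For context, a minimal sketch of the kind of setup meant here (the task, learner, and all numbers are illustrative placeholders, not the actual benchmark):

library(mlr3)
library(mlr3proba)
library(mlr3tuning)
library(future)

# outer resampling sequential, inner tuning parallel
plan(list("sequential", "multisession"))

at = auto_tuner(
  tuner = tnr("random_search", batch_size = 5),            # 5 configurations evaluated per batch
  learner = lrn("surv.ranger", num.trees = to_tune(100, 1000)),
  resampling = rsmp("cv", folds = 3),                      # inner CV
  measure = msr("surv.cindex"),
  term_evals = 20
)

rr = resample(tsk("rats"), at, rsmp("cv", folds = 5))      # outer CV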

I am now trying to figure out whether there is any benefit to running nested-CV with some future plan + implicit parallelization enabled (to some extent), compared to doing everything sequentially and setting a large number of threads for the learner, e.g. ranger.

The total number of available CPUs is a major factor, and I think some generic rules of thumb would be a great addition to the documentation as well. For example: I have 32 CPUs in total and want to do nested-CV, so I set future::plan(list("multisession", "sequential")) with 5 outer folds and batch_size = 5 in the tuner to utilize 25 CPUs (close to everything, but not all).

bblodfon commented 2 years ago

You included it already, that's okay :)