Closed: RaphaelS1 closed this issue 3 years ago.
Can you post a minimal working example? How do you run the experiments on the server? What future backend are you using? I can imagine this is also caused by serialization of the R6 objects, which does not always work as expected when they are saved as RDS.
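For reference, a minimal sketch of how one might check whether a trained R6 learner survives an RDS round trip (the learner, task, and file path here are just placeholders, not what was actually run):

```r
library(mlr3)

# train a small learner, then round-trip it through RDS
learner = lrn("classif.rpart")
learner$train(tsk("iris"))

path = tempfile(fileext = ".rds")
saveRDS(learner, path)
restored = readRDS(path)

# if serialization broke the R6 object, predicting with the restored copy should fail
print(restored$predict(tsk("iris"))$score())
```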
> I can imagine this is also caused by serialization of the R6 objects, which does not always work as expected when they are saved as RDS.
This is interesting because at one point we got this error when using future: `Error in serialize(data, node$con) : error writing to connection`.
The strategy was 'multiprocess'; here's a screenshot of what it did to his computer...
Experiments were run through both RStudio and the console, and both had the same problems, although again the behaviour switched randomly, so at times the console ran instantly while RStudio lagged.
The code we ran was very simple:
```r
task = TaskSurv$new("untunedBenchmark", data, event = event, time = time)
task = po("encode", method = "treatment")$train(list(task))$output
learns = lrns(paste0("surv.", c("coxph", "cvglmnet", "randomForestSRC", "xgboost")))
rsmp = rsmp("cv", folds = cv_folds)
benchmark(benchmark_grid(task, learns, rsmp))
```
Can you try to disable the parallelization of the individual learners, e.g. `nthreads` in xgboost?
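For reference, a minimal sketch of what disabling it looks like, assuming the mlr3 xgboost learners expose xgboost's `nthread` parameter (the actual xgboost name is `nthread`, not `nthreads`):

```r
library(mlr3)
library(mlr3learners)  # surv.xgboost lives in mlr3proba and works analogously

# restrict xgboost's internal OpenMP threading to a single core
learner = lrn("classif.xgboost")
learner$param_set$values$nthread = 1
```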
> Can you try to disable the parallelization of the individual learners, e.g. `nthreads` in xgboost?
These are disabled by default (nthreads = 1 for xgboost). The same crashing also happened with Cox PH, which doesn't have any individual parallelization.
Please don't tell me that @jakob-r is correct and serialization now also screws us here, when objects are serialized to be copied over to other processes.
> locally and on an external server; locally, with the exact same set-up, the benchmark runs within 20 mins.
What do you mean by "locally" here? On that local machine, were parallel/futures also switched on, and did that work? Or was parallelization switched off?
Hi Bernd, on the server I tried both with and without futures and both were terribly slow. Single-threaded on the laptop was orders of magnitude faster and I cannot explain why...
> Please don't tell me that @jakob-r is correct and serialization now also screws us here, when objects are serialized to be copied over to other processes.
I don't know anything about serialisation, but the error above, `Error in serialize(data, node$con) : error writing to connection`, definitely indicates something going wrong in this process.
> What do you mean by "locally" here? On that local machine, were parallel/futures also switched on, and did that work? Or was parallelization switched off?
By "locally" I just mean on his personal computer, as opposed to a remote server. Both had an identical R set-up (same R version, same package versions).
> the error above, `Error in serialize(data, node$con) : error writing to connection`, definitely indicates something going wrong in this process
OTOH I am pretty sure I have seen this message also in contexts where R6 wasn't used (?)
> By "locally" I just mean on his personal computer, as opposed to a remote server. Both had an identical R set-up (same R version, same package versions).
Sure, but did you use futures/multicore locally or not? And have you tested futures on the server, without mlr3? Does this really work without problems?
> These are disabled by default (nthreads = 1 for xgboost). The same crashing also happened with Cox PH, which doesn't have any individual parallelization.
Are you sure? Where is the current implementation? `classif.xgboost`/`regr.xgboost`/`surv.xgboost` (in mlr3proba) don't have this set, so it defaults to OpenMP with the number of available cores.
Can you try to switch to the future plan "multisession" so that we can rule out nested threading/forking?
Also, this could be related to #482; do you have a fast file system locally and something comparably much slower remotely (e.g., SSD vs HDD, or SSD vs network storage)?
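A minimal sketch of what switching the plan looks like, reusing the `task`, `learns`, and `rsmp` objects from the snippet earlier in this thread:

```r
library(future)

# multisession starts separate R worker sessions instead of forking the main
# process, which rules out fork-related interactions with OpenMP threading
plan("multisession")

benchmark(benchmark_grid(task, learns, rsmp))
```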
> Are you sure? Where is the current implementation? `classif.xgboost`/`regr.xgboost`/`surv.xgboost` (in mlr3proba) don't have this set, so it defaults to OpenMP with the number of available cores.
You're right, I read the value off `nrounds` by mistake, so yes it would have been running.
> Can you try to switch to the future plan "multisession" so that we can rule out nested threading/forking?
@vollmersj when we next run on the server, can we add `nthreads = 1` to xgboost and set future to 'multisession', as suggested by Michel?
> Sure, but did you use futures/multicore locally or not? And have you tested futures on the server, without mlr3? Does this really work without problems?
Due to time constraints I doubt we will be able to test futures separately on that server
> Also, this could be related to #482; do you have a fast file system locally and something comparably much slower remotely (e.g., SSD vs HDD, or SSD vs network storage)?
This is interesting, but then would we not expect the difference to be constant? I.e., if we run the same experiment twice then, even with this problem, we'd still expect the same run-time (parallel or not), whereas we were finding situations where it lagged seemingly at random.
> Due to time constraints I doubt we will be able to test futures separately on that server
Huh? Isn't that just running 3 lines of code?
> Huh? Isn't that just running 3 lines of code?
Maybe I'm misunderstanding: how would you test future without mlr3?
Sorry, I see: literally just run any code with futures enabled. Sure, we could do this. So @vollmersj, if possible, just run their example code:
```r
library(future)
plan(multisession)

# dummy data so the snippet is runnable (x, y, w are otherwise undefined)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
w <- 1 + abs(x)

fA <- future(lm(y ~ x, weights = w))
fB <- future(lm(y ~ x - 1, weights = w))
fC <- future({
  w <- 1 + abs(x)
  lm(y ~ x, weights = w)
})

fitA <- value(fA)
fitB <- value(fB)
fitC <- value(fC)
print(fitA)
print(fitB)
print(fitC)
```
and see if that crashes the system...
{future} has been on the market for a long time and is well tested. In the background it only uses the common R parallelization backends and does not do anything fancy.
I'd also bet on the xgboost openMP parallelization interfering. Test that first and then we can narrow it down further. If that holds true, we should disable it by default in the xgboost learner.
> {future} has been on the market for a long time and is well tested. In the background it only uses the common R parallelization backends and does not do anything fancy. I'd also bet on the xgboost openMP parallelization interfering. Test that first and then we can narrow it down further. If that holds true, we should disable it by default in the xgboost learner.
This all makes sense, will let you know once this is tested. But to confirm: could xgboost's use of openMP cause other learners to crash if future is also being used?
If two parallel backends want to use all cores of the machine at the same time without accounting for each other, the ship might sink.
Hard to say what happens in practice or who takes precedence; I've never tried it and I'm not keen on doing so :)
Having openMP enabled by default is dangerous because users might not be aware of it and start other parallelization backends on top. This is also why it was already disabled in learners in mlr2.
One could argue that users should know whether they use parallelization, but again, if every developer turned it off by default, then we could avoid such clashes. If someone does it differently, issues will occur.
> One could argue that users should know whether they use parallelization, but again, if every developer turned it off by default, then we could avoid such clashes.
I don't necessarily think we should expect people to know this. I didn't know it because it never occurred to me that any package would turn it on by default. I personally find parallelisation quite a "personal" setting, in that my computer is not powerful and the last thing I want is something starting to use up my cores without me knowing about it!
Anyway I'll let you know how the tests go (tomorrow hopefully).
> One could argue that users should know whether they use parallelization
This is not my view but how some people look at it. There are more R packages out there that do this; now you know, and it will always be in the back of your mind 😛
> I personally find parallelisation quite a "personal" setting, in that my computer is not powerful and the last thing I want is something starting to use up my cores without me knowing about it
This is also the dogma of the {future} package :)
I've now had the chance to use a remote server again and found that my benchmark experiments were crashing in the same manner as the example at the top. For example, the experiment runs very quickly, then hits a particular model and never progresses (I waited 48 hours). Whilst I have not been able to test this for all models, I found that one of these crashes came from a learner in randomForestSRC, which by default has openMP enabled and can only be turned off by setting a global option. I have not yet been able to confirm whether this is the case for all my crashed models, but I suspect @pat-s's original assumption about this was correct. Whilst we can turn off parallelisation for 'nice' learners that give us a choice, I don't know if it is 'fair' to set global options for learners?
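For reference, a minimal sketch of what turning this off would look like, assuming randomForestSRC's documented `rf.cores` option (with `mc.cores` for its fork-based parallelism):

```r
# limit randomForestSRC to a single OpenMP thread and a single forked core
options(rf.cores = 1, mc.cores = 1)
```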
Sorry for not reading the whole thread, but openMP doesn't work with (unix-)`fork()`-based parallelization (like `parallel::mclapply`, and whatever future backend is based on that); see here and my message after that. Either don't use fork or disable the openMP parallelization (with env vars, parameter settings, options, or whatever is necessary). I do think that mlr3 should initialize objects in a way that makes them single-threaded by default.
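A minimal sketch of the env-var route; the assumption here is that the OpenMP runtime only reads `OMP_NUM_THREADS` when it initializes, so this must run before the relevant packages are loaded (setting it in `.Renviron` or in the shell before starting R is the safer option):

```r
# cap OpenMP threads for libraries that honour OMP_NUM_THREADS at initialization
Sys.setenv(OMP_NUM_THREADS = "1")
```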
Thanks @mb706! This confirms what Patrick was saying, and this issue further supports your argument for initializing objects with a single-threaded default.
I'd argue better safe than sorry here. We should turn off openMP if we can, including a message to the user. Otherwise such issues cost hours. Re-enabling is quick.
@RaphaelS1 If you have time, feel free to make the required changes in the randomForestSRC learner.
Okay, will do. For some, like randomForestSRC, it took me a while to find how they actually use openMP and where the user can set this, so I do agree a clear and informative message is required.
Yes, unfortunately everyone does it differently - nothing new for us. This is why mlr was born in the first place, wasn't it? 😄
Thanks!
Can you try again with leanify and the new internal data structure of RR/BMR?
Closing this as the error is confirmed to be due to parallelisation problems and not external servers; therefore referring to #546.
Sorry for this very vague bug report, but if anyone could help at all that would be great. With Sebastian (@vollmersj) we have been running some benchmarks locally and on an external server; locally, with the exact same set-up, the benchmark runs within 20 mins. On the server it takes 90 mins just to run a single tuning configuration. Using futures on the server caused many R sessions to run in the background and everything to crash. We've also noticed that, depending on which models are included in the benchmark, printing of the time-stamped updates is suppressed for some reason and they then all appear in one go at the very end. This does not appear to be caused by a single model or measure, and swapping them around does not change anything.
I appreciate there may not be a lot you can do with this report, but I've attached screenshots of the benchmark output and the session info, so if anyone has seen this problem before or has any suggestions, please let me know!
Thanks