mlr-org / mlr

Machine Learning in R
https://mlr.mlr-org.com

Java based learners fail with parallelMap multicore #1898

Closed mb706 closed 4 years ago

mb706 commented 7 years ago

This is because fork(), which multicore is ultimately based on, and the Java VM don't play along well if Java is started before the forking happens. Loading Java-based packages, e.g. "RWeka", seems to start the JVM, so if the package gets loaded outside of (i.e. before) the parallelMap call, it fails.

> library("mlr")
Loading required package: ParamHelpers
> library("parallelMap")
> parallelStartMulticore(2)
Starting parallelization in mode=multicore with cpus=2.
> resample("classif.IBk", pid.task, cv5)  # loads RWeka, then calls parallelMap
Mapping in parallel: mode = multicore; cpus = 2; elements = 5.
[Resample] cross-validation iter 1: [Resample] cross-validation iter 2:
# hang

If, on the other hand, the fork happens before the JVM is loaded, it works fine:

> library("mlr")
Loading required package: ParamHelpers
> parallelStartMulticore(2)
Starting parallelization in mode=multicore with cpus=2.
> parallelMap(function(x) resample("classif.IBk", pid.task, cv5), 1:2, simplify=FALSE)
Mapping in parallel: mode = multicore; cpus = 2; elements = 2.
[Resample] cross-validation iter 1: [Resample] cross-validation iter 1: mmce.test.mean=0.275
[Resample] cross-validation iter 2: mmce.test.mean=0.331
[Resample] cross-validation iter 2: mmce.test.mean=0.273
# ...
# no hang
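For reference, the underlying conflict can be reproduced without mlr at all; a minimal sketch, assuming rJava is installed (whether it hangs or crashes may vary by platform and JVM):

library(rJava)
library(parallel)
.jinit()                                  # starts the JVM in the master process
# Calling into the already-running JVM from forked children typically
# deadlocks or crashes, which is the hang seen above:
mclapply(1:2, function(i) {
  .jcall("java/lang/System", "J", "currentTimeMillis")
}, mc.cores = 2)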

I therefore suggest adding a configureMlr option to defer loading of packages until a learner's train or predict function gets called. The user would still need to be careful not to load "RWeka" when they want to use multicore, but this would at least give them the option. When a learner gets constructed, instead of loading the learner's package, mlr should simply check whether the requested package exists.
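A minimal sketch of what such deferred loading could look like; the helper names below are hypothetical, not existing mlr functions:

# At learner construction time: only verify that the package is installed,
# without loading its namespace (loading RWeka's namespace would already
# start the JVM on the master).
package_installed = function(pkg) {
  nzchar(system.file(package = pkg))    # "" if the package is not installed
}

# At trainLearner()/predictLearner() time, i.e. inside the forked worker,
# actually load the package(s):
load_learner_packages = function(pkgs) {
  missing = pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
  if (length(missing) > 0)
    stop(sprintf("Please install the following packages: %s",
                 paste(missing, collapse = ", ")))
}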

mb706 commented 7 years ago

A current workaround is to load the learner from a save file. E.g. if the learner object is restored from the .RData file at startup (instead of being constructed in the session), resampling with multicore works:

> library("parallelMap")
> parallelStartMulticore(2)
Starting parallelization in mode=multicore with cpus=2.
> library("mlr")
Loading required package: ParamHelpers
> lrn = makeLearner("classif.IBk")
> resample(lrn, pid.task, cv5)
Mapping in parallel: mode = multicore; cpus = 2; elements = 5.
[Resample] cross-validation iter 1: [Resample] cross-validation iter 2: ^C^C^C^C^C
> q("yes")
$ R
R version 3.4.0 (2017-04-21) -- "You Stupid Darkness"
[...]
> library("mlr")
Loading required package: ParamHelpers
> library("parallelMap")
> parallelStartMulticore(2)
Starting parallelization in mode=multicore with cpus=2.
> resample(lrn, pid.task, cv5)
Mapping in parallel: mode = multicore; cpus = 2; elements = 5.
[Resample] cross-validation iter 1: [Resample] cross-validation iter 2: mmce.test.mean=0.266
mmce.test.mean=0.318
# no hang
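The same effect can presumably be achieved more explicitly with saveRDS()/readRDS(): construct the learner in a throwaway session, so the session that forks never loads RWeka on the master. Untested sketch, file name is arbitrary:

# Session 1 (throwaway): construct and serialize the learner; this session
# loads RWeka (and the JVM), but we quit it afterwards.
library(mlr)
lrn = makeLearner("classif.IBk")
saveRDS(lrn, "ibk_learner.rds")

# Session 2 (fresh R process): restore the learner object without constructing
# it, so RWeka / the JVM never get loaded on the master before the fork.
library(mlr)
library(parallelMap)
parallelStartMulticore(2)
lrn = readRDS("ibk_learner.rds")
resample(lrn, pid.task, cv5)
parallelStop()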
berndbischl commented 7 years ago

1) We did have that issue before, but not with the insights you presented here. It is also more of a parallelMap issue, right?

2) So the problem is that we load RWeka on the master, on learner construction, and that is what makes the bug appear?

mb706 commented 7 years ago
  1. AFAICS parallelMap cannot do much about it; when "multicore" is used and the JVM is already loaded, Java cannot be used (link). It also appears impossible to load a new JVM or unload the old one (link).
  2. Basically, loading anything that uses Java in the main process, be it RWeka, extraTrees, or rJava itself, will make it impossible to run a Java-based learner parallelized with parallelMap + multicore afterwards. The best we can do is not load rJava on purpose in the main process. If the user loaded it before for some other reason, there is nothing I can see we could do, except maybe check for this in the trainLearner function to prevent hanging (see the sketch below).
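
A conservative guard along these lines might look as follows; this is only a sketch (not existing mlr code), and it merely uses "rJava namespace is loaded" as a heuristic for "the JVM may already be running on the master":

# Heuristic check before dispatching work to forked (multicore) workers:
# if rJava's namespace is already loaded in the master process, a Java-backed
# package has most likely started the JVM, and forked workers may hang.
jvm_probably_running = function() {
  "rJava" %in% loadedNamespaces()
}

if (jvm_probably_running()) {
  warning("rJava is loaded in the master process; Java-based learners may ",
          "hang under parallelMap multicore mode. Consider socket mode instead.")
}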
Masutani commented 6 years ago

I ran my own rJava-based custom learner. It works fine single-threaded; however, with parallelStartSocket() I got a session timeout like this:


Exporting objects to slaves for mode socket: .mlr.slave.options
Mapping in parallel: mode = socket; cpus = 20; elements = 1.
Error in stopWithJobErrorMessages(inds, vcapply(result.list[inds], as.character)) :
  Errors occurred in 1 slave jobs, displaying at most 10 of them:

00001: Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : java.lang.IllegalStateException: This trial session has expired. Each trial session is limited to 120 minutes.

Is this caused by the same restriction on mclapply (parallelMap) compatibility with the JVM that you described here?

mb706 commented 6 years ago

parallelStartSocket is not based on mclapply and should not call it, so I am pretty sure this is not because of this issue.

(Note that parallelMap in "socket" mode behaves slightly differently from "multicore" mode, in that the worker jobs are executed in a (kind of) vanilla environment connected via sockets; you might have to call parallelExport and parallelLibrary with "socket" when you wouldn't need to with "multicore".)
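
For example, a typical socket-mode setup might look like this (sketch; my.helper stands in for any object your learner needs from the master session):

library(parallelMap)
parallelStartSocket(4)
parallelLibrary("mlr")        # attach mlr on every socket worker
parallelExport("my.helper")   # ship a master-side object to the workers (name is illustrative)
# ... run resample()/benchmark() here ...
parallelStop()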

Masutani commented 6 years ago

Hi, I confirmed the timeout is caused by something different from this issue, even though the single-threaded run didn't take that long. Still, parallelStartSocket is a good alternative to parallelStartMulticore. What are the drawbacks of socket mode compared to multicore? Only the overhead and the need to export libraries?

mb706 commented 6 years ago

Multicore uses the operating system's fork() to create child processes that have copy-on-write access to the parent process's memory. If you're working with a big dataset this means you can potentially have many processes operating on this data while only using up memory for the dataset once. (I think sometimes R's garbage collection messes this up and more memory gets used than needed, but usually it works). When you're using sockets, every individual worker process needs to separately load the data, so you have the overhead of (1) serialising the data from the main process and sending it to the worker processes and (2) keeping the data in memory for each process separately.

(I don't know parallelStartSocket that well however, so don't take my word for it.)
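
To make the difference concrete (a sketch; x just stands in for a large object such as a task's data):

library(parallelMap)
x = runif(1e7)                            # large object living in the master process

parallelStartMulticore(2)
parallelMap(function(i) sum(x) + i, 1:2)  # forked workers see x via copy-on-write
parallelStop()

parallelStartSocket(2)
parallelExport("x")                       # with sockets, x must be serialized and shipped
parallelMap(function(i) sum(x) + i, 1:2)
parallelStop()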

Masutani commented 6 years ago

Thanks for answering such a general question. I understand now that parallelStartSocket has significant overhead compared to parallelStartMulticore. In my case, a 40-core CPU cannot be exploited without multiple threads/processes, and the multicore option cannot be used for my Java-based code (because of the original issue in this thread). The socket solution seems to be the alternative in case of such incompatibility/scalability problems, and it is the only option on Windows. By the way, I hope multi-level parallelization (e.g. Benchmark * Resample) will be supported.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.