mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0
170 stars 51 forks source link

Mutlicore does not work properly (Warning in selectChildren) #242

Open gsteinbu opened 4 years ago

gsteinbu commented 4 years ago

I am having some trouble with the Multicore cluster for parallel computation. It seems that once there are more jobs than CPU's submitJobs() keeps running but does not submit more jobs (i.e. no CPU usage). Here a mini example.

library("batchtools") # Version 0.9.11
tmp = makeRegistry(file.dir = NA, make.default = FALSE)
tmp$cluster.functions = makeClusterFunctionsMulticore(1)
piApprox = function(n) {
  nums = matrix(runif(2 * n), ncol = 2)
  d = sqrt(nums[, 1]^2 + nums[, 2]^2)
  4 * mean(d <= 1)
}
batchMap(fun = piApprox, n = rep(1e5, 2), reg = tmp)
# submitJobs(ids = 1, reg = tmp) # Works
submitJobs(ids = findJobs(reg = tmp), reg = tmp) # Gets stuck

Once I stop the execution, there is multiple warnings like the ones described in #221 (see below). However, in my case jobs will not be completed which makes it (at least for me) quite annoying.

Submitting 2 jobs in 2 chunks using cluster functions 'Multicore' ...

Warnmeldungen:
1: In selectChildren(jobs, timeout) :
  cannot wait for child 22536 as it does not exist
2: In selectChildren(jobs, timeout) :
  cannot wait for child 22536 as it does not exist

I already found a fix, but I am not sure if it is a good one. I changed line 63 of 'clusterFunctionsMulticore.R' as follows.

self$jobs = self$jobs[count <= 1L] # old
self$jobs = self$jobs[count < 1L] # new

With this change it seems to work again. I got the impression that in the original version it does not reduce the self$jobs table because the value for count seems to be originally 0 and then count + 1 is also <= 1. Thus, it gets stuck in the repeat loop. However, with a quick search I didn't find anything in your commits that explains this (although it worked back in the day for me.). Here is some additional info on my setup.

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Manjaro Linux

Matrix products: default
BLAS:   /usr/lib/libopenblasp-r0.3.6.so
LAPACK: /usr/lib/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] batchtools_0.9.11 data.table_1.12.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1        prettyunits_1.0.2 withr_2.1.2       assertthat_0.2.1  digest_0.6.18     crayon_1.3.4     
 [7] rappdirs_0.3.1    R6_2.4.0          backports_1.1.4   magrittr_1.5      rlang_0.3.4       progress_1.2.2   
[13] stringi_1.4.3     fs_1.3.1          rstudioapi_0.10   brew_1.0-6        checkmate_1.9.1   tools_3.6.1      
[19] hms_0.4.2         parallel_3.6.1    compiler_3.6.1    pkgconfig_2.0.2   base64url_1.4    
HenrikBengtsson commented 4 years ago

FWIW, that warning was introduced in R 3.5.0 as they added an internal sanity run-time check. That revealed some problem, which were fixed in R 3.5.1 patched. In other words, there was a bug in R (>= 3.5.0 & < 3.5.2) that produced this warning and all we could do was to suppress it. See https://github.com/HenrikBengtsson/future/issues/218 for the whole story.

Now, that does not explain why it appears for you in R 3.6.1, but if you can reproduce it with plain 'parallel' code, e.g. some mc*() functions, then you're onto something.

gsteinbu commented 4 years ago

Thanks for the input @HenrikBengtsson! Up front: I think the warning is ok here. I.e., I think parallel works as intended but just different to earlier version of R.

Based on it I checked with an old version of R (3.4.4) on another system. I think the issue is that in the newer R version the second call to mccollect() behaves different. In the old version the second is not just 'NULL' but still a named list with a NULL item:

> mccollect = function(jobs, timeout = 0) {
+   parallel::mccollect(jobs, wait = FALSE, timeout = timeout)
+ }
> 
> job <- parallel::mcparallel(1)
> Sys.sleep(0.1)
> mccollect(job$pid)
$`2428`
[1] 1

> mccollect(job$pid)
$`2428`
NULL

> mccollect(job$pid)
NULL

Here the same with the new R version:

> mccollect = function(jobs, timeout = 0) {
+   parallel::mccollect(jobs, wait = FALSE, timeout = timeout)
+ }
> 
> job <- parallel::mcparallel(1)
> Sys.sleep(0.1)
> mccollect(job$pid)
$`87015`
[1] 1

> mccollect(job$pid)
NULL
Warning message:
In selectChildren(jobs, timeout) :
  cannot wait for child 87015 as it does not exist
> mccollect(job$pid)
NULL
Warning message:
In selectChildren(jobs, timeout) :
  cannot wait for child 87015 as it does not exist

It produces the warnings (which are justified here I think?) but also the second call is just NULL. Thus, the self$jobs tables count is never increased to 2 (due to the if (is.null(res)) break) and the outer repeat loop (in line 40) is stuck.

It also turns out my quick fix above does not work (it already seemed like a weird fix to me). I think the second call to mccollect() is somehow necessary. However, I don't know enough about the whole process to know why (or if that is actually the issue).

gsteinbu commented 4 years ago

I tried a few more things. However, I can't get it working properly. With my initial approach to fix it, memory usage keeps increasing and processes do not seem to be handled correct anymore (although I am not sure why). In all cases (with my fix or not) the batchtools tests pass though. My guess is that the issue arose with a fix they did in R 3.6.0: to prevent zombie processes they added an rmChild(x) in mccollect() (see Commit in r-source). This makes the second call to mccollect() unnecessary, and results in the NULL result in the second call to it already (as described in my previous comment). However, I am unsure if the possible zombie process is the only reason for the second call to mccollect() in batchtools. Since my initial fix does not really work, I guess there are other reasons as well or the way collecting from jobs is done in batchtools depends on this in other ways. I now switch to another dirty fix that seems to do better. I don't like this fix much either though. Basically at the start of 'clusterFunctionsMulticore.R' there is a version of the mccollect() function that is used for compatibility with R versions prior to 3.3.2. I modified the code such that this mccollect() function is used for R versions greater or equal to 3.6.0 as well. Following the corresponding change in line 1 of 'clusterFunctionsMulticore.R'.

if ((getRversion() < "3.3.2" | getRversion() >= "3.6.0") && .Platform$OS.type != "windows") {
# OLD: if (getRversion() < "3.3.2" && .Platform$OS.type != "windows") {

I did not test this fix at length yet, so it might not really work well. There is also a lot of the warnings mention earlier from selectChildren() since this function is used in the version of parallel on the system. However, at least it seems that experiments are computed again.