mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

multicore collects children inappropriately #221

Open mtmorgan opened 5 years ago

mtmorgan commented 5 years ago

This script

library(batchtools)
res <- NULL
registry <- makeRegistry(tempfile())
registry$cluster.functions <- makeClusterFunctionsMulticore(2); gc()
ids = batchMap(identity, 1:2, more.args = list(), reg = registry); gc()
ids$chunk = chunk(ids$job.id, 2); gc()
submitJobs(ids = ids, reg = registry); gc()
waitForJobs(ids = ids, reg = registry); gc()
res <- reduceResultsList(ids = ids, reg = registry); gc()
clearRegistry(reg=registry); gc()

generates warnings like

> waitForJobs(ids = ids, reg = registry); gc()

[1] TRUE
Warning messages:
1: In selectChildren(jobs, timeout) :
  cannot wait for child 59480 as it does not exist
2: In selectChildren(jobs, timeout) :
  cannot wait for child 59481 as it does not exist
          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells  655286 35.0    1399907 74.8         NA  1399907 74.8
Vcells 1349214 10.3    8388608 64.0      32768  3190967 24.4
mllg commented 5 years ago

I cannot reproduce this on Arch Linux with R version 3.5.2. Can you provide a sessionInfo()?

Also, I'm not sure how to solve this. I see no way to collect the results from the forked processes except by calling mccollect(). I could suppress the warning, but that is more of a workaround than a solution.
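A minimal sketch of what suppressing only this warning could look like, using base R condition handling. The helper name muffle_missing_child is hypothetical and not part of batchtools or parallel; the example simulates the warnings with warning() instead of forking, so it only shows the filtering logic.

```r
# Hypothetical helper (not part of batchtools): muffle only warnings whose
# message contains "cannot wait for child" and let all others propagate.
muffle_missing_child <- function(expr) {
  withCallingHandlers(
    expr,
    warning = function(w) {
      if (grepl("cannot wait for child", conditionMessage(w), fixed = TRUE)) {
        invokeRestart("muffleWarning")
      }
    }
  )
}

# Usage with simulated warnings. The outer handler records whatever the
# helper did not muffle.
kept <- character(0)
value <- withCallingHandlers(
  muffle_missing_child({
    warning("cannot wait for child 59480 as it does not exist")
    warning("some other warning")
    "done"
  }),
  warning = function(w) {
    kept <<- c(kept, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
```

Matching on the message text is fragile (it would break if the message is translated or reworded), which is one more reason to treat this as a workaround.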

mtmorgan commented 5 years ago
> sessionInfo()
R Under development (unstable) (2019-02-08 r76071)
Platform: x86_64-apple-darwin17.7.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /Users/ma38727/bin/R-devel/lib/libRblas.dylib
LAPACK: /Users/ma38727/bin/R-devel/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] batchtools_0.9.11 data.table_1.12.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0        prettyunits_1.0.2 withr_2.1.2       digest_0.6.18
 [5] crayon_1.3.4      assertthat_0.2.0  rappdirs_0.3.1    R6_2.3.0
 [9] backports_1.1.3   magrittr_1.5      rlang_0.3.1       progress_1.2.0
[13] stringi_1.2.4     fs_1.2.6          brew_1.0-6        checkmate_1.9.1
[17] tools_3.6.0       hms_0.4.2         parallel_3.6.0    compiler_3.6.0
[21] pkgconfig_2.0.2   base64url_1.4

I don't think it's mccollect per se, but rather the finalizer on the R6 class running at the wrong time. I don't know enough about R6 classes to help further.
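For readers unfamiliar with finalizers, here is a small base R illustration of the general mechanism: a finalizer registered on an object runs when that object is garbage collected, so an explicit gc() can trigger it at a point the surrounding code did not anticipate. This uses a plain environment and reg.finalizer(), not R6 or batchtools, and is not meant to show what batchtools actually does internally.

```r
# Track whether the finalizer has run.
finalizer_ran <- FALSE

# Register a finalizer on a plain environment.
e <- new.env()
reg.finalizer(e, function(env) finalizer_ran <<- TRUE)

# While `e` is still referenced, the finalizer has not run.
ran_before <- finalizer_ran

# Drop the only reference and force a collection; the finalizer runs
# as part of garbage collection.
rm(e)
invisible(gc())
ran_after <- finalizer_ran
```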

mschubert commented 5 years ago

I had a similar issue in the past and, if I remember correctly, the warning listed PIDs different from the ones I was trying to collect.

If that is the case here, I think suppressing them and making sure the requested PIDs are cleaned up (e.g. using tools::pskill) is the right approach. (Happy to be corrected by Martin or anyone else!)

vjcitn commented 4 years ago

The warning 1: In selectChildren(jobs, timeout) : cannot wait for child 59480 as it does not exist is a show stopper for multicore batchtools on CentOS. Do you need more details?

mllg commented 4 years ago

The multicore code has changed frequently in recent R releases, so I'm not sure there is a generic solution. I've made a small fix for R-3.6.x which should reduce the number of warnings (8d471e128f7bc51399da516fdf35bde7d02f34c1). Does this help?

HenrikBengtsson commented 4 years ago

Hopefully, @mllg's commit fixes this problem, but if not ...

All y'all, the warning cannot wait for child NNNNN as it does not exist was introduced in R 3.5.0. There were some bugs causing this warning to occur even when it shouldn't. It could be reproduced using the 'parallel' package alone. That particular problem was fixed in R 3.5.2.

For those who report seeing this warning, please make sure to share (a) what version of R you are using, and (b) what operating system you are on. Sharing your sessionInfo() covers both of these and more. If you're using R (>= 3.5.0 & < 3.5.2), then that's why you get the warning.
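A quick way to check whether a given R version falls in that range, as a rough sketch. The function name in_affected_range is made up for illustration; getRversion() and as.numeric_version() are base R.

```r
# Hypothetical helper: TRUE if version `v` satisfies >= 3.5.0 and < 3.5.2,
# the range given in the comment above. Defaults to the running R version.
in_affected_range <- function(v = getRversion()) {
  v <- as.numeric_version(v)
  v >= "3.5.0" && v < "3.5.2"
}

# Check the running session.
if (in_affected_range()) {
  message("R ", getRversion(), " is in the range 3.5.0 <= version < 3.5.2")
}

# Check explicit version strings.
checks <- vapply(
  c("3.4.4", "3.5.0", "3.5.1", "3.5.2", "3.6.0"),
  in_affected_range,
  logical(1)
)
```

A result of FALSE only rules out that particular R bug; it says nothing about other causes of the warning.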

If it turns out that there is still a bug in R itself, it would be awesome to narrow this down so that it can be resolved there.

My $.02