mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

explicitly undo side effects of makeCluster #220

Open mtmorgan opened 5 years ago

mtmorgan commented 5 years ago

Some makeCluster* operations have side effects, e.g., opening connections:

> nrow(showConnections())
[1] 0
> cl = makeClusterFunctionsSocket(2)
> nrow(showConnections())
[1] 2

There is no way to 'undo' (e.g., destroyCluster(cl)) these side-effects, and they are not destroyed by, e.g., removeRegistry(). I realize that there is a finalizer, so

> rm(cl)
> gc()
          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells  542172 29.0     934164 49.9         NA   934164 49.9
Vcells 1473945 11.3    8388608 64.0      32768  8386115 64.0
> nrow(showConnections())
[1] 0

often works, but finalizers are not run in a deterministic order, so this is not robust.
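To make the distinction concrete, here is a minimal sketch of the pattern being asked for: an explicit, idempotent close function for deterministic cleanup, with the finalizer kept only as a safety net. It uses base R only. `make_resource()` and its `close()` member are hypothetical names for illustration, and a `textConnection()` stands in for a worker socket; none of this is batchtools API.

```r
# Hypothetical resource wrapper (not batchtools API): an environment that owns
# a connection, offers an explicit close(), and registers a finalizer as backup.
make_resource <- function() {
  res <- new.env()
  res$con <- textConnection("payload")  # stand-in for a worker socket
  res$open <- TRUE
  res$close <- function() {
    # Idempotent: safe to call explicitly and again later from the finalizer.
    if (res$open) {
      close(res$con)
      res$open <- FALSE
    }
    invisible(NULL)
  }
  # Safety net only; when (and in what order) this runs is up to the GC.
  reg.finalizer(res, function(e) e$close(), onexit = TRUE)
  res
}

before <- nrow(showConnections())
r <- make_resource()
during <- nrow(showConnections())  # one more connection than before
r$close()                          # deterministic: no rm()/gc() needed
after <- nrow(showConnections())   # back to the starting count
```

Because `close()` checks `res$open`, a later finalizer run becomes a no-op instead of touching a connection that is already gone.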

mllg commented 5 years ago

Thanks for reporting. I guess I need to implement something like cf$startCluster() and cf$stopCluster() and call them internally in submitJobs(). This has the drawback that submitJobs() would have to wait for all jobs to finish, and thus asynchronicity is lost.
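A rough sketch of what such start/stop hooks could look like, using the base `parallel` package rather than batchtools internals. `make_cf()` and `submit_sync()` are hypothetical helpers, and `startCluster()`/`stopCluster()` on the `cf` object are only the names floated above, not existing batchtools functions. It also shows the drawback mentioned: the submit step has to block until all results are back before the workers can be stopped.

```r
library(parallel)

# Hypothetical cluster-functions object with explicit lifecycle hooks.
make_cf <- function(ncpus) {
  cf <- new.env()
  cf$cl <- NULL
  cf$startCluster <- function() {
    cf$cl <- makePSOCKcluster(ncpus)  # opens one socket connection per worker
    invisible(cf)
  }
  cf$stopCluster <- function() {
    if (!is.null(cf$cl)) {
      parallel::stopCluster(cf$cl)    # shuts down workers, closes the sockets
      cf$cl <- NULL
    }
    invisible(cf)
  }
  cf
}

# Hypothetical stand-in for submitJobs(): synchronous by construction.
submit_sync <- function(cf, fun, xs) {
  cf$startCluster()
  on.exit(cf$stopCluster(), add = TRUE)  # runs even if a job errors
  parLapply(cf$cl, xs, fun)              # blocks until every job has returned
}

n_before <- nrow(showConnections())
cf <- make_cf(2)
res <- submit_sync(cf, function(x) x^2, 1:4)
n_after <- nrow(showConnections())       # same count as n_before
```

Using `on.exit()` ties the cleanup to the call itself, so no connection outlives `submit_sync()` and nothing depends on when a finalizer happens to run.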

Just out of curiosity, where did this come up? Is this a problem while running R CMD check or for real world applications?

mtmorgan commented 5 years ago

It is related to #221 and to checks in https://github.com/BiocParallel, both of which have consequences in real-world applications (I think). The connections are still open because the finalizer hasn't run. When it does run, the order in which finalizers run is not deterministic (https://stat.ethz.ch/pipermail/r-devel/2011-July/061612.html): each finalizer is added to a linked list of SEXP, and the order of elements in that list depends on what other objects are added to or removed from it. Periodically, the finalizer runs at a time when symbols it references (e.g., the socket connection used in serialize() to send the "DONE" signal to the worker) have already been cleaned up, and this signals an error.

For my use case, I would be happy to be able to enforce synchronicity by calling stopCluster() directly (I don't think the 'user' has direct access to the Socket instance?).