mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

batchtools doesn't correctly handle crashed jobs #212

Closed LeanderK closed 5 years ago

LeanderK commented 5 years ago

Apparently there's a problem with one of my algorithms, because when I type

getLog(2283)

I get:

[39] "27: parallel::mcparallel(doJobCollection(jc, output = jc$log.file), mc.set.seed = FALSE)"
[40] "28: p$spawn(jc)"
[41] "29: reg$cluster.functions$submitJob(reg = reg, jc = jc)"
[42] "30: submitJobs(ids = ids_not_done)"
[43] "An irrecoverable exception occurred. R is aborting now ..."

The problem is that batchtools doesn't recognize that the job crashed, and doesn't terminate it.

mllg commented 5 years ago

It looks like you are using the multicore mode and at least one worker process crashed. I'm afraid there is not much I can do: in these cases the master R session freezes and I completely lose control. I've heard rumors that the R core team has improved forking in very recent versions of R (the fixes may still be in R-devel only), so upgrading R may eventually solve this.

Alternatively, you could try to switch to the socket backend, or, if you are really patient, wait until processx is implemented as a cluster function backend.
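
For readers who want to try the socket suggestion, here is a minimal sketch of what switching backends might look like. It assumes the `snow` package is installed (the socket cluster functions depend on it). The temporary registry, the toy `square` job and `ncpus = 2` are placeholders for illustration, not taken from this thread. Whether this avoids the freeze for a particular crash is not guaranteed.

```r
library(batchtools)

# Temporary registry for illustration only (file.dir = NA uses a temp directory).
reg <- makeRegistry(file.dir = NA, seed = 1)

# Use the socket backend instead of multicore; ncpus = 2 is an arbitrary choice.
reg$cluster.functions <- makeClusterFunctionsSocket(ncpus = 2)

# Hypothetical toy job standing in for the real algorithm.
square <- function(x) x^2
ids <- batchMap(fun = square, x = 1:4, reg = reg)

submitJobs(ids = ids, reg = reg)
waitForJobs(ids = ids, reg = reg)

# Jobs that raised an R error, and jobs that disappeared without reporting back.
errors <- findErrors(reg = reg)
expired <- findExpired(reg = reg)

# Collect the results of the jobs that finished successfully.
results <- reduceResultsList(ids = findDone(reg = reg), reg = reg)
```

For an existing registry, the same assignment to `reg$cluster.functions` should apply after `loadRegistry(..., writeable = TRUE)`, followed by `saveRegistry()` if the change is meant to persist.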

LeanderK commented 5 years ago

hmm, that's unfortunate. thanks for your work!