mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

Job requeue on Slurm - 'no such file or directory' & workaround #280

Open stuvet opened 2 years ago

stuvet commented 2 years ago

I've been troubleshooting the stability of batchtools when used on Slurm with the default makeClusterFunctionsSlurm (PRs #276 and #277).

The last (rare) error I can reproduce is:

Expected Behaviour

Problem

Reprex

Error in gzfile(file, "rb") : cannot open the connection
Calls: <Anonymous> -> doJobCollection.character -> readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '.../jobs/job929872958e6074e5662a4c9hd3f312f4.rds', probable reason 'No such file or directory'

Cause

Workaround

Questions

mllg commented 2 years ago

It would be possible to simply not delete the job files (and let sweepRegistry() handle the cleanup), or to introduce an additional option to turn the deletion on or off. I tend towards just leaving the files there.

> * Apart from needing to clean up the files afterwards, can you see any downsides of using `chunks.as.arrayjobs = TRUE` for single jobs too? If not, this could be a useful default setting for @HenrikBengtsson when submitting jobs from `future.batchtools`, simply to avoid triggering an unhandled error, and to allow jobs to requeue as expected (assuming backend configuration allows).

I've been working on Slurm clusters where support for array jobs is turned off, so this would be a problem there.
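
For readers following along, here is a minimal sketch of the two ideas discussed above: passing `chunks.as.arrayjobs = TRUE` as a resource to `submitJobs()`, and cleaning up leftover files afterwards with `sweepRegistry()`. It is an illustration under assumptions, not a tested fix for the requeue error. The temporary registry, the toy function, and the interactive backend are stand-ins chosen so the snippet can run without a scheduler; on a real cluster the backend would presumably be `makeClusterFunctionsSlurm()` with a site-specific template.

```r
library(batchtools)

# Temporary registry (file.dir = NA), used here only for illustration.
reg <- makeRegistry(file.dir = NA, seed = 1)

# Assumption: an interactive backend stands in for Slurm so the sketch runs
# locally. This backend does not support array jobs, so the resource below
# has no effect here; it only shows where the setting would go.
reg$cluster.functions <- makeClusterFunctionsInteractive()

# Three toy jobs (hypothetical workload).
ids <- batchMap(function(x) x^2, x = 1:3, reg = reg)

# Request array-job submission via the resources list.
submitJobs(ids, resources = list(chunks.as.arrayjobs = TRUE), reg = reg)
waitForJobs(reg = reg)

# Collect results before tidying up.
results <- reduceResultsList(reg = reg)

# Remove obsolete files (such as leftover job files) from the registry.
sweepRegistry(reg = reg)
```

Whether the array-job setting is honoured depends on the cluster functions and on the scheduler configuration, which is the concern raised in the comment above about clusters with array jobs disabled.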