Open stuvet opened 2 years ago
It would be possible to just not delete the job files (and let `sweepRegistry()` handle this), or to introduce an additional option to turn this on or off. I tend to just leave the files there.
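As a sketch of that first option — leaving the files in place and cleaning up later with `sweepRegistry()` (a real `batchtools` function; the registry directory name below is a hypothetical placeholder):

```r
library(batchtools)

# "my-registry" is a placeholder for an existing registry directory.
reg <- loadRegistry("my-registry", writeable = TRUE)

# sweepRegistry() checks the registry's file directory and removes
# obsolete files, e.g. leftover job collection files.
sweepRegistry(reg)
```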
> Apart from needing to clean up the files afterwards, can you see any downsides of using `chunks.as.arrayjobs = TRUE` for single jobs too? If not, this could be a useful default setting for @HenrikBengtsson when submitting jobs from `future.batchtools`, simply to avoid triggering an unhandled error, and to allow jobs to requeue as expected (assuming backend configuration allows).
I've been working on Slurm clusters where support for array jobs is turned off, so this would be a problem.
I've been troubleshooting the stability of `batchtools` when used on Slurm with the default `makeClusterFunctionsSlurm` (PR #276 & #277). The last (rare) error I can reproduce is:

### Expected Behaviour

`batchtools` should not report an expired status -> #277.

### Problem
### Reprex

### Cause

`batchtools:::doJobCollection.character` deletes the jobCollection `.rds` file on the first run, so when the failed job gets requeued the file is no longer there, causing the error. Handling the error with an informative message would be helpful.
### Workaround

Setting `chunks.as.arrayjobs = TRUE` in the resources request prevents this error (even if jobs are submitted singly), as it prevents the first run of the job from deleting the jobCollection `.RDS`. This also works when submitting via `future.batchtools`, even though it doesn't result in array jobs.

### Questions
Apart from needing to clean up the files afterwards, can you see any downsides of using `chunks.as.arrayjobs = TRUE` for single jobs too? If not, this could be a useful default setting for @HenrikBengtsson when submitting jobs from `future.batchtools`, simply to avoid triggering an unhandled error, and to allow jobs to requeue as expected (assuming backend configuration allows).

Perhaps a more explicit option would be better - `allow.requeue` or `prevent.requeue`?
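For reference, a minimal sketch of the workaround when submitting directly with `batchtools` (the registry setup, job function, and resource values here are illustrative; `chunks.as.arrayjobs` is passed through `resources` as described above, and the actual effect depends on the Slurm template in use):

```r
library(batchtools)

# Illustrative throwaway registry; a real setup would use a persistent
# file.dir and a Slurm template configured for the cluster.
reg <- makeRegistry(file.dir = NA, make.default = FALSE)
reg$cluster.functions <- makeClusterFunctionsSlurm()

batchMap(function(x) x^2, x = 1:4, reg = reg)

# chunks.as.arrayjobs = TRUE keeps the jobCollection .rds on disk after
# the first run, so a requeued job can still find it (even when jobs are
# submitted singly rather than chunked).
submitJobs(reg = reg,
           resources = list(chunks.as.arrayjobs = TRUE,
                            walltime = 3600, memory = 1024))
```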