rwth-i6 / sisyphus

A Workflow Manager in Python
Mozilla Public License 2.0

Handling of requeueing in SLURM #201

Open dthulke opened 3 months ago

dthulke commented 3 months ago

SLURM can automatically requeue jobs (e.g. on node failure or preemption by a higher-priority job: https://slurm.schedmd.com/sbatch.html#OPT_requeue). In general, this is similar to the resume function we have in sisyphus, with the added bonus that jobs keep their priority.

If this is enabled (if you don't pass the flag to sbatch, the default is taken from slurm.conf), it causes a few issues:

  1. As the job id does not change, the log file of the previous run is overwritten (this is actually what prompted me to look into this)
  2. Non-resumable tasks are resumed

Alternatively, both issues would be fixed by always setting --no-requeue, but then we would lose the advantages for resumable jobs.
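
A minimal sketch of how the per-task choice could look, assuming the engine knows whether a task is resumable when it builds the sbatch call. The helper names `requeue_flags` and `restart_count` are hypothetical and not part of sisyphus. The sbatch options `--requeue`, `--no-requeue` and `--open-mode=append` and the `SLURM_RESTART_COUNT` environment variable are documented by SLURM; how they would be wired into the engine is an assumption.

```python
import os
from typing import List, Mapping, Optional


def requeue_flags(resumable: bool) -> List[str]:
    """Extra sbatch flags for one task (hypothetical helper, not sisyphus API)."""
    if resumable:
        # Keep SLURM's requeueing, but append to the existing log file
        # instead of truncating it when the job is restarted.
        return ["--requeue", "--open-mode=append"]
    # Non-resumable tasks should not be restarted automatically.
    return ["--no-requeue"]


def restart_count(env: Optional[Mapping[str, str]] = None) -> int:
    """Number of times SLURM restarted this job, 0 on the first run.

    SLURM only sets SLURM_RESTART_COUNT for restarted or requeued jobs,
    so a missing variable is treated as "never restarted".
    """
    env = os.environ if env is None else env
    return int(env.get("SLURM_RESTART_COUNT", "0"))
```

With this, a worker could check `restart_count() > 0` on startup and refuse to continue if the task is not resumable, in case the cluster configuration overrides the flag.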

Are there any other opinions? If not I'd create a PR for the two fixes.

JackTemaki commented 3 months ago

Your proposed options sound valid to me. For the log file I see no issues at all; the second one may need an additional look but should also be fine.

critias commented 3 months ago

The local engine already appends its log to the last log file. I think it's a good idea to have a clearly visible separation between different entries, similar to this: https://github.com/rwth-i6/sisyphus/blob/a22e9236ef2a0dcb62fc322bd012f9d0f4e95063/sisyphus/worker.py#L206

Besides that, appending to the existing log file sounds good to me.
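
As a rough illustration of the separator idea, here is a sketch that appends a banner to an existing log before a new run writes to it. The function name `append_run_separator` and the banner layout are assumptions made for this example and are not copied from the linked worker.py line.

```python
import time


def append_run_separator(log_path: str, restart_count: int) -> None:
    """Append a visible banner to log_path (hypothetical helper).

    The file is opened in append mode, so output of earlier runs is kept.
    """
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    banner = "#" * 80
    with open(log_path, "a") as log:
        log.write("\n" + banner + "\n")
        log.write(f"RUN STARTED: {stamp} (restart count: {restart_count})\n")
        log.write(banner + "\n")
```

Calling this once at the start of every run would make the boundary between the original attempt and each requeued attempt easy to find when reading the log.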