Handling of requeueing in SLURM

dthulke commented 3 months ago

SLURM can automatically requeue jobs (e.g. on node failure or preemption by a higher-priority job: https://slurm.schedmd.com/sbatch.html#OPT_requeue). In general this is similar to the resume function we have in sisyphus, with the added bonus that jobs keep their priority.

If this is enabled (note that if you don't pass the flag to sbatch, the default is taken from slurm.conf), it causes a few issues:

1. As the job id does not change, the log file of the previous run is overwritten (this is what actually triggered me to look into this)
   - The nicest option would be to create separate files under engine/ for each run (the same behaviour as without requeue, where the SLURM job id changes). But afaik this is not possible, as the restart number is not available in the corresponding filename pattern: https://slurm.schedmd.com/sbatch.html#SECTION_FILENAME-PATTERN
   - Set --open-mode=append https://slurm.schedmd.com/sbatch.html#OPT_open-mode so that the output of the previous run is kept in the same file <-- my preferred solution
2. Non-resumable tasks are resumed
   - This would be easy to fix by always setting --no-requeue (https://slurm.schedmd.com/sbatch.html#OPT_no-requeue) for non-resumable tasks. But this would require passing the information whether a task is resumable to the submit call function https://github.com/rwth-i6/sisyphus/blob/a22e9236ef2a0dcb62fc322bd012f9d0f4e95063/sisyphus/engine.py#L36, which could also break custom engine implementations (but that should be an easy fix, and I only know of a single custom engine implementation, by @Zettelkasten). <-- my preferred solution

Alternatively, both issues would be fixed by always setting --no-requeue, but then we would lose the advantages for resumable jobs.
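
A rough sketch of how the two preferred fixes could fit together when assembling the sbatch call. The helper `build_sbatch_args` and its `resumable` parameter are hypothetical names for illustration, not existing sisyphus code; only the sbatch options themselves (`--output`, `--open-mode=append`, `--no-requeue`) come from the SLURM docs linked above.

```python
from typing import List


def build_sbatch_args(log_path: str, resumable: bool) -> List[str]:
    """Hypothetical helper (not part of sisyphus): assemble sbatch options.

    Assumes the engine knows at submit time whether the task is resumable.
    """
    args = [
        "sbatch",
        f"--output={log_path}",
        # Keep the output of earlier runs when the job is requeued.
        "--open-mode=append",
    ]
    if not resumable:
        # Non-resumable tasks must never be restarted automatically.
        args.append("--no-requeue")
    return args


if __name__ == "__main__":
    print(build_sbatch_args("engine/example.log", resumable=False))
```

Resumable tasks get no requeue flag here, so they would keep following the cluster default from slurm.conf.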

Are there any other opinions? If not I'd create a PR for the two fixes.

JackTemaki commented 3 months ago

Your proposed options sound valid to me. For the log file I see no issues at all; the second one may need an additional look but should also be fine.

critias commented 3 months ago

The local engine already appends its log to the last log file. I think it's a good idea to have a clearly visible separation between different entries, similar to this: https://github.com/rwth-i6/sisyphus/blob/a22e9236ef2a0dcb62fc322bd012f9d0f4e95063/sisyphus/worker.py#L206

Besides that, appending to the existing log file sounds good to me.
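
One way such a separator could look for the SLURM case, as a sketch only: the function `run_separator` and its line format are made up here and are not what worker.py does. It assumes the job can read `SLURM_JOB_ID` and `SLURM_RESTART_COUNT` from its environment; as far as I understand, SLURM only sets the latter for jobs that have been restarted, so it falls back to 0.

```python
import os
import time
from typing import Mapping, Optional


def run_separator(env: Optional[Mapping[str, str]] = None,
                  now: Optional[float] = None) -> str:
    """Hypothetical helper: build a visible divider line for an appended log."""
    env = os.environ if env is None else env
    job_id = env.get("SLURM_JOB_ID", "unknown")
    # Assumed to be unset on the first run, hence the default of "0".
    restart = env.get("SLURM_RESTART_COUNT", "0")
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    bar = "=" * 20
    return f"{bar} run start: {stamp} job={job_id} restart={restart} {bar}"


if __name__ == "__main__":
    # Written at the start of each run, before any other job output.
    print(run_separator(), flush=True)
```

Printing this line first in every run would make the boundary between the original run and each requeued run easy to find in the appended file.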

rwth-i6 / sisyphus

Handling of requeueing in SLURM #201