dthulke opened 3 months ago
Your proposed options sound valid to me. For the log file I see no issues at all; the second one may need a closer look but should also be fine.
The local engine already appends its log to the last log file. I think it's a good idea to have a clearly visible separation between different entries, similar to this: https://github.com/rwth-i6/sisyphus/blob/a22e9236ef2a0dcb62fc322bd012f9d0f4e95063/sisyphus/worker.py#L206
Besides that, appending to the existing log file sounds good to me.
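A minimal sketch of such a separator, assuming the worker writes a banner to its log once at start-up. The function name and banner format are hypothetical and are not the code behind worker.py#L206. `SLURM_RESTART_COUNT` is the environment variable SLURM sets inside a job that has been restarted or requeued; although the restart number is not available in the output filename pattern, the job itself can read it, so here it is used to number the runs within one appended file.

```python
import datetime
import os
import sys
from typing import Mapping, Optional


def format_run_separator(env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: banner marking the start of a new run in an appended log.

    SLURM sets SLURM_RESTART_COUNT when a job has been restarted or requeued;
    if the variable is missing, the run is treated as the first one (0).
    """
    env = os.environ if env is None else env
    restart = int(env.get("SLURM_RESTART_COUNT", "0"))
    timestamp = datetime.datetime.now().isoformat(timespec="seconds")
    rule = "#" * 20
    return f"{rule} run {restart} started at {timestamp} {rule}\n"


if __name__ == "__main__":
    # Written once when the job starts: with an appended log file,
    # every (re)run then begins with its own clearly visible header.
    sys.stdout.write(format_run_separator())
    sys.stdout.flush()
```

Passing `env` explicitly keeps the function easy to test; in a real job it would just read `os.environ`.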
SLURM can automatically requeue jobs (e.g. on node failure or on preemption by a higher-priority job: https://slurm.schedmd.com/sbatch.html#OPT_requeue). In general this is similar to the resume function we have in sisyphus, with the added bonus that jobs keep their priority.
If this is enabled (if you don't pass the flag to sbatch, the default is defined in slurm.conf), it causes a few issues:
1. Log files in `engine/`: ideally, there would be a separate log file for each run (as is the case without requeue, since the SLURM job id changes). But afaik this is not possible, because the restart number is not available in the corresponding filename pattern: https://slurm.schedmd.com/sbatch.html#SECTION_FILENAME-PATTERN
   Instead, set `--open-mode=append` (https://slurm.schedmd.com/sbatch.html#OPT_open-mode) so that the previous log is kept in the same file. <-- my preferred solution
2. Non-resumable tasks are requeued as well. Set `--no-requeue` (https://slurm.schedmd.com/sbatch.html#OPT_no-requeue) for non-resumable tasks. However, this would require passing the information whether a task is resumable to the submit call function https://github.com/rwth-i6/sisyphus/blob/a22e9236ef2a0dcb62fc322bd012f9d0f4e95063/sisyphus/engine.py#L36, which could also break custom engine implementations (but that should be an easy fix, and I only know of a single custom engine implementation, by @Zettelkasten). <-- my preferred solution

Alternatively, both issues would be fixed by always setting `--no-requeue`, but then we would lose the advantages for resumable jobs.

Are there any other opinions? If not, I'd create a PR for the two fixes.
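The two proposed fixes could look roughly like this in a SLURM engine. This is a sketch under assumptions: `slurm_sbatch_args`, `call_submit`, the `engine/<name>.%j` log path and the keyword names passed to `submit_call` are all hypothetical and not sisyphus' actual API; only the sbatch options `--job-name`, `--output`, `--open-mode=append` and `--no-requeue` are real. `call_submit` shows one possible way to avoid breaking custom engines: the new `resumable` argument is only forwarded when the engine's `submit_call` accepts it.

```python
import inspect
import os
from typing import Any, Callable, List


def slurm_sbatch_args(log_dir: str, job_name: str, resumable: bool) -> List[str]:
    """Hypothetical helper: sbatch options implementing both proposed fixes."""
    args = [
        f"--job-name={job_name}",
        # A requeued job keeps its SLURM job id, so %j resolves to the same file;
        # append mode keeps the previous run's log instead of truncating it.
        f"--output={os.path.join(log_dir, job_name + '.%j')}",
        "--open-mode=append",
    ]
    if not resumable:
        # Non-resumable tasks must not be requeued. Resumable tasks get no flag,
        # so the cluster default from slurm.conf still applies to them.
        args.append("--no-requeue")
    return args


def call_submit(submit_call: Callable[..., Any], resumable: bool, **kwargs: Any) -> Any:
    """Hypothetical shim: pass `resumable` only to engines whose submit_call accepts it,
    so existing custom engines without the new parameter keep working."""
    params = inspect.signature(submit_call).parameters.values()
    accepts_resumable = any(
        p.name == "resumable" or p.kind is inspect.Parameter.VAR_KEYWORD
        for p in params
    )
    if accepts_resumable:
        kwargs["resumable"] = resumable
    return submit_call(**kwargs)


if __name__ == "__main__":
    print(slurm_sbatch_args("engine", "train", resumable=False))
```

Leaving resumable tasks without an explicit flag, rather than forcing `--requeue`, keeps whatever requeue policy the cluster's slurm.conf defines.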