Open gtsitsiridis opened 1 week ago
I'm afraid that is correct. Pre-emption was not considered.
I wonder how this can be fixed: when pre-emption occurs, shouldn't the job be cancelled? Snakemake cannot proceed in this case anyway.
Subsequently, when the workflow is executed again, e.g. with --rerun-incomplete, the new workflow instance is unaware of any previous SLURM jobs. If a predecessor job is still running, its output files might be missing, and consequently the job is started again. This might lead to corrupted files. The new instance cannot know that it has to wait for a job to finish.
Hence the proposal: upon pre-emption, cancel the job and print a big fat info message saying that pre-emption took place, that the job has been cancelled, and that the workflow should be launched again with --rerun-incomplete. Or do you have an alternative / better suggestion?
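To make the proposal concrete, here is a rough sketch of what "cancel and inform" could look like. It is not the executor's actual code: the function name handle_preemption, the logger name, and the injectable run parameter are all hypothetical. The only real pieces are the scancel command and the PREEMPTED state name reported by SLURM.

```python
import logging
import subprocess

logger = logging.getLogger("preemption_sketch")  # hypothetical logger name


def handle_preemption(job_id, state, run=subprocess.run):
    """Cancel a pre-empted job and tell the user how to recover.

    Hypothetical helper, not part of the executor. Returns True if the
    job was cancelled, False if the state was not PREEMPTED.
    """
    # SLURM may append details to a state (e.g. "CANCELLED by 1000"),
    # so only the first word is compared.
    words = state.split()
    if not words or words[0].upper() != "PREEMPTED":
        return False

    # Cancel explicitly so that no requeued copy keeps running unnoticed.
    run(["scancel", str(job_id)], check=True)
    logger.warning(
        "SLURM job %s was pre-empted and has been cancelled. "
        "Please launch the workflow again with --rerun-incomplete.",
        job_id,
    )
    return True
```

The run parameter is only there so the sketch can be tried without a cluster, e.g. handle_preemption(42, "PREEMPTED", run=lambda cmd, **kw: print(cmd)).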
PS, out of curiosity: Why do your admins exercise pre-emption?
If there are not enough free resources, the jobs in the lower-priority partitions will be cancelled (i.e. PREEMPTED) and, by default, requeued. This executor considers preempted jobs as failed. Therefore, if multiple retries are enabled, preempted jobs will be both requeued by SLURM and resubmitted by the executor, spawning duplicate jobs.
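The duplication described above comes from treating PREEMPTED like any other failure. A minimal sketch of keeping the two apart, assuming job states are read from output such as that of sacct -X --parsable2 --noheader --format=JobIdRaw,State and that the cluster requeues preempted jobs; the names parse_sacct and next_action are hypothetical and this is not how the executor is currently written:

```python
# States that end a job unsuccessfully and are not requeued by SLURM.
FAILED_STATES = {
    "FAILED", "TIMEOUT", "NODE_FAIL", "OUT_OF_MEMORY",
    "BOOT_FAIL", "CANCELLED", "DEADLINE",
}
# Assumption: on this cluster SLURM requeues these jobs by itself.
REQUEUED_STATES = {"PREEMPTED", "REQUEUED"}


def parse_sacct(output):
    """Map job id -> state from pipe-separated 'jobid|state' lines."""
    states = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        job_id, state = line.split("|", 1)
        # Drop trailing details such as "by 1000" in "CANCELLED by 1000".
        states[job_id] = state.split()[0]
    return states


def next_action(state, retries_left):
    """Decide what to do with a job in the given state."""
    if state == "COMPLETED":
        return "done"
    if state in REQUEUED_STATES:
        # SLURM will run the job again; resubmitting would duplicate it.
        return "wait"
    if state in FAILED_STATES:
        return "resubmit" if retries_left > 0 else "fail"
    return "wait"  # PENDING, RUNNING, and other non-final states
```

With this split, a preempted job is left to SLURM's own requeue, and only genuine failures consume a retry.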