snakemake / snakemake-executor-plugin-slurm

A Snakemake executor plugin for submitting jobs to a SLURM cluster
MIT License

Preempted jobs are spawned multiple times #109

Open gtsitsiridis opened 1 week ago

gtsitsiridis commented 1 week ago

If there are not enough free resources, jobs in lower-priority partitions are cancelled (i.e. PREEMPTED) and, by default, requeued by SLURM. This executor considers preempted jobs as failed. Therefore, if multiple retries are enabled, preempted jobs are both requeued by SLURM and resubmitted by the executor, spawning duplicate jobs.

cmeesters commented 1 week ago

I'm afraid that is correct. Pre-emption was not considered.

I wonder how this can be fixed: when pre-emption occurs, should the job not be cancelled? Snakemake cannot proceed in this case anyway.

Subsequently, when the workflow is executed again, e.g. with --rerun-incomplete, the new workflow instance is unaware of any previous SLURM jobs. If a predecessor job is still running, output files might be missing and consequently the job is started again. This might lead to corrupted files. The current instance cannot know that it has to wait for a job to finish.

Hence, the proposal to implement this: Upon pre-emption, cancel the job and trigger a big fat info that pre-emption took place, that a job has been cancelled and to recommend launching again with --rerun-incomplete. Or do you have an alternative / better suggestion?
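To make the proposal concrete, here is a minimal sketch of how it could look, not the plugin's actual code. The names `classify_state`, `scancel_command` and `handle_preemption` are hypothetical, as is the set of states treated as failures; the only assumptions about SLURM are that the reported state string starts with the state name (e.g. `PREEMPTED`, `CANCELLED by 1234`) and that `scancel <jobid>` cancels a job.

```python
import logging
import subprocess

logger = logging.getLogger(__name__)

# Assumption: states that should count as a plain failure.
# PREEMPTED is deliberately not in this set; it gets its own handling.
FAILED_STATES = {"FAILED", "TIMEOUT", "OUT_OF_MEMORY", "NODE_FAIL", "CANCELLED"}


def classify_state(state: str) -> str:
    """Map a SLURM job state string to a (hypothetical) action label."""
    parts = state.split()
    # Keep only the first token, so "CANCELLED by 1234" becomes "CANCELLED".
    base = parts[0].upper() if parts else ""
    if base == "COMPLETED":
        return "success"
    if base == "PREEMPTED":
        return "cancel_and_warn"
    if base in FAILED_STATES:
        return "fail"
    # Anything else (PENDING, RUNNING, unknown) is treated as still in flight.
    return "wait"


def scancel_command(job_id) -> list:
    """Build the scancel call that stops SLURM from requeueing the job."""
    return ["scancel", str(job_id)]


def handle_preemption(job_id, run=subprocess.run) -> None:
    """Cancel a preempted job and emit the 'big fat info'.

    `run` is injectable so the sketch can be exercised without a cluster.
    """
    run(scancel_command(job_id), check=False)
    logger.warning(
        "SLURM job %s was preempted and has been cancelled. "
        "Please launch the workflow again with --rerun-incomplete.",
        job_id,
    )
```

The point of the separate `cancel_and_warn` label is that a preempted job is neither retried by the executor nor left to be requeued by SLURM, so only one of the two mechanisms can ever restart it.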

cmeesters commented 1 week ago

PS, out of curiosity: Why do your admins exercise pre-emption?