snakemake / snakemake-executor-plugin-slurm

A Snakemake executor plugin for submitting jobs to a SLURM cluster
MIT License

Preempted jobs are spawned multiple times #109

Closed: gtsitsiridis closed this issue 2 months ago

gtsitsiridis commented 4 months ago

If there are not enough free resources, jobs in lower-priority partitions are cancelled (i.e. PREEMPTED) and, by default, requeued. This executor considers preempted jobs as failed. Therefore, if multiple retries are enabled, preempted jobs will be both requeued and resubmitted, spawning duplicate jobs.

cmeesters commented 4 months ago

I'm afraid that is correct. Pre-emption was not considered.

I wonder how this can be fixed: when pre-emption occurs, should the job not simply be cancelled? Snakemake cannot proceed in this case anyway.

Subsequently, when the workflow is executed again, e.g. with --rerun-incomplete, the new workflow instance is unaware of any previous SLURM jobs. If a predecessor job is still running, output files might be missing and consequently the job is started again. This might lead to corrupted files. The current instance cannot know that it has to wait for a job to finish.

Hence, the proposal to implement this: upon pre-emption, cancel the job, emit a big, fat info message stating that pre-emption took place and that a job has been cancelled, and recommend launching again with --rerun-incomplete. Or do you have an alternative / better suggestion?

cmeesters commented 4 months ago

PS, out of curiosity: Why do your admins exercise pre-emption?

tbigot commented 3 months ago

Hi, I'm encountering a similar problem with the cluster at my institution (Institut Pasteur). Here, we can use other teams' machines by specifying a particular partition. Jobs launched on these partitions have a higher chance of being executed sooner, thus finishing faster overall. However, if the machines' owners submit a job, our jobs are preempted with automatic requeuing, which means they will be relaunched on any machine with the required resources available. Essentially, it's as if the jobs were simply waiting during this short lapse of time.

With the current behavior of snakemake-executor-plugin-slurm, such a job is considered failed and requires a retry to be re-submitted by Snakemake. I would really like the plugin to provide an option to not consider a job as failed when it is preempted, but rather as waiting, running, or pending, because in our case it will eventually be executed. Using Snakemake's retry strategy would require deactivating SLURM's auto-requeue of preempted jobs (this seems to be what @gtsitsiridis reports here; it could be solved by adding --no-requeue to the sbatch command), but to me this is not very relevant, for these reasons:

I wrote a similar explanation a few years ago on this StackOverflow issue: Snakemake: Job preemption can interrupt running jobs on clusters, how to make sure that the task is not considered as failed?. Please note that the bug related to IncompleteFilesException has since been resolved.

In my opinion, to address this, this line here could be conditioned on a plugin parameter (something like preemption_is_not_failed).
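
A minimal, purely illustrative sketch of that idea, written as a fragment of the status-handling branch: the flag name preemption_is_not_failed, the settings access path, and the surrounding loop are hypothetical, not the plugin's actual code.

    # Illustrative sketch only: gate the handling of preempted jobs on a
    # hypothetical executor setting `preemption_is_not_failed`.
    elif status == "PREEMPTED":
        if self.workflow.executor_settings.preemption_is_not_failed:
            # keep polling the job; SLURM will requeue and eventually run it
            yield active_job
        else:
            # previous behaviour: treat preemption as a failure, let Snakemake retry
            self.report_job_error(active_job, msg="SLURM job was preempted.")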

cmeesters commented 3 months ago

Sorry, I just did not find the time sooner: please try PR #132. We do not exercise pre-emption on our cluster, so I cannot really test it.

My idea is to issue a big, fat warning and ask users to keep Snakemake running, if possible. Resuming and cancelling should work as intended.

tbigot commented 3 months ago

I will test it soon, thank you very much for this quick fix, but I would like to draw your attention to the fact that not all SLURM clusters are configured to automatically requeue preempted jobs. This depends on the value of JobRequeue in the SLURM configuration. The behavior can also be overridden per job by specifying --no-requeue or, conversely, --requeue.

In the case where JobRequeue=0, or where the user has specified --no-requeue, the job will remain permanently in the PREEMPTED state, which will lead Snakemake to wait forever, considering that the jobs are still running. The message "Leave Snakemake running, if possible" then makes less sense, since the other jobs will continue to execute, but you will need to restart Snakemake for the preempted jobs to be executed in a new attempt.
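
For reference, the cluster-wide default can be read from the SLURM configuration; here is a small sketch, assuming scontrol is on PATH and that its output contains a line such as "JobRequeue = 1" (a per-job --requeue / --no-requeue still overrides this):

    # Sketch: check whether the cluster requeues jobs by default by reading
    # the JobRequeue parameter from `scontrol show config`.
    import subprocess

    def slurm_requeues_by_default() -> bool:
        out = subprocess.run(
            ["scontrol", "show", "config"],
            capture_output=True, text=True, check=True,
        ).stdout
        for line in out.splitlines():
            if line.strip().startswith("JobRequeue"):
                return line.split("=", 1)[1].strip() == "1"
        return False  # be conservative if the parameter is not reported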

cmeesters commented 3 months ago

Absolutely, but I'd rather let a new colleague of mine add a new explicit flag in a separate PR (it would be a separate PR anyway, as this is something semantically different).

tbigot commented 3 months ago

The PR works fine, thanks! One possible improvement: indicating the Snakemake-internal job number and the SLURM job ID in the warning message would help to track the preempted job.

cmeesters commented 3 months ago

Hm, I figured: if pre-emption occurs, it typically affects all currently running jobs. That can be a mouthful! Moreover, it might happen more than once. Hence, I implemented a "report once, report generic" policy, thinking that otherwise other complaints would pile up. sacct / squeue will still give you the job IDs.

Any thoughts on this?

tbigot commented 2 months ago

Actually, I am encountering a problem with this fix: the preempted job is no longer considered a failure and the execution continues, but Snakemake never detects that this job has completed.

I believe this is because, once the job is preempted, it is no longer considered "still running": it is not yielded as an active job at the end of the check_active_jobs function.
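
In other words, the status check needs to keep yielding preempted jobs so the executor keeps polling them. A simplified sketch of the idea, with illustrative names (query_slurm_status and the exact state lists are placeholders, not the plugin's actual code):

    # Simplified sketch of check_active_jobs: jobs whose SLURM state is
    # PREEMPTED must still be yielded, otherwise the executor stops polling
    # them and never notices when SLURM requeues and finishes them.
    async def check_active_jobs(self, active_jobs):
        for job in active_jobs:
            status = self.query_slurm_status(job)  # illustrative helper
            if status == "COMPLETED":
                self.report_job_success(job)
            elif status in ("FAILED", "TIMEOUT", "OUT_OF_MEMORY", "CANCELLED"):
                self.report_job_error(job)
            else:
                # RUNNING, PENDING, and crucially PREEMPTED: keep watching.
                yield job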

cmeesters commented 2 months ago

Hi,

You are absolutely right, I forgot that one line. I have just released a hot-fix; it should fix it.

tbigot commented 2 months ago

Thanks a lot. I did not test your release, but adding the yield on my side solved the problem.

A question remains: what about indicating the job ID concerned by the preemption in the warning? If we're going to display a warning message, we might as well make sure it indicates the affected job ID.

I propose something like

                elif status == "PREEMPTED":
                    self.logger.warning(f"SLURM job {j.external_jobid} was preempted.")
                    if not self._preemption_warning:
                        self._preemption_warning = True
                        self.logger.warning(
                            """
===== A job preemption occurred! =====
Leave Snakemake running, if possible. Otherwise Snakemake
needs to restart this job upon a Snakemake restart.

We leave it to SLURM to resume your job(s)."""
                        )

cmeesters commented 2 months ago

I figured: pre-emption can affect multiple jobs and may occur multiple times. If one particular ID is reported once, there is little sense in reporting it at all, given that others might not be reported at all.

What I want to add as a feature, though, is the ability to indicate the previous job name in case Snakemake crashes. Then the job query would be identical for the old and the new instance of Snakemake.
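
One way such a feature could work is to submit all jobs of a run under a known SLURM job name and let a restarted instance query by that name. A hedged sketch of the query side, assuming sacct is available; the job name and the helper itself are hypothetical, not the eventual feature:

    # Sketch: find SLURM jobs submitted under a known job name, so that a
    # restarted Snakemake instance could discover jobs of the previous run.
    import subprocess

    def jobs_with_name(job_name: str) -> dict[str, str]:
        """Return a mapping of SLURM job ID -> state for the given job name."""
        out = subprocess.run(
            ["sacct", "--noheader", "--parsable2",
             f"--name={job_name}", "--format=JobID,State"],
            capture_output=True, text=True, check=True,
        ).stdout
        states = {}
        for line in out.splitlines():
            jobid, state = line.split("|", 1)
            if "." not in jobid:  # skip job steps such as 123.batch
                states[jobid] = state
        return states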

To you too I will say: I am happy to consider PRs, but any contributions of my own will have to wait until mid-October, I'm afraid.

hudja commented 2 months ago

Hello, I have a similar problem here. Some of my jobs fail with NODE_FAIL status; these jobs are automatically requeued by SLURM with the same SLURM ID and then finish successfully. However, Snakemake does not detect the requeue and marks these jobs as failed, so I need to restart the whole workflow. If I use retries: 1 in Snakemake, I get conflicting output files from the requeued jobs. If I disable requeueing in SLURM with --no-requeue, NODE_FAIL jobs won't be rerun at all. Is it possible to overcome this problem somehow?
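
Not a fix, but for diagnosing this case one could check whether SLURM has requeued a job by reading its restart counter. A small sketch, assuming the job is still known to the controller and that scontrol show job reports a Restarts= field (output details may vary between SLURM versions):

    # Sketch: detect whether a SLURM job has been requeued (e.g. after
    # NODE_FAIL) by reading its restart counter from `scontrol show job`.
    import re
    import subprocess

    def restart_count(jobid: str) -> int:
        out = subprocess.run(
            ["scontrol", "show", "job", jobid],
            capture_output=True, text=True, check=True,
        ).stdout
        match = re.search(r"Restarts=(\d+)", out)
        return int(match.group(1)) if match else 0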

cmeesters commented 2 months ago

any contributions of my own will have to wait until mid-October, I'm afraid.

cmeesters commented 1 month ago

The updated documentation explains the new flags in detail.