snakemake / snakemake-executor-plugin-slurm

A Snakemake executor plugin for submitting jobs to a SLURM cluster
MIT License

job submitted in login node not being cancelled if all rules fail? #127

Open ifariasg opened 3 months ago

ifariasg commented 3 months ago

I'm trying to update my snakemake workflow to v8+

My nuisance so far is the following. I submit my workflow as a job from the login node, and snakemake with the executor does its thing from within that job. The problem is that if the jobs spawned by the executor fail, the job submitted from the login node is not terminated, leaving me with the "nuisance" of having to manually "scancel" my failed snakemake job. The log does not help either; it just shows the submission of jobs, so I'm guessing snakemake is not detecting the failed downstream jobs.
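
For context, that submission is essentially a thin sbatch wrapper around the snakemake call, roughly like this (a simplified sketch, not my actual script; partition, environment, and target names are placeholders):

#!/bin/bash
#SBATCH --job-name=smk-master
#SBATCH --partition=<partition>       # placeholder
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=1

# environment that provides snakemake v8 and the slurm executor plugin
conda activate <env>                  # placeholder

snakemake <target> --executor slurm --jobs 12 --verbose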

Is this intended behavior? Am I missing an option that needs to be activated with the executor?

Previously, when I submitted the job from the login node via sbatch and ran snakemake v7 with the old slurm profile, the job on the login node would terminate immediately after a job failed (as intended, since the workflow could not continue once those jobs had failed).

cmeesters commented 3 months ago

This is absolutely not intended behaviour. However, Snakemake will keep running until the last job has finished (well or otherwise), to enable a proper restart without losing too many resources. Cancelling the workflow itself will cancel all jobs. Cancelling or finishing will take time, though: Snakemake needs to check files locally, and we intentionally do not check the job status every second (more configurability will be implemented soon-ish).

That being said: Can you please run your workflow with --verbose and attach the log? Please indicate the Snakemake and plugin versions. Thank you.

ifariasg commented 3 months ago

I attached the logs for the job on the login node and for the submitted job that fails. Adding --verbose did not change what is captured in the logs. Attachments: failed_job.log, master_job.log

cmeesters commented 3 months ago

Interesting workflow. If it is not already there, I hope you will include it in the workflow catalog.

Anyway, I meant that a workflow execution with failing jobs should be captured with --verbose. It does indeed look like the verbose flag is not in effect. Weird. What is your command line?

ifariasg commented 3 months ago

> Interesting workflow. If it is not already there, I hope you will include it in the workflow catalog.
>
> Anyway, I meant that a workflow execution with failing jobs should be captured with --verbose. It does indeed look like the verbose flag is not in effect. Weird. What is your command line?

I've been using the following to run my workflow:

$ snakemake build_run_analyze --cores 512 --jobs 12 --configfile config/config.yaml --snakefile Snakefile_p.smk --executor slurm --verbose

I also added the job script itself in case it helps shed some light. Attachment: lhm_50_test.txt

cmeesters commented 3 months ago

Ah, running within a job context might cause some issues. I have an idea of how to work on it, but could you try running your snakemake command directly on the login node?

ifariasg commented 3 months ago

> Ah, running within a job context might cause some issues. I have an idea of how to work on it, but could you try running your snakemake command directly on the login node?

Just did that, and the result is exactly the same: the log output does not change at all and I have to manually terminate the snakemake run. One thing I noticed is that in ALL of my snakemake runs the job is submitted with the same ID according to snakemake; it is ALWAYS jobid 128:

Job 6ba15ce9-7f9c-5763-a0e2-85de61553374 has been submitted with SLURM jobid 128 (log: bla bla bla/128.log).

But in reality all jobs have different job IDs, as I would expect.

cmeesters commented 3 months ago

That is another strange observation. Note that the job name is constant, as it is used to keep track of the jobs within one workflow run; when another workflow is started, it is different. The SLURM job IDs, however, ought to be incremented by SLURM itself. If this is not the case, please run:

$ sinfo --version
$ sacct -u $USER
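
For instance, to see the SLURM job IDs next to the (constant) Snakemake-generated job names, a query along these lines should work (standard sacct format options):

$ sacct -u $USER --format=JobID,JobName%40,State
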
ifariasg commented 3 months ago

> $ sacct -u $USER

The job IDs are indeed incremented by slurm correctly; I just pointed out that the log does not show these same job IDs. Regarding the job names, I can only say they do change from one workflow run to another, but I can't confirm whether they remain constant within a run, as my workflow currently fails at the first job submission.

Either way, this is the output of the commands, even though it seems unrelated? (screenshot attached)

cmeesters commented 3 months ago

That the job(s) are failing is one issue, explainable by the error message numba.core.errors.TypingError: Use of unsupported NumPy function 'numpy.empty' or unsupported use of the function. That Snakemake continues to run as the shepherd for still-running jobs is explainable as well (I already tried): it will try to finish as much as possible.

However, if I understood correctly, ending Snakemake with CTRL+C or sending SIGINT did not cancel the jobs, and you have to do it manually? And the second Snakemake-related issue is that only one constant SLURM job ID gets reported.

Well, your logs, which were from within a SLURM job, do not indicate the submission of multiple jobs, but of only one job.

Do you have a public repo and test data to try?

ifariasg commented 3 months ago

> That the job(s) are failing is one issue, explainable by the error message numba.core.errors.TypingError: Use of unsupported NumPy function 'numpy.empty' or unsupported use of the function. That Snakemake continues to run as the shepherd for still-running jobs is explainable as well (I already tried): it will try to finish as much as possible.
>
> However, if I understood correctly, ending Snakemake with CTRL+C or sending SIGINT did not cancel the jobs, and you have to do it manually? And the second Snakemake-related issue is that only one constant SLURM job ID gets reported.
>
> Well, your logs, which were from within a SLURM job, do not indicate the submission of multiple jobs, but of only one job.
>
> Do you have a public repo and test data to try?

I think some things might not be clear from my explanation.

Just to give you some background, I'm testing migrating my workflow from snakemake 7 to 8 and updating a lot of dependencies in the process (hence the np.empty error; this is not the issue I'm talking about). My issue is just about the "nuisance" of having to manually terminate snakemake, either in a job context (using scancel) or directly in a shell (using ctrl+c), whereas with the "old" slurm profile, whenever a job failed, snakemake would terminate immediately.

As a secondary finding, I reported that when using the plugin, the log always shows that jobs are submitted with "jobID" 128, while the jobs actually have other job IDs given by slurm. I have no idea if this is a problem; I just pointed it out.

For reproducibility, I created a very simple workflow:

rule test:
  input:
    "script.py",
    "data.txt"
  output:
    "out.txt"
  script:
    "script.py"

With a script that fails on purpose:

import numpy as np

test = np.empty(0,0)
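
To reproduce, the input files just need to exist; an empty data.txt is enough:

$ touch data.txt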

And I run it with the following command line:

$ snakemake --executor slurm --jobs 1 --verbose

The python script fails immediately, but snakemake carries on. This time, though, the --verbose flag is producing output (attached: 2024-08-09T110810.796577.snakemake.log). I would expect snakemake to terminate as soon as rule test failed, but that is not happening, even though, according to the log, it detects the submitted job as failed. I terminated snakemake manually with ctrl+c in this case.

I hope this is clearer and allows you to better track the issue (if there is one)!

cmeesters commented 3 months ago

> My issue is just about the "nuisance" of having to manually terminate snakemake, either in a job context (using scancel) or directly in a shell (using ctrl+c), whereas with the "old" slurm profile, whenever a job failed, snakemake would terminate immediately.

When working with the profile, SLURM support was not an integral part of Snakemake; Snakemake had no idea of what was going on. Now, the plugin reports failed jobs immediately when the status is known (to avoid straining the scheduler DB we query at certain intervals, so there is no immediate feedback; otherwise a number of admins would, with good reason, not allow Snakemake to run on "their" cluster). It continues working because still-running jobs might create valuable results. And a user can always restart a workflow to complete them. If you, however, are observing the run and want to fix things during development, you need to signal Snakemake (CTRL+C or otherwise). Failing upon the first error is a feature I could implement. It will not be the default.
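
Conceptually, that status check is nothing more than a periodic query of the accounting database, along these lines (an illustration of the idea only, not the plugin's actual code; the job ID variable and the interval are placeholders):

# illustration only - deliberately not polling every second
while sleep 60; do
    sacct -X -n -j "$jobid" --format=State
done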

Regarding the job ID: your master log only shows one submitted job. Does it really look like this when run on the login node?

ifariasg commented 3 months ago

> > My issue is just about the "nuisance" of having to manually terminate snakemake, either in a job context (using scancel) or directly in a shell (using ctrl+c), whereas with the "old" slurm profile, whenever a job failed, snakemake would terminate immediately.
>
> When working with the profile, SLURM support was not an integral part of Snakemake; Snakemake had no idea of what was going on. Now, the plugin reports failed jobs immediately when the status is known (to avoid straining the scheduler DB we query at certain intervals, so there is no immediate feedback; otherwise a number of admins would, with good reason, not allow Snakemake to run on "their" cluster). It continues working because still-running jobs might create valuable results. And a user can always restart a workflow to complete them. If you, however, are observing the run and want to fix things during development, you need to signal Snakemake (CTRL+C or otherwise). Failing upon the first error is a feature I could implement. It will not be the default.
>
> Regarding the job ID: your master log only shows one submitted job. Does it really look like this when run on the login node?

Ok, so according to your answer it is indeed intended behavior then. As I mentioned before, I opened this issue because in snakemake v7 the default behavior is to terminate snakemake upon job failure UNLESS you use the --keep-going flag. Even then, it will wait for all possible jobs to finish and afterwards terminate snakemake if there are no more jobs in the pipeline.

I still feel it does not make sense: if I run the test workflow without --executor (snakemake --jobs 1 --verbose), snakemake terminates immediately after "rule test" fails. In your answer you say:

> It continues working because still-running jobs might create valuable results.

but there are NO other downstream/parallel jobs running, yet snakemake keeps "waiting" for something that will never happen. And again, we have the --keep-going flag for that.
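
For reference, that would just mean adding the flag to the same command as before, e.g.:

$ snakemake build_run_analyze --cores 512 --jobs 12 --configfile config/config.yaml --snakefile Snakefile_p.smk --executor slurm --keep-going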

Regarding the log in my last answer: that was a run on the login node via the shell, and it only shows one job because there is only one job being submitted.

cmeesters commented 2 months ago

I checked: --keep-going will run the workflow (and continue to submit jobs), like it would with local execution, until Snakemake encounters a job with missing input. Hence, a significantly higher portion of the workflow can be executed. But no, there is no cancellation of the entire workflow.

For this, I can implement a new (boolean) flag. It might be handy for development purposes.

ifariasg commented 2 months ago

> I checked: --keep-going will run the workflow (and continue to submit jobs), like it would with local execution, until Snakemake encounters a job with missing input. Hence, a significantly higher portion of the workflow can be executed. But no, there is no cancellation of the entire workflow.
>
> For this, I can implement a new (boolean) flag. It might be handy for development purposes.

Having a boolean flag would be really useful! It would at least allow recovering the previous behavior from snakemake 7, which in my opinion is indeed much better for development/debugging.

cmeesters commented 2 months ago

I made a PR, please check it (#139).