snakemake / snakemake-executor-plugin-slurm

A Snakemake executor plugin for submitting jobs to a SLURM cluster
MIT License

prefix jobname with rule name #17

Closed cmeesters closed 8 months ago

cmeesters commented 8 months ago

Currently, the jobname is the run_uuid - some computer-readable gibberish. It is proposed to prefix it with the rule name for better readability during runs and workflow debugging on a cluster.

fgvieira commented 8 months ago

As far as I can see, using the same name for all jobs of a workflow has the advantage of reducing the number of queries to slurm, since we can get all submitted/running jobs of that workflow with a single query (e.g. sacct -X --parsable2 --noheader --format=JobIdRaw,State --name 38917ff1-c2bf-428a-b85b-1accc2f4846f).

If we prefix the job name with the rule name, then I think we'd have to go back and query each job individually by the job id, no?
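For illustration, this single-query status check could be sketched roughly as follows (a minimal sketch, not the plugin's actual implementation; `run_uuid` stands for the shared job name, and the sacct flags are the ones quoted above):

```python
import subprocess


def query_workflow_jobs(run_uuid):
    """Query SLURM once for all jobs sharing the workflow's job name."""
    out = subprocess.run(
        ["sacct", "-X", "--parsable2", "--noheader",
         "--format=JobIdRaw,State", "--name", run_uuid],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sacct_output(out)


def parse_sacct_output(text):
    """Map each job id to its state from 'JobIdRaw|State' lines."""
    states = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        job_id, state = line.split("|", 1)
        states[job_id] = state
    return states
```

With per-rule job names, the `--name` filter would no longer match all jobs of the workflow, so each job would have to be queried by its id instead.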

cmeesters commented 8 months ago

You are right; my idea was misguided, and your PR will be the remedy to display human-readable output. I stand corrected. Thanks for pointing this out.

blaiseli commented 1 month ago

I would like to point out that on the cluster where I work, job comments are not stored after the job ends, so it is not easy to work with sacct or reportseff (https://github.com/troycomi/reportseff).

Prefixing the jobname with the rule name would save me a lot of time when developing my workflows. Would it be possible to have this as an option?

In contrast, I have so far had no use for the following:

we can get all submitted/running jobs of that workflow with a single query

fgvieira commented 1 month ago

Getting all submitted/running jobs of the workflow with a single query is what snakemake does when checking job status. Instead of querying sacct once per job, it queries once for all jobs (since they all share the same job name). This way we avoid overloading sacct.

blaiseli commented 1 month ago

Ah, I understand now. Thanks. This is good from the point of view of how snakemake interacts with slurm, but this is not convenient for "forensics".

cmeesters commented 1 month ago

For every failed job, Snakemake gives you information, including the log files (the Snakemake log files are informative, whereas the SLURM log files just contain the terminal output of that one job).

If the program in question does not give you any hints when it fails, or if you wrote a shell script, you can always add more verbose output (e.g. by printing variables). When using the script directive, you can run the rule in question with --debug for Python.

Snakemake itself will print the SLURM_JOB_ID when a job fails.

blaiseli commented 1 month ago

One of my goals is to better adjust the resources requested by each rule.

I used to do that using the benchmark directive and then parsing the generated files, but reportseff, recommended by my cluster admins, turned out to be much more practical for my needs. The only problem is that it doesn't have access to the comment associated with the jobs, so it is not easy to figure out which rule a given resource-usage report belongs to.

Currently, my solution is to use a wrapper around reportseff that parses the .snakemake/slurm_logs/<rulename> folder to extract the jobids from the log file names, and then uses those jobids with reportseff to get only the reports for a given rule.
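Such a wrapper could be sketched like this (assumptions, not the author's actual script: the per-rule log directory is `.snakemake/slurm_logs/<rulename>/`, and each log file name contains the numeric SLURM job id; adjust the regex to your actual file naming):

```python
import re
import subprocess
from pathlib import Path

# Assumption: the first run of digits in a log file name is the SLURM job id.
JOBID_RE = re.compile(r"(\d+)")


def jobids_for_rule(rulename, logdir=".snakemake/slurm_logs"):
    """Collect SLURM job ids from the log file names of one rule."""
    ids = []
    for log in Path(logdir, rulename).rglob("*.log"):
        m = JOBID_RE.search(log.stem)
        if m:
            ids.append(m.group(1))
    return ids


def reportseff_for_rule(rulename):
    """Run reportseff restricted to the jobs of one rule."""
    ids = jobids_for_rule(rulename)
    if not ids:
        return ""
    return subprocess.run(
        ["reportseff", *ids], capture_output=True, text=True, check=True,
    ).stdout
```

This keeps reportseff's output, but filtered to a single rule, which is the "forensics" use case discussed above.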