Closed by cmeesters 8 months ago
As far as I can see, using the same name for all jobs of a workflow has the advantage of reducing the number of queries to SLURM, since we can get all submitted/running jobs of that workflow with a single query (e.g. `sacct -X --parsable2 --noheader --format=JobIdRaw,State --name 38917ff1-c2bf-428a-b85b-1accc2f4846f`).
If we prefix the job name with the rule name, then I think we'd have to go back and query each job individually by its job id, no?
No, you are right; my idea is nonsense, and your PR will be the remedy, displaying human-readable output. I stand corrected. Thanks for pointing this out.
I would like to point out that on the cluster where I work, comments are not stored after a job ends, so it is not easy to work with `sacct` or reportseff (https://github.com/troycomi/reportseff).
Prefixing the job name with the rule name would save me a lot of time when developing my workflows. Would it be possible to have this as an option?
On the contrary: so far, I've had no use for the following:

> we can get all submitted/running jobs of that workflow with a single query
Getting all submitted/running jobs of that workflow with a single query is used by snakemake when checking job status. Instead of querying `sacct` once per job, it does it once for all jobs (since all have the same job name). This way we avoid overloading `sacct`.
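To make that batching concrete, here is a minimal sketch of how a single `sacct --parsable2` call can answer a status check for every job at once. The function name is illustrative, not snakemake's actual implementation; only the `sacct` invocation itself comes from the discussion above.

```python
# Sketch: parse one `sacct -X --parsable2 --noheader --format=JobIdRaw,State`
# output into a {job_id: state} mapping, so a whole workflow's status check
# needs only a single query. parse_sacct_output is a hypothetical helper.

def parse_sacct_output(text: str) -> dict[str, str]:
    """Turn pipe-separated sacct lines into job_id -> state."""
    states = {}
    for line in text.strip().splitlines():
        job_id, _, state = line.partition("|")
        # States may carry suffixes such as "CANCELLED by 123";
        # keep only the first word.
        states[job_id] = state.split()[0]
    return states


if __name__ == "__main__":
    sample = "1001|RUNNING\n1002|COMPLETED\n1003|CANCELLED by 42\n"
    print(parse_sacct_output(sample))
```

One query per status poll, regardless of how many jobs the workflow submitted, is what makes the shared job name attractive from the scheduler's point of view.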
Ah, I understand now, thanks. This is good from the point of view of how snakemake interacts with SLURM, but it is not convenient for "forensics".
For every broken job, you get information from Snakemake, including the log files (albeit the Snakemake log files are the informative ones, whereas the SLURM log files just show the terminal output of that one job).
If the program in question does not give you any hints when it breaks, or if you write a shell script, you can always add some more verbose output (e.g. by printing the variables). When using the `script` directive with Python, you can run the rule in question with `--debug`.
Snakemake itself will print the `SLURM_JOB_ID` when a job fails.
One of my goals is to better adjust the resources requested by each rule.
I used to do that using the `benchmark` directive and then parsing the generated files, but reportseff, recommended by my cluster admins, turned out to be much more practical for my needs. The only problem is that it doesn't have access to the comment associated with the jobs, so it is not easy to figure out which rule each reported job belongs to.
Currently, my solution is a wrapper around reportseff that parses the `.snakemake/slurm_logs/<rulename>` folder to extract the job ids from the log file names, and then passes those job ids to reportseff to get only the reports for a given rule.
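A sketch of such a wrapper, assuming each log file name under the rule's folder contains the numeric SLURM job id (the exact file-name layout varies by snakemake version, so the pattern below is an assumption to adapt):

```python
# Sketch of a reportseff wrapper for one rule: collect job ids from the
# log file names under .snakemake/slurm_logs/<rulename> and hand them to
# reportseff. The "<jobid>.log" naming is an assumption; check your layout.
import re
import subprocess
from pathlib import Path


def jobids_for_rule(rulename: str, base: str = ".snakemake/slurm_logs") -> list[str]:
    """Extract numeric job ids from the rule's SLURM log file names."""
    ids = []
    for log in Path(base, rulename).glob("*.log"):
        m = re.search(r"(\d+)", log.stem)
        if m:
            ids.append(m.group(1))
    return ids


def report_rule(rulename: str) -> None:
    """Run reportseff on exactly the jobs belonging to one rule."""
    ids = jobids_for_rule(rulename)
    if ids:
        subprocess.run(["reportseff", *ids], check=True)


if __name__ == "__main__":
    report_rule("my_rule")  # "my_rule" is a placeholder rule name
```

With the rule name in the job name itself, as proposed, this detour through the log folder would become unnecessary.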
Currently, the job name is the `run_uuid`, some computer-readable gibberish. It is proposed to prefix this with the rule name for better readability during runs and for workflow debugging on a cluster.
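The proposal amounts to something like the following sketch. The separator and length cap are illustrative choices, not the actual scheme adopted by the executor:

```python
# Sketch of the proposed job naming: prefix the run UUID with the rule
# name so squeue/sacct output becomes human-readable. slurm_job_name is
# a hypothetical helper; the "_" separator and max_len are assumptions.
import uuid


def slurm_job_name(rule_name: str, run_uuid: str, max_len: int = 128) -> str:
    """Build '<rule>_<uuid>', truncated to a conservative length cap."""
    return f"{rule_name}_{run_uuid}"[:max_len]


if __name__ == "__main__":
    run_uuid = str(uuid.uuid4())  # e.g. 38917ff1-c2bf-428a-b85b-1accc2f4846f
    print(slurm_job_name("bwa_map", run_uuid))
```

The trade-off discussed above remains: `sacct --name` matches the full name, so per-rule prefixes mean the single shared-name status query no longer covers the whole workflow.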