Closed juzb closed 6 months ago
Not sure, but can you please check the state of these jobs by invoking sacct in another shell? It happened to me today that there was no process running on the respective node, but SLURM still showed the job as RUNNING. Just want to make sure this is not triggered by the plugin ...
Thanks for looking into this so quickly! When checking sacct manually (using the jobid), I get this:
$sacct -X --parsable2 --noheader --format=JobIdRaw,State -j 1729871
1729871|COMPLETED
Also the logfile of the job itself states in its final lines that it finished successfully:
...
Finished job 0.
1 of 1 steps (100%) done
Storing output in storage.
So to me it looks like the jobs started by the plugin run fine and finish correctly; it's just that the next job is not started. I had this issue with prior snakemake versions too, but there I could fix it with a custom job-status script, which does not currently seem to be an option in this specific plugin.
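For reference, a job-status script of the kind mentioned above (formerly passed via --cluster-status) is just an executable that receives the SLURM job ID and prints running, success, or failed. Below is a minimal sketch in the style of the smk-simple-slurm script, not that repository's actual code; the grouping of SLURM states and the treatment of an empty sacct result are assumptions about a typical setup:

```python
#!/usr/bin/env python3
"""Hypothetical --cluster-status style script: take a SLURM job ID,
query sacct, and print running/success/failed for snakemake."""
import subprocess
import sys

# Assumed grouping of sacct State values; adjust for your cluster.
RUNNING_STATES = {"PENDING", "RUNNING", "REQUEUED", "RESIZING",
                  "SUSPENDED", "COMPLETING"}
SUCCESS_STATES = {"COMPLETED"}


def classify(state: str) -> str:
    """Map a raw sacct State field to a snakemake status word."""
    # sacct may report e.g. "CANCELLED by 1000" or "CANCELLED+".
    state = state.strip().split(" ")[0].rstrip("+")
    if not state:
        # No accounting record yet -- assume the job is still queued.
        return "running"
    if state in SUCCESS_STATES:
        return "success"
    if state in RUNNING_STATES:
        return "running"
    return "failed"


if __name__ == "__main__" and len(sys.argv) > 1:
    jobid = sys.argv[1]
    out = subprocess.run(
        ["sacct", "-X", "--parsable2", "--noheader",
         "--format=State", "-j", jobid],
        capture_output=True, text=True,
    ).stdout
    print(classify(out))
```

The key property of such a script is that it is keyed on the job ID, so it is unaffected by whatever job name the user sets.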
Ok. At least no technical issue we cannot fix ... SLURM sometimes exhibits weird behaviour, so it was worth asking.
To the issue at hand: I noticed that the jobid works, but the name does not (returns an empty string). Can you confirm this? If so, which is your SLURM version (output of sinfo --version)?
In any case, I don't see the sense in re-introducing --cluster-status. Yet, introducing additional safeguards (e.g. printing warnings when a job is overdue due to an error) and better reporting makes a lot of sense.
I think I've found the issue: I gave the jobs names via the SLURM resource flags, which overwrote the name snakemake uses to track the jobs. Removing the -J <myjobname> argument from the resources: slurm-extra string solves the issue for me.
I wasn't aware that naming the jobs would produce conflicts, but can certainly remove them and we can close this issue.
In the long run, a more intuitive solution to me would be tracking jobs via their IDs (looks possible since snakemake prints these to me), and leaving the name for the user to modify. If this is more troublesome for technical reasons, a warning in the documentation, e.g. for the resources, slurm-extra keyword, might be a good idea. Prefixing the job name with the rule name would keep the name 'predictable' if that is important, and still allow for a more readable overview of what snakemake is doing via squeue for large submitted workflows (which 'inspired' me to name the jobs in the first place).

Just for completeness:
To the issue at hand: I noticed that the jobid works, but the name does not (returns an empty string). Can you confirm this?
Correct. And this is the comparison of the two commands (the name-based query returns an empty result):
$ sacct -X --parsable2 --noheader --format=JobIdRaw,State --name 4b40caa5-8f4c-4a25-a7fb-7b626a0589b7
$ sacct -X --parsable2 --noheader --format=JobIdRaw,State,jobname -j 1729871
1729871|COMPLETED|<myjobname>
If so, which is your SLURM version (output of sinfo --version)?
It is slurm 23.02.7
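The empty name-based result above is consistent with the user-supplied -J flag shadowing the name the plugin queries for. A small illustration (not the plugin's actual code; the helper names and the assumption that the later -J/--job-name flag on the sbatch line wins are mine):

```python
# Sketch: the plugin queries sacct by the name *it* assigned, but if a
# user-supplied -J appears later on the sbatch command line, the job is
# accounted under the user's name and the name-based query finds nothing.
def sacct_by_name(name: str) -> list:
    """Build the hypothetical name-based sacct query."""
    return ["sacct", "-X", "--parsable2", "--noheader",
            "--format=JobIdRaw,State", "--name", name]


def effective_job_name(snakemake_name: str, slurm_extra: str) -> str:
    """Return the name the job is accounted under, assuming any
    -J/--job-name in slurm_extra overrides snakemake's own name."""
    tokens = slurm_extra.split()
    name = snakemake_name
    for i, tok in enumerate(tokens):
        if tok in ("-J", "--job-name") and i + 1 < len(tokens):
            name = tokens[i + 1]
    return name


# With slurm-extra "-J myjobname", the accounted name is "myjobname",
# so querying sacct --name with the UUID-style snakemake name is empty.
accounted = effective_job_name(
    "4b40caa5-8f4c-4a25-a7fb-7b626a0589b7", "-J myjobname")
```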
In the long run, a more intuitive solution to me would be tracking jobs via their IDs (looks possible since snakemake prints these to me), and leaving the name for the user to modify
Yep, a minor change in job names is envisioned. Tracking by ID might break a workflow in the corner case where admins allow otherwise unique IDs to be duplicated. This is SLURM behaviour: when the IDs overflow, the counter starts again. Not likely to happen during a workflow run, but remotely possible.
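The wraparound mentioned above can be sketched as follows; the default maximum of 67,043,328 (0x03FF0000) is taken from slurm.conf's MaxJobId and may differ on a given cluster:

```python
# Sketch of SLURM's job-ID wraparound: IDs count up to a configurable
# maximum (MaxJobId) and then restart at FirstJobId, so a job ID is
# only unique until the counter wraps around.
def next_job_id(current: int, first_id: int = 1,
                max_id: int = 67_043_328) -> int:
    """Return the ID the next submitted job would get."""
    return current + 1 if current < max_id else first_id
```

This is why ID-based tracking is safe in practice for a single workflow run, but not guaranteed unique over the lifetime of a cluster.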
If this is more troublesome for technical reasons, a warning in the documentation, e.g. for the resources, slurm-extra keyword, might be a good idea
Absolutely! I will work to clarify the docs. Promised!
Prefixing the job name with the rule name would keep the name 'predictable' if that is important, and still allow for a more readable overview of what snakemake is doing via squeue for large submitted workflows (which 'inspired' me to name the jobs in the first place)
I share this opinion and will introduce this minor change, a.s.a.p. (read: not within a week, I'm afraid).
Sounds perfect, thanks a lot! And without modifying the names, everything is running well - so no rush from my side!
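The rule-name prefix discussed in this exchange could look like the sketch below; it assumes the unique per-job suffix snakemake uses for tracking is a UUID (as the 4b40caa5-... name earlier in the thread suggests), and the helper is hypothetical, not the plugin's actual code:

```python
import uuid


def job_name(rule: str) -> str:
    """Hypothetical naming scheme: keep a unique UUID suffix for
    tracking, but prefix the rule name so squeue output stays
    readable for large workflows."""
    return "{}-{}".format(rule, uuid.uuid4())
```

With this scheme, squeue -o "%j" would show names like align-4b40caa5-... instead of a bare UUID, while the suffix remains unique per job.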
Thanks for creating this Slurm-Plugin!
Would it be possible to allow the user to modify the status-check command, as was the case in prior versions of snakemake with the --cluster-status CLI argument?

I mean this string specifying the sacct command here specifically: https://github.com/snakemake/snakemake-executor-plugin-slurm/blob/739db189972570475a7b023ded16a62b3fcf4a74/snakemake_executor_plugin_slurm/__init__.py#L211

Perhaps a separate issue, but the reason I'm asking is that snakemake freezes after finishing the first command because it cannot tell that the job is finished. It is printing this (in verbose mode) at regular intervals:

I'm trying to use something like this status-check script from the smk-simple-slurm repo.
Would it also make sense to implement the basic idea of that jobid-based script as an alternative to the existing name-based option? This could perhaps be enabled by a status-check-via-jobid option. Or am I missing something here?