Closed juzb closed 6 months ago
Not sure, but can you please check the state of these jobs by invoking sacct in another shell? It happened to me today that there was no process running on the respective node, but SLURM still showed the job as RUNNING. Just want to make sure this is not triggered by the plugin ...
Thanks for looking into this so quickly! When checking sacct manually (using the jobid), I get this:
$sacct -X --parsable2 --noheader --format=JobIdRaw,State -j 1729871
1729871|COMPLETED
Also the logfile of the job itself states in its final lines that it finished successfully:
...
Finished job 0.
1 of 1 steps (100%) done
Storing output in storage.
So to me it looks like the jobs started by the plugin run fine and finish correctly; it's just that the next job is not started. I had this issue with prior snakemake versions too, but there I could fix it with a custom job-status script, which does not currently seem to be an option in this specific plugin.
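For reference, a job-status script of the kind mentioned above (formerly passed via --cluster-status) is just an executable that receives the SLURM job ID and prints running, success, or failed. Below is a minimal sketch in the style of the smk-simple-slurm script, not that repository's actual code; the grouping of SLURM states and the treatment of an empty sacct result are assumptions about a typical setup:

```python
#!/usr/bin/env python3
"""Hypothetical --cluster-status style script: take a SLURM job ID,
query sacct, and print running/success/failed for snakemake."""
import subprocess
import sys

# Assumed grouping of sacct State values; adjust for your cluster.
RUNNING_STATES = {"PENDING", "RUNNING", "REQUEUED", "RESIZING",
                  "SUSPENDED", "COMPLETING"}
SUCCESS_STATES = {"COMPLETED"}


def classify(state: str) -> str:
    """Map a raw sacct State field to a snakemake status word."""
    # sacct may report e.g. "CANCELLED by 1000" or "CANCELLED+".
    state = state.strip().split(" ")[0].rstrip("+")
    if not state:
        # No accounting record yet -- assume the job is still queued.
        return "running"
    if state in SUCCESS_STATES:
        return "success"
    if state in RUNNING_STATES:
        return "running"
    return "failed"


if __name__ == "__main__" and len(sys.argv) > 1:
    jobid = sys.argv[1]
    out = subprocess.run(
        ["sacct", "-X", "--parsable2", "--noheader",
         "--format=State", "-j", jobid],
        capture_output=True, text=True,
    ).stdout
    print(classify(out))
```

The key property of such a script is that it is keyed on the job ID, so it is unaffected by whatever job name the user sets.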
Ok. At least no technical issue we cannot fix ... SLURM sometimes exhibits weird behaviour, so it was worth asking.
To the issue at hand: I noticed that the jobid works, but the name does not (returns an empty string). Can you confirm this? If so, which is your SLURM version (output of sinfo --version)?
In any case, I don't see the sense in re-introducing --cluster-status. Yet, introducing additional safeguards (e.g. printing warnings when a job is overdue due to an error) and better reporting makes a lot of sense.
I think I've found the issue: I gave the jobs names via the SLURM resource flags, which overwrote the name snakemake uses to track the jobs. Removing the -J <myjobname> argument from the resources: slurm-extra string solves the issue for me.
I wasn't aware that naming the jobs would produce conflicts, but can certainly remove them and we can close this issue.
In the long run, a more intuitive solution to me would be tracking jobs via their IDs (looks possible since snakemake prints these to me), and leaving the name for the user to modify. If this is more troublesome for technical reasons, a warning in the documentation, e.g. for the resources, slurm-extra keyword, might be a good idea. Prefixing the job name with the rule name would keep the name 'predictable' if that is important, and still allow for a more readable overview of what snakemake is doing via squeue for large submitted workflows (which 'inspired' me to name the jobs in the first place).

Just for completeness:
To the issue at hand: I noticed that the jobid works, but the name does not (returns an empty string). Can you confirm this?
Correct. And this is the comparison of the two commands (the name-based query returns an empty result):
$ sacct -X --parsable2 --noheader --format=JobIdRaw,State --name 4b40caa5-8f4c-4a25-a7fb-7b626a0589b7
$ sacct -X --parsable2 --noheader --format=JobIdRaw,State,jobname -j 1729871
1729871|COMPLETED|<myjobname>
If so, which is your SLURM version (output of sinfo --version)?
It is slurm 23.02.7
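The empty name-based result above is consistent with the user-supplied -J flag shadowing the name the plugin queries for. A small illustration (not the plugin's actual code; the helper names and the assumption that the later -J/--job-name flag on the sbatch line wins are mine):

```python
# Sketch: the plugin queries sacct by the name *it* assigned, but if a
# user-supplied -J appears later on the sbatch command line, the job is
# accounted under the user's name and the name-based query finds nothing.
def sacct_by_name(name: str) -> list:
    """Build the hypothetical name-based sacct query."""
    return ["sacct", "-X", "--parsable2", "--noheader",
            "--format=JobIdRaw,State", "--name", name]


def effective_job_name(snakemake_name: str, slurm_extra: str) -> str:
    """Return the name the job is accounted under, assuming any
    -J/--job-name in slurm_extra overrides snakemake's own name."""
    tokens = slurm_extra.split()
    name = snakemake_name
    for i, tok in enumerate(tokens):
        if tok in ("-J", "--job-name") and i + 1 < len(tokens):
            name = tokens[i + 1]
    return name


# With slurm-extra "-J myjobname", the accounted name is "myjobname",
# so querying sacct --name with the UUID-style snakemake name is empty.
accounted = effective_job_name(
    "4b40caa5-8f4c-4a25-a7fb-7b626a0589b7", "-J myjobname")
```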
In the long run, a more intuitive solution to me would be tracking jobs via their IDs (looks possible since snakemake prints these to me), and leaving the name for the user to modify
Yep, a minor change in job names is envisioned. Tracking by ID might break a workflow in the corner case where admins allow otherwise unique IDs to be duplicated. This is SLURM behaviour: when the IDs overflow, the counter starts again. Not likely to happen during a workflow run, but remotely possible.
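The wraparound mentioned above can be sketched as follows; the default maximum of 67,043,328 (0x03FF0000) is taken from slurm.conf's MaxJobId and may differ on a given cluster:

```python
# Sketch of SLURM's job-ID wraparound: IDs count up to a configurable
# maximum (MaxJobId) and then restart at FirstJobId, so a job ID is
# only unique until the counter wraps around.
def next_job_id(current: int, first_id: int = 1,
                max_id: int = 67_043_328) -> int:
    """Return the ID the next submitted job would get."""
    return current + 1 if current < max_id else first_id
```

This is why ID-based tracking is safe in practice for a single workflow run, but not guaranteed unique over the lifetime of a cluster.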
If this is more troublesome for technical reasons, a warning in the documentation, e.g. for the resources, slurm-extra keyword, might be a good idea
Absolutely! I will work to clarify the docs. Promised!
Prefixing the job name with the rule name would keep the name 'predictable' if that is important, and still allow for a more readable overview of what snakemake is doing via squeue for large submitted workflows (which 'inspired' me to name the jobs in the first place)
I share this opinion and will introduce this minor change, a.s.a.p. (read: not within a week, I'm afraid).
Sounds perfect, thanks a lot! And without modifying the names, everything is running well - so no rush from my side!
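The rule-name prefix discussed in this exchange could look like the sketch below; it assumes the unique per-job suffix snakemake uses for tracking is a UUID (as the 4b40caa5-... name earlier in the thread suggests), and the helper is hypothetical, not the plugin's actual code:

```python
import uuid


def job_name(rule: str) -> str:
    """Hypothetical naming scheme: keep a unique UUID suffix for
    tracking, but prefix the rule name so squeue output stays
    readable for large workflows."""
    return "{}-{}".format(rule, uuid.uuid4())
```

With this scheme, squeue -o "%j" would show names like align-4b40caa5-... instead of a bare UUID, while the suffix remains unique per job.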
Thanks for creating this Slurm-Plugin!
Would it be possible to allow the user to modify the status-check command, as was the case in prior versions of snakemake with the --cluster-status CLI argument?

I mean this string specifying the sacct command here specifically: https://github.com/snakemake/snakemake-executor-plugin-slurm/blob/739db189972570475a7b023ded16a62b3fcf4a74/snakemake_executor_plugin_slurm/__init__.py#L211

Perhaps a separate issue, but the reason I'm asking is that snakemake freezes after finishing the first command because it cannot tell that the job is finished. It is printing this (in verbose mode) at regular intervals:

I'm trying to use something like this status-check script from the smk-simple-slurm repo.
Would it also make sense to implement the basic idea of that jobid-based script as an alternative to the existing name-based option? This could perhaps be enabled by a status-check-via-jobid option. Or am I missing something here?