martius-lab / cluster_utils

https://cluster-utils.readthedocs.io/stable/
Other
12 stars 1 forks source link

Slurm: Add new option to send USR1 signal before timeout #84

Closed luator closed 7 months ago

luator commented 7 months ago

TL;DR

Add option to send signal to jobs before they are killed by timeout.

Long Version

Add a new option timeout_signal_time in the cluster_requirements section. If set, it adds --signal=USR1@<time> to the sbatch call, resulting in Slurm to send a USR1 signal at the specified time (in seconds) before it would kill the job due to timeout.

For the signal to actually be sent to the process running the main script, this needs to be executed with srun, so some change in the job command creation is needed to inject that for Slurm jobs.

Also add an example and embed it in the documentation.

Resolves #64.

How I Tested

By running the example.

luator commented 7 months ago
  1. I think adding ..._in_secs is a good idea to make the format/unit clear. I'm generally not so happy with the name timeout_signal_time, as I think it's not very descriptive. Something like seconds_before_timeout_at_which_signal_is_sent would be much clearer but also annoyingly long...
    I wouldn't unifying the format with request_time as the parsing done by Slurm for that is actually not so trivial (from the sbatch manpage: 'Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".'), so I'd rather not deal with that in our code.
  2. Yes, I tested with Singularity because I was also a bit worried about that. It was working fine, so seems the signal is properly passed on by Singularity. From what I understood, it might not work directly if you have a bash script wrapping your main script, though (but I didn't explicitly test that).
  3. Sounds good to me. It will make it a bit less flexible but much easier for the user. I'd indeed rather do that in a separate PR, though.
luator commented 7 months ago

How about signal_seconds_before_timeout for the parameter name?

mseitzer commented 7 months ago

signal_seconds_to_timeout would be a bit shorter even, but works for me with "before" as well.

mseitzer commented 7 months ago

2. From what I understood, it might not work directly if you have a bash script wrapping your main script, though (but I didn't explicitly test that).

Maybe noting this in the docs would be nice then.

luator commented 7 months ago

Maybe noting this in the docs would be nice then.

I'll actually do an explicit test first, so we know for sure.

luator commented 7 months ago

Wrapper scripts are indeed a problem for the signal. I couldn't directly find a proper way of handling it in this case, so I just added a warning to the docs for now.