Slurm: Send signal if job is about to time out.

Slurm will terminate jobs if they take longer than the requested time. However, it can be configured to send a signal a bit before that, so the job can save a checkpoint.

For this to work, two changes are necessary:

Add the argument --signal=USR1@120 to the sbatch script. This tells Slurm to send SIGUSR1 (a user-defined signal) at least 120 seconds before the timeout; because of Slurm's event-handling resolution, the signal may arrive up to 60 seconds earlier than specified. It might make sense to make the 120 configurable for jobs where saving takes longer.
Execute the user script with srun (i.e. srun python ... instead of python ... directly) in the sbatch script. Otherwise the signal will not be sent to the Python process.
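
To illustrate what "making the 120 configurable" could look like, here is a minimal sketch that renders the directive line from a parameter. The function name signal_directive and its arguments are hypothetical and not part of cluster_utils; the sketch assumes the sbatch script is generated from Python.

```python
def signal_directive(lead_time_s: int = 120, signal_name: str = "USR1") -> str:
    """Return the sbatch directive that requests an early warning signal.

    Hypothetical helper for illustration only, not an existing cluster_utils API.
    """
    if lead_time_s < 0:
        raise ValueError("lead_time_s must not be negative")
    return f"#SBATCH --signal={signal_name}@{lead_time_s}"


# Default lead time proposed in this issue.
print(signal_directive())
# A job whose checkpoint takes longer to write could ask for more time.
print(signal_directive(lead_time_s=600))
```

The lead time would then become one more job setting next to the requested run time, so users with slow checkpoints can raise it without editing the script template.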

The user still needs to set up a signal handler in their code to detect the signal and take appropriate action (e.g. save a checkpoint). Since this is very task-specific, it is up to the user to take care of it. A simple example should be added to the documentation, though.

For reference, here is a working example which doesn't use cluster_utils:

sbatch script:

#!/bin/bash

#SBATCH --partition=cpu-galvani
#SBATCH --time=00:03:00  # job will be cancelled after 3 minutes
#SBATCH --signal=USR1@30  # signal about 30 seconds before timeout
#SBATCH --output=output-%j.out

echo "Start execution"
srun python3.11 ./signal_listening_job.py
rc=$?

echo "signal_listening_job.py terminated with exit code ${rc}"

signal_listening_job.py:

"""A simple example script that does some dummy work and listens for a stop signal."""
import signal
import sys
import time

class Worker:

    N_STEPS = 100
    STEP_DURATION_S = 10

    def __init__(self) -> None:
        self.step = 0
        self.received_signal = False

        # register signal handler
        signal.signal(signal.SIGUSR1, self.signal_handler_sigusr1)

    def signal_handler_sigusr1(self, sig, frame):
        """This method gets called when the process receives the SIGUSR1 signal."""
        print("Received SIGUSR1, prepare for shutdown.", flush=True)
        self.received_signal = True

    def run(self) -> int:
        """Do the actual work here and return the exit code for the process."""
        while (self.step < self.N_STEPS) and not self.received_signal:
            print("Step {}".format(self.step), flush=True)
            # do some dummy work (aka sleep)
            time.sleep(self.STEP_DURATION_S)

            self.step += 1

        if self.received_signal:
            print("Save current status and exit", flush=True)
            time.sleep(5)
            ...
            # Nothing saved in this example but in an actual application this might be
            # the place to store a snapshot.

            # Simply return with a custom error code in this example.  When using
            # cluster_utils, this would be the place to call `exit_for_resume()`.
            return 3

        return 0

if __name__ == "__main__":
    sys.exit(Worker().run())
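
The example above leaves the actual saving as a placeholder. As a rough sketch of how that gap might be filled, the variant below writes the step counter to a JSON file when SIGUSR1 arrives and picks it up again on the next start. The class name CheckpointingWorker, the JSON layout and the file name are assumptions made for illustration; the signal is sent with os.kill to the script's own process so the behaviour can be tried without Slurm.

```python
import json
import os
import signal
import tempfile
import time
from pathlib import Path


class CheckpointingWorker:
    """Hypothetical variant of Worker that saves and restores its step counter."""

    def __init__(self, checkpoint_path: Path, n_steps: int = 100,
                 step_duration_s: float = 0.01) -> None:
        self.checkpoint_path = checkpoint_path
        self.n_steps = n_steps
        self.step_duration_s = step_duration_s
        self.received_signal = False
        # Resume from an earlier checkpoint if one exists.
        self.step = self._load_step()
        signal.signal(signal.SIGUSR1, self._on_sigusr1)

    def _on_sigusr1(self, sig, frame) -> None:
        # Only set a flag here; the main loop does the actual saving.
        self.received_signal = True

    def _load_step(self) -> int:
        if self.checkpoint_path.exists():
            return json.loads(self.checkpoint_path.read_text())["step"]
        return 0

    def _save_step(self) -> None:
        # Write to a temporary file first, then rename, so a job that is
        # killed mid-write does not leave a truncated checkpoint behind.
        tmp_path = self.checkpoint_path.with_suffix(".tmp")
        tmp_path.write_text(json.dumps({"step": self.step}))
        os.replace(tmp_path, self.checkpoint_path)

    def run(self) -> int:
        while self.step < self.n_steps and not self.received_signal:
            time.sleep(self.step_duration_s)  # dummy work
            self.step += 1
        if self.received_signal:
            self._save_step()
            return 3  # same custom exit code as in the example above
        return 0


# Local try-out: send SIGUSR1 to this process as a stand-in for Slurm.
checkpoint = Path(tempfile.mkdtemp()) / "checkpoint.json"
worker = CheckpointingWorker(checkpoint)
os.kill(os.getpid(), signal.SIGUSR1)
print("exit code:", worker.run())
print("checkpoint:", checkpoint.read_text())
```

Keeping the handler down to setting a flag means the checkpoint is always written from the main loop at a step boundary, never from the middle of a step.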

martius-lab / cluster_utils

Slurm: Send signal if job is about to time out. #64