martius-lab / cluster_utils

https://cluster-utils.readthedocs.io/stable/
Other
12 stars 0 forks source link

Slurm: Send signal if job is about to time out. #64

Closed luator closed 6 months ago

luator commented 7 months ago

Slurm will terminate jobs if they take longer than the requested time. However, it can be configured to send a signal a bit before that, so the job can save a checkpoint.

For this to work, two changes are necessary:

The user then still needs to set up signal handler in their code to detect it and take appropriate actions (e.g. save a checkpoint) but since this is very task-specific, it is up to the user to take care of it. A simple example should be added to the documentation, though.

For reference, here is a working example which doesn't use cluster_utils:

sbatch script:

#!/bin/bash

#SBATCH --partition=cpu-galvani
#SBATCH --time=00:03:00  # job will be cancelled after 3 minutes
#SBATCH --signal=USR1@30  # signal about 30 seconds before timeout
#SBATCH --output=output-%j.out

echo "Start execution"
srun python3.11 ./signal_listening_job.py
rc=$?

echo "signal_listening_job.py terminated with exit code ${rc}"

signal_listening_job.py:

"""A simple example script that does some dummy work and listens for a stop signal."""
import signal
import sys
import time

class Worker:

    N_STEPS = 100
    STEP_DURATION_S = 10

    def __init__(self) -> None:
        self.step = 0
        self.received_signal = False

        # register signal handler
        signal.signal(signal.SIGUSR1, self.signal_handler_sigusr1)

    def signal_handler_sigusr1(self, sig, frame):
        """This method gets called when the process receives the SIGUSR1 signal."""
        print("Received SIGUSR1, prepare for shutdown.", flush=True)
        self.received_signal = True

    def run(self) -> None:
        """Do the actual work here."""
        while (self.step < self.N_STEPS) and not self.received_signal:
            print("Step {}".format(self.step), flush=True)
            # do some dummy work (aka sleep)
            time.sleep(self.STEP_DURATION_S)

            self.step += 1

        if self.received_signal:
            print("Save current status and exit")
            time.sleep(5)
            ...
            # Nothing saved in this example but in an actual application this might be
            # the place to store a snapshot.

            # Simply return with a custom error code in this example.  When using
            # cluster utils, this would be the place to call `exit_for_resume()`.
            return 3

        return 0

if __name__ == "__main__":
    sys.exit(Worker().run())