Slurm terminates jobs that exceed their requested time limit. However, it can be configured to send a signal shortly before that happens, giving the job a chance to save a checkpoint.
For this to work, two changes are necessary:
1. Add the argument `--signal=USR1@120` to the sbatch script. This tells Slurm to send SIGUSR1 (a user-defined signal) at least 120 seconds before the timeout. It might make sense to make the 120 configurable for jobs where saving a checkpoint takes longer.
2. Execute the user script with srun (i.e. `srun python ...` instead of plain `python ...`) in the sbatch script. Otherwise the signal will not be sent to the Python process.

The user then still needs to set up a signal handler in their code to detect the signal and take appropriate action (e.g. save a checkpoint). Since this is very task-specific, it is up to the user to take care of it, but a simple example should be added to the documentation; see the minimal sketch below and the full reference example that follows.
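A minimal sketch of such a handler (the names here are illustrative, not part of any existing API) could look like this: the handler only records that the signal arrived, and the main loop decides when it is safe to stop.

```python
import signal

stop_requested = False


def handle_sigusr1(signum, frame):
    """Only record that the signal arrived; the main loop reacts to the flag."""
    global stop_requested
    stop_requested = True


# Register the handler so Slurm's SIGUSR1 is caught instead of terminating the process.
signal.signal(signal.SIGUSR1, handle_sigusr1)
```

The same idea, wrapped in a small worker class, is shown in the reference example below.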
For reference, here is a working example which doesn't use cluster_utils:
sbatch script:
```bash
#!/bin/bash
#SBATCH --partition=cpu-galvani
#SBATCH --time=00:03:00 # job will be cancelled after 3 minutes
#SBATCH --signal=USR1@30 # signal about 30 seconds before timeout
#SBATCH --output=output-%j.out

echo "Start execution"

srun python3.11 ./signal_listening_job.py
rc=$?

echo "signal_listening_job.py terminated with exit code ${rc}"
```
signal_listening_job.py:
"""A simple example script that does some dummy work and listens for a stop signal."""
import signal
import sys
import time
class Worker:
N_STEPS = 100
STEP_DURATION_S = 10
def __init__(self) -> None:
self.step = 0
self.received_signal = False
# register signal handler
signal.signal(signal.SIGUSR1, self.signal_handler_sigusr1)
def signal_handler_sigusr1(self, sig, frame):
"""This method gets called when the process receives the SIGUSR1 signal."""
print("Received SIGUSR1, prepare for shutdown.", flush=True)
self.received_signal = True
def run(self) -> None:
"""Do the actual work here."""
while (self.step < self.N_STEPS) and not self.received_signal:
print("Step {}".format(self.step), flush=True)
# do some dummy work (aka sleep)
time.sleep(self.STEP_DURATION_S)
self.step += 1
if self.received_signal:
print("Save current status and exit")
time.sleep(5)
...
# Nothing saved in this example but in an actual application this might be
# the place to store a snapshot.
# Simply return with a custom error code in this example. When using
# cluster utils, this would be the place to call `exit_for_resume()`.
return 3
return 0
if __name__ == "__main__":
sys.exit(Worker().run())
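For comparison, here is a hedged sketch of what the shutdown branch could look like in a job that does use cluster_utils, as hinted at in the comment above: instead of returning a custom exit code, the job saves its state and calls `exit_for_resume()`. The import path and the checkpointing details are assumptions and should be checked against the installed cluster_utils version.

```python
"""Sketch only: shutdown logic for a job that uses cluster_utils (not verified)."""
# Assumption: `exit_for_resume` is importable like this in the installed
# cluster_utils version; adjust the import if the package exposes it elsewhere.
from cluster_utils import exit_for_resume


def shutdown_for_resume(step: int) -> None:
    """Save a snapshot, then ask cluster_utils to resume the job later."""
    # Hypothetical checkpointing: a real job would serialize its state here
    # (current step, model parameters, optimizer state, ...).
    print("Saving checkpoint at step {}".format(step), flush=True)

    # Exits the process in a way that cluster_utils interprets as "resume this
    # job later" rather than "finished" or "failed".
    exit_for_resume()
```

In the reference example above, this call would replace the `return 3` in `Worker.run()`.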