ua-snap / cmip6-utils

Pipelines and utilities for working with CMIP6 data

Monitor for stuck / failed regrid jobs #5

Open · Joshdpaul opened this issue 5 months ago

Joshdpaul commented 5 months ago

Consider implementing some sort of monitoring/management of the regrid jobs. Could this be done via slurm, i.e., by getting job numbers and their elapsed-time data from squeue or sacct?

We could set a time limit (maybe 3 hrs) and, if jobs take longer than that, cancel and restart them. Try to restart any failed jobs (once?).
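
A minimal sketch of what that check could look like, assuming the job IDs have already been collected; the 3 hr limit and the helper names are assumptions, and the sacct/scancel flags may need adjusting for Chinook:

import subprocess

# assumed 3 hr limit from the comment above; tune as needed
TIME_LIMIT_SECONDS = 3 * 60 * 60


def elapsed_seconds(elapsed_str):
    """Convert a slurm Elapsed string ([DD-][HH:]MM:SS) to seconds."""
    days = 0
    if "-" in elapsed_str:
        day_part, elapsed_str = elapsed_str.split("-")
        days = int(day_part)
    parts = [int(x) for x in elapsed_str.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


def cancel_stuck_jobs(job_ids):
    """Cancel any RUNNING job that has exceeded the time limit so it can be resubmitted."""
    cancelled = []
    for job_id in job_ids:
        out = subprocess.check_output(
            ["sacct", "-j", str(job_id), "-X", "--noheader", "--parsable2",
             "--format=Elapsed,State"],
            text=True,
        ).strip()
        if not out:
            continue  # job not yet visible in accounting
        # use the last row in case sacct returns more than one
        elapsed, state = out.splitlines()[-1].split("|")[:2]
        if state.startswith("RUNNING") and elapsed_seconds(elapsed) > TIME_LIMIT_SECONDS:
            subprocess.run(["scancel", str(job_id)], check=True)
            cancelled.append(job_id)
    return cancelled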

Joshdpaul commented 5 months ago

See the indicators/slurm.py/submit_sbatch() function for ideas on how to submit jobs and return the job ids as a list for monitoring. If we want to run via Prefect, this step of submitting sbatch jobs should move outside of the regrid_cmip6.ipynb notebook and into a regular python executable.
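
For reference, the general shape of that pattern (a sketch only, not the actual submit_sbatch() implementation, and the helper name here is hypothetical); sbatch --parsable prints just the job ID, which makes collecting IDs for monitoring straightforward:

import subprocess


def submit_sbatch_scripts(sbatch_paths):
    """Submit each sbatch script and return the slurm job IDs as a list.

    A sketch of the pattern referenced above, not the actual
    indicators/slurm.py/submit_sbatch() implementation.
    """
    job_ids = []
    for path in sbatch_paths:
        # --parsable prints "jobid" (or "jobid;cluster") with no other text
        out = subprocess.check_output(["sbatch", "--parsable", str(path)], text=True)
        job_ids.append(int(out.strip().split(";")[0]))
    return job_ids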

Joshdpaul commented 3 months ago

Copied from the Slack tech-sysadmin channel, 3/7/24:

TLDR about requeueing slurm jobs: Today I've been messing with slurm requeue features and have figured out a few things. Automatic requeueing is possible using commands completely contained inside of the sbatch script (yay!), but it involves a little more work than just setting the time limit and the --requeue flag. Due to the slurm config on Chinook, we also have to do something a little hacky with trap, following the advice here, but it works as expected. If you use sbatch to run the test script below, it will automatically restart after ~1 min and will count the number of restarts using the SLURM_RESTART_COUNT env variable that is created by slurm when requeueing. Termination signals are "trapped" and the job is requeued until we reach a defined execution-attempt limit (in this case 3). Above that limit, the script just exits and prints a custom error message in the output file.

The test script:

#!/bin/sh
#SBATCH --nodes=1
#SBATCH -p debug
#SBATCH --time=00:01:00
#SBATCH --output /home/jdpaul3/cmip6-utils_testing/squeue_test_%j.out

echo Start slurm && date
# SLURM_RESTART_COUNT is unset on the first run; slurm sets/increments it on each requeue
export ATTEMPT=$(( ${SLURM_RESTART_COUNT:-0} + 1 ))

echo "Execution attempt: $ATTEMPT"

if [ $ATTEMPT -le 3 ]
then
    trap "scontrol requeue ${SLURM_JOB_ID}" SIGTERM
    sleep 120 ###this is where the regrid code would execute
else
    echo "Too many execution attempts"
    exit
fi

The output file:

Start slurm
Thu Mar  7 14:49:26 AKST 2024
Execution attempt: 4
Too many execution attempts

Unfortunately 1 min is the smallest time resolution slurm allows for the --time flag, so this test takes 5-10 minutes to run 🙂 FWIW, the ideal slurm config would have JobRequeue=1, KillWait=30, and RequeueExit=15; that would allow the --requeue flag to initiate an automatic restart of the job if it terminates before completion. However, RequeueExit is empty in the current config, so we have to use the conditional statements and the trap to work around that, as described in the SO post. It's also worth mentioning that the jobs do not immediately restart; they get added back into the queue like any other job, so depending on current usage on the Chinook nodes the restart process could take a long time.
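
One way the monitoring side could account for that delay is to poll the requeued job's state until it leaves PENDING again. A rough sketch under that assumption (the helper is hypothetical, not part of the repo):

import subprocess
import time


def wait_for_restart(job_id, poll_seconds=60):
    """Poll squeue until a requeued job leaves the PENDING state.

    A requeued job re-enters the queue like any other job, so it may sit
    in PENDING for a while depending on load on the Chinook nodes.
    """
    while True:
        try:
            out = subprocess.check_output(
                ["squeue", "-j", str(job_id), "--noheader", "--format=%T"],
                text=True,
            ).strip()
        except subprocess.CalledProcessError:
            return None  # squeue no longer knows the job (it left the queue a while ago)
        if not out:
            return None  # job has left the queue (finished or cancelled)
        if out != "PENDING":
            return out  # e.g. RUNNING again
        time.sleep(poll_seconds)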