mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

getStatus() for slurm TIMEOUT? #196

Open jgrn307 opened 6 years ago

jgrn307 commented 6 years ago

Following on from my barrage of Slurm testing, it appears that batchtools reports SLURM jobs that end in a TIMEOUT as "Done" rather than "Error". Is there anything that can be done about this? I know the Slurm "sacct" program (https://slurm.schedmd.com/sacct.html) can retrieve the "true" status of jobs; is this used? If not, is there some way to identify TIMEOUT jobs so they can be restarted with a longer walltime?

Edit: I was wrong -- it looks like they are appearing as "Expired" (as do queued jobs) -- is there any way to distinguish truly expired (TIMEOUT) jobs from merely queued ones?

jgrn307 commented 6 years ago

Related question: is there a way to resubmit expired jobs only (assuming they aren't queued), e.g. with longer walltimes?

mllg commented 6 years ago

You are probably looking for findExpired() and then submitJobs(findExpired(), resources = list(walltime = [longer walltime])).

batchtools defines a job as expired iff it is not found on the system (as determined by findOnSystem(), which calls squeue internally and looks for the states R, S, CG and PD) and the job has not communicated its results back. Note that on some systems there can be a short window in which jobs are identified as expired even though they terminated successfully (if the scheduler is faster than the network file system).
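
A minimal sketch of that resubmission pattern, assuming the registry lives in a directory called "registry" and that your Slurm template reads a walltime resource (whether the unit is seconds or minutes depends on the template):

```r
library(batchtools)

# load the existing registry in writeable mode (the path is an assumption)
reg = loadRegistry("registry", writeable = TRUE)

# jobs that disappeared from the scheduler without reporting results back
expired = findExpired(reg = reg)

# resubmit with a longer walltime; the resource name and unit must match
# whatever your Slurm template expects
submitJobs(expired, resources = list(walltime = 12 * 3600), reg = reg)
```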

jgrn307 commented 6 years ago

Ah, it might be better to use sacct instead of squeue for job status with Slurm -- sacct reports the status of all jobs: past, queued, completed, failed, timed out, etc.

From https://slurm.schedmd.com/sacct.html :

JOB STATE CODES

BF BOOT_FAIL: Job terminated due to launch failure, typically due to a hardware failure (e.g. unable to boot the node or block and the job can not be requeued).
CA CANCELLED: Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
CD COMPLETED: Job has terminated all processes on all nodes with an exit code of zero.
DL DEADLINE: Job terminated on deadline.
F FAILED: Job terminated with non-zero exit code or other failure condition.
NF NODE_FAIL: Job terminated due to failure of one or more allocated nodes.
OOM OUT_OF_MEMORY: Job experienced out of memory error.
PD PENDING: Job is awaiting resource allocation.
PR PREEMPTED: Job terminated due to preemption.
R RUNNING: Job currently has an allocation.
RQ REQUEUED: Job was requeued.
RS RESIZING: Job is about to change size.
RV REVOKED: Sibling was removed from cluster due to other cluster starting the job.
S SUSPENDED: Job has an allocation, but execution has been suspended and CPUs have been released for other jobs.
TO TIMEOUT: Job terminated upon reaching its time limit.
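
For illustration only, a rough sketch of how those states could be pulled from sacct in R; the helper name sacct_states() and the choice of columns are assumptions, not batchtools code:

```r
# hypothetical helper, not part of batchtools: query job states via sacct
sacct_states = function(batch.ids, start.date) {
  args = c(
    "-X",                  # one line per job allocation, no job steps
    "-n", "-P",            # no header, pipe-separated output
    "-o", "JobID,State",   # only the columns needed here
    "-S", start.date,      # earliest date to consider, e.g. "2019-01-01"
    "-j", paste(batch.ids, collapse = ",")
  )
  out = system2("sacct", args, stdout = TRUE)
  parts = strsplit(out, "|", fixed = TRUE)
  data.frame(
    batch.id = vapply(parts, `[`, character(1L), 1L),
    state    = vapply(parts, `[`, character(1L), 2L),
    stringsAsFactors = FALSE
  )
}

# e.g.: subset(sacct_states(c("123456", "123457"), "2019-01-01"), state == "TIMEOUT")
```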

jgrn307 commented 6 years ago

I can get you a sample output file from sacct if you let me know what parameters batchtools would need for getStatus() (besides the job status). One thing sacct will definitely need is the start date of the submission (otherwise it defaults to listing only the current day, which won't work for multi-day jobs) -- I assume this is stored in batchtools' registry somewhere?

mllg commented 6 years ago

The date is stored internally in reg$status. To give more information for expired jobs you would need to extend the cluster functions with an additional call, i.e. getJobStatus(date, batch.id). Note that this is not possible for all (most?) cluster backends.
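
A hypothetical shape for such a hook, reusing the sacct_states() sketch from above; getJobStatus() is only the call proposed in this comment, not an existing cluster-functions field:

```r
# proposed extension point, not an existing batchtools API;
# reuses sacct_states() from the sketch further up in this thread
getJobStatus = function(date, batch.ids) {
  # 'date': earliest submission date as "YYYY-MM-DD", taken from the registry
  # 'batch.ids': scheduler ids of the jobs in question
  states = sacct_states(batch.ids, date)
  # TIMEOUT can now be told apart from PENDING or RUNNING
  states$timed.out = states$state == "TIMEOUT"
  states
}
```

getStatus() (or a dedicated find* helper) could then report timed-out jobs separately from merely queued ones, at least on backends that provide an accounting command like sacct.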

I'll look into it as soon as I find some spare time.