martius-lab / cluster_utils

https://cluster-utils.readthedocs.io/stable/

Failed jobs not detected on Slurm #87

Closed luator closed 6 months ago

luator commented 7 months ago

Jobs that are terminated by the cluster (e.g. due to a timeout) no longer seem to be detected by cluster_utils.

Example: sacct --parsable2 --format=JobID,NodeList,State,ExitCode ... gives the following output for jobs that die from an unhandled SIGUSR1 signal (the 10 after the colon in the ExitCode column is the signal number):

JobID|NodeList|State|ExitCode
239023|galvani-cn001|COMPLETED|0:0
239023.batch|galvani-cn001|COMPLETED|0:0
239023.extern|galvani-cn001|COMPLETED|0:0
239023.0|galvani-cn001|CANCELLED|0:10
239024|galvani-cn001|COMPLETED|0:0
239024.batch|galvani-cn001|COMPLETED|0:0
239024.extern|galvani-cn001|COMPLETED|0:0
239024.0|galvani-cn001|CANCELLED|0:10
239025|galvani-cn001|COMPLETED|0:0
239025.batch|galvani-cn001|COMPLETED|0:0
239025.extern|galvani-cn001|COMPLETED|0:0
239025.0|galvani-cn001|CANCELLED|0:10
239026|galvani-cn001|COMPLETED|0:0
239026.batch|galvani-cn001|COMPLETED|0:0
239026.extern|galvani-cn001|COMPLETED|0:0
239026.0|galvani-cn001|CANCELLED|0:10

However, cluster_utils did not recognise that these jobs were gone.

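I think adding srun changed the output of sacct in a way that requires an adjustment to the parsing done here: the top-level job entry and its .batch and .extern steps now report COMPLETED with 0:0, and the failure only shows up on the srun step (e.g. 239023.0).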
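
To illustrate the kind of adjustment I mean, here is a minimal sketch of a parser that looks at every step of a job instead of only the top-level entry. This is not the current cluster_utils code; the function name find_failed_jobs and the FAILED_STATES set are hypothetical, and the sketch assumes the header and column names of the --parsable2 output shown above.

```python
from __future__ import annotations

from collections import defaultdict

# Assumption: Slurm job states that should count as a failure.
FAILED_STATES = {
    "BOOT_FAIL", "CANCELLED", "DEADLINE", "FAILED",
    "NODE_FAIL", "OUT_OF_MEMORY", "PREEMPTED", "TIMEOUT",
}


def find_failed_jobs(sacct_output: str) -> dict[str, list[str]]:
    """Map each base job ID to descriptions of its failed steps.

    Expects `sacct --parsable2` output including the header line, with at
    least the columns JobID, State and ExitCode.
    """
    lines = [line for line in sacct_output.strip().splitlines() if line.strip()]
    header = lines[0].split("|")
    failed: dict[str, list[str]] = defaultdict(list)
    for line in lines[1:]:
        row = dict(zip(header, line.split("|")))
        # "239023.0" -> "239023": steps are attributed to their parent job.
        base_id = row["JobID"].split(".", 1)[0]
        # sacct may print e.g. "CANCELLED by 1000"; keep only the state name.
        state = row["State"].split()[0]
        # ExitCode has the form "<exit code>:<signal>".
        exit_code, _, signal = row["ExitCode"].partition(":")
        if state in FAILED_STATES or exit_code != "0" or signal != "0":
            failed[base_id].append(
                f"{row['JobID']}: {row['State']} ({row['ExitCode']})"
            )
    return dict(failed)


if __name__ == "__main__":
    sample = """\
JobID|NodeList|State|ExitCode
239023|galvani-cn001|COMPLETED|0:0
239023.batch|galvani-cn001|COMPLETED|0:0
239023.extern|galvani-cn001|COMPLETED|0:0
239023.0|galvani-cn001|CANCELLED|0:10
"""
    print(find_failed_jobs(sample))
    # {'239023': ['239023.0: CANCELLED (0:10)']}
```

With a check like this, job 239023 would be reported as failed because of its .0 step, even though the top-level entry says COMPLETED.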