Closed luator closed 6 months ago
Jobs that are terminated by the cluster (e.g. due to a timeout) no longer seem to be detected by cluster_utils.
Example: sacct --parsable2 --format=JobID,NodeList,State,ExitCode ... gives the following output for jobs that die with an unhandled SIG1 signal:
JobID|NodeList|State|ExitCode
239023|galvani-cn001|COMPLETED|0:0
239023.batch|galvani-cn001|COMPLETED|0:0
239023.extern|galvani-cn001|COMPLETED|0:0
239023.0|galvani-cn001|CANCELLED|0:10
239024|galvani-cn001|COMPLETED|0:0
239024.batch|galvani-cn001|COMPLETED|0:0
239024.extern|galvani-cn001|COMPLETED|0:0
239024.0|galvani-cn001|CANCELLED|0:10
239025|galvani-cn001|COMPLETED|0:0
239025.batch|galvani-cn001|COMPLETED|0:0
239025.extern|galvani-cn001|COMPLETED|0:0
239025.0|galvani-cn001|CANCELLED|0:10
239026|galvani-cn001|COMPLETED|0:0
239026.batch|galvani-cn001|COMPLETED|0:0
239026.extern|galvani-cn001|COMPLETED|0:0
239026.0|galvani-cn001|CANCELLED|0:10
cluster_utils didn't recognise that they're gone, though.
I think adding srun changed the output of sacct in a way that requires an adjustment to the parsing done here.
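To illustrate the kind of adjustment that might be needed, here is a minimal sketch of a parser that looks at the job steps (e.g. `239023.0`) in addition to the top-level job record. It is not the actual cluster_utils code: the names `parse_sacct`, `failed_jobs`, and `FAILED_STATES` are hypothetical, and which states count as "failed" is an assumption.

```python
from collections import defaultdict
from typing import Dict, List

# Slurm states treated as "job is gone" in this sketch (an assumption,
# not taken from cluster_utils).
FAILED_STATES = {
    "CANCELLED", "FAILED", "TIMEOUT", "NODE_FAIL",
    "OUT_OF_MEMORY", "BOOT_FAIL", "DEADLINE", "PREEMPTED",
}


def parse_sacct(output: str) -> Dict[str, List[str]]:
    """Map each base job ID to the states of the job and all of its steps.

    Expects the output of `sacct --parsable2 --format=JobID,...,State,...`,
    i.e. a header row followed by '|'-separated rows.
    """
    lines = [line for line in output.splitlines() if line.strip()]
    header = lines[0].split("|")
    states = defaultdict(list)
    for line in lines[1:]:
        row = dict(zip(header, line.split("|")))
        # "239023.batch", "239023.extern" and "239023.0" all belong to 239023.
        base_id = row["JobID"].split(".", 1)[0]
        # The state can carry a suffix such as "CANCELLED by 1234".
        states[base_id].append(row["State"].split()[0])
    return dict(states)


def failed_jobs(output: str) -> List[str]:
    """Return IDs of jobs where the job or any of its steps failed."""
    return sorted(
        job_id
        for job_id, job_states in parse_sacct(output).items()
        if any(state in FAILED_STATES for state in job_states)
    )


SAMPLE = """JobID|NodeList|State|ExitCode
239023|galvani-cn001|COMPLETED|0:0
239023.batch|galvani-cn001|COMPLETED|0:0
239023.extern|galvani-cn001|COMPLETED|0:0
239023.0|galvani-cn001|CANCELLED|0:10
"""

if __name__ == "__main__":
    print(failed_jobs(SAMPLE))
```

With the sample above, the top-level record reports COMPLETED, so a parser that only reads that row would miss the CANCELLED step created by srun.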