riga / law

Build large-scale task workflows: luigi + job submission + remote targets + environment sandboxing using Docker/Singularity
http://law.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Fix querying slurm jobs after they're gone #125

Closed lmoureaux closed 2 years ago

lmoureaux commented 2 years ago

When a slurm job is cancelled outside of Law's control while Law is not actively polling it, Law still thinks it is running. On the first poll, it will then normally find out that the job has failed and restart it.

However, the slurm scheduler forgets about old jobs after some time, and querying the status of such a job returns an error. This PR handles that case by returning a FAILED status for all jobs when squeue fails with "Invalid job id specified".
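
For illustration, a minimal standalone sketch of that approach (the helper below and its return values are made up for this example and are not Law's actual job manager API):

import subprocess

def slurm_job_status(job_id):
    # Query squeue for a single job; once the scheduler has forgotten the
    # job, squeue exits non-zero with "Invalid job id specified".
    proc = subprocess.run(
        ["squeue", "--noheader", "--format=%i %T", "--jobs", str(job_id)],
        capture_output=True, text=True,
    )
    if proc.returncode != 0:
        if "Invalid job id specified" in proc.stderr:
            # the scheduler no longer knows the job, so report it as failed
            # and let the job loop resubmit it
            return "FAILED"
        raise RuntimeError("squeue failed: " + proc.stderr.strip())
    line = proc.stdout.strip()
    # an empty result also means the job is gone from the queue
    return line.split()[1] if line else "FAILED"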

lmoureaux commented 2 years ago

I'm getting this error all the time without the patch, and no jobs are submitted (see the attached screenshot). It works smoothly with my change.

riga commented 2 years ago

Hi @lmoureaux ,

Thanks for the PR! (this escaped my radar for some reason) I’ll have a look asap and get back to you :)

riga commented 2 years ago

@lmoureaux I could finally reproduce the issue you are seeing! Thanks a lot for spotting this.

Indeed, squeue returns an error when the job id is no longer accessible. I also found out that this error only appears when a single non-existing job id is queried. If one passes e.g. --jobs foo,bar, the command does not fail (exit code 0) but the query result is empty. Weird behavior.
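
For reference, the asymmetry described above can be reproduced with two direct squeue calls (the job ids here are placeholders for ids the scheduler no longer knows):

import subprocess

# a single unknown job id: squeue exits non-zero and prints
# "Invalid job id specified" on stderr
single = subprocess.run(["squeue", "--jobs", "12957592"],
                        capture_output=True, text=True)
print(single.returncode, single.stderr.strip())

# several job ids with unknown entries: squeue exits with 0
# and simply returns an empty result
multi = subprocess.run(["squeue", "--noheader", "--jobs", "12957592,12957593"],
                       capture_output=True, text=True)
print(multi.returncode, repr(multi.stdout))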

However, below the squeue query there is already a check using the sacct accounting tool, which can access job information (including error codes and reasons) for a much longer period. In my test run, it produces

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
12957592     cf.Calibr+    cms-uhh     unihh2         40 CANCELLED+      0:0
12957592.ba+      batch                unihh2         40  CANCELLED     0:15

and the error message can actually be very useful. Therefore, we should perhaps not create and return an exception as you did in the PR but rather forward these cases to the sacct query below. I'll push that change to your branch (or create a PR for you :) ).
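
A rough sketch of that fallback flow, with plain subprocess calls standing in for Law's internal query helpers (the function name and the tuple it returns are assumptions for this example, not Law's actual code):

import subprocess

def slurm_job_status(job_id):
    # first ask squeue; for jobs the scheduler has already forgotten,
    # this either fails ("Invalid job id specified") or returns nothing
    squeue = subprocess.run(
        ["squeue", "--noheader", "--format=%T", "--jobs", str(job_id)],
        capture_output=True, text=True,
    )
    if squeue.returncode == 0 and squeue.stdout.strip():
        return squeue.stdout.strip(), None

    # fall back to sacct, which keeps state, exit code and cancellation
    # reason around for a much longer period
    sacct = subprocess.run(
        ["sacct", "--noheader", "--parsable2",
         "--format=JobID,State,ExitCode", "--jobs", str(job_id)],
        capture_output=True, text=True,
    )
    for line in sacct.stdout.splitlines():
        jid, state, exit_code = line.split("|")
        if jid == str(job_id):
            # e.g. ("CANCELLED by 12345", "0:0")
            return state, exit_code

    # neither tool knows the job anymore
    return "FAILED", None

This keeps the squeue path untouched for jobs that are still known to the scheduler, while the reason reported by sacct (e.g. CANCELLED+) can be surfaced in the job loop output shown below.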

riga commented 2 years ago

The response in the job loop is now

21:32:12: all: 1, pending: 0 (+0), running: 0 (+0), finished: 0 (+0), retry: 1 (+1), failed: 0 (+0)
1 failed job(s) in task cf.CalibrateEvents_hbt_config_analy__1_0_79a0cfc16e:
    job: 1, branch(es): 0, id: 12957592, status: retry, code: 0, error: CANCELLED+, log: /home/.../stdall_0To1.txt
going to submit 1 slurm job(s)

and transparently prints the original reason that is accessible through sacct.

lmoureaux commented 2 years ago

Thanks for the analysis and the better fix!