statgen / SLURM-examples


Potentially incorrect information on SLURM-examples page #8

Open novosirj opened 6 years ago

novosirj commented 6 years ago

Hi there,

I was looking for information on this subject when I came across the SLURM-examples page (https://github.com/statgen/SLURM-examples), which says the following:

"scontrol show job -dd . Shows all information about specific SLURM job. It is worth paying attention to the following information:

Requeue. Shows how many times your job was re-queued. Some jobs may have higher priority and may pre-empt (i.e. cancel) your running jobs and put them back to the queue. If your job takes too long time and Requeue is greater than 1 then, most probably, the reason why your job takes so long is because it was cancelled and re-queued several times."

I briefly thought, wow, I learned something new, but I don't believe it's true. Per the scontrol manual (https://slurm.schedmd.com/scontrol.html):

Requeue=<0|1> Stipulates whether a job should be requeued after a node failure: 0 for no, 1 for yes.

That's in the "update" section of the scontrol manual, but none of my jobs shows anything other than Requeue=0 or Requeue=1. I looked briefly at the source code but couldn't confirm this either way; I may have been looking in the wrong place.
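
For anyone who wants to check the same thing on their own cluster, here is a minimal sketch of the idea: collect the Requeue value for each job from `scontrol show job`-style text and flag anything other than 0 or 1. The sample text and the `requeue_values` helper are made up for illustration and are not taken from the SLURM-examples page or the Slurm source; real output has many more fields.

```python
import re

# Made-up sample shaped like `scontrol show job` output (illustrative only;
# real output contains many more fields per job).
SAMPLE = """\
JobId=101 JobName=align
   Requeue=1 Restarts=0 BatchFlag=1
JobId=102 JobName=merge
   Requeue=0 Restarts=0 BatchFlag=1
"""


def requeue_values(text):
    """Map each JobId to its Requeue value (both kept as strings)."""
    values = {}
    job_id = None
    # Walk the key=value tokens in order; a JobId token starts a new job record.
    for key, value in re.findall(r"(\w+)=(\S+)", text):
        if key == "JobId":
            job_id = value
        elif key == "Requeue" and job_id is not None:
            values[job_id] = value
    return values


result = requeue_values(SAMPLE)
# Jobs whose Requeue value is something other than the documented 0 or 1.
unexpected = {job: v for job, v in result.items() if v not in ("0", "1")}
print(result)
print(unexpected)
```

On a real system, the text could come from saving the output of `scontrol show job` to a file and passing its contents to the helper. An empty `unexpected` result would be consistent with Requeue being a yes/no flag rather than a counter, though it would not prove it.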