This issue was also reported in #171.
The solution is a bit more complex than what you are currently proposing: if we simply keep job_state as it is when the state command fails, and the state command keeps failing, batchspawner will keep the spawner alive forever.
This can happen, for example, with Slurm when a job runs out of time: squeue exits with code 1 when the job id is not in the queue, but also with code 1 when squeue fails to communicate with the resource manager.
We would probably need a regex to determine whether the state command is failing because it cannot communicate with the resource manager.
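For illustration, here is a minimal sketch of how such a pattern match could separate "job gone from the queue" from "resource manager unreachable". This is not batchspawner's actual code, and the error string matched below is an assumption that would need to be verified against the deployed Slurm version:

```python
# Hedged sketch: distinguish the two squeue failure modes discussed above.
# The matched error string is an assumption; check it against your Slurm version.
import re
import subprocess

# "Invalid job id specified" => the job is genuinely gone from the queue.
# Any other non-zero exit is treated as a communication problem.
JOB_GONE_RE = re.compile(r"Invalid job id specified", re.IGNORECASE)

def query_slurm_job(job_id):
    """Return the job state, '' if the job left the queue, or None when
    squeue could not reach the resource manager."""
    proc = subprocess.run(
        ["squeue", "-h", "-j", str(job_id), "-o", "%T"],
        capture_output=True, text=True,
    )
    if proc.returncode == 0:
        return proc.stdout.strip()
    if JOB_GONE_RE.search(proc.stderr):
        return ""    # job finished or was removed: safe to treat as gone
    return None      # squeue itself failed: caller should keep the previous state
```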
Closed by #187!
The call that checks the job status may occasionally fail, and the failure is generally unrelated to the actual state of the job. When the status check fails, it makes more sense to keep the previous state.
Our HTCondor batch system becomes overloaded for a few minutes from time to time. During these periods the condor_q command consistently fails, and with the current implementation this leads to a submit-kill loop.
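One way to reconcile this with the concern above (that keeping the old state forever would keep the spawner alive indefinitely) is to reuse the previous state only for a bounded number of consecutive failures. The sketch below is purely illustrative and independent of batchspawner's actual API; `check_job_status` and the threshold are hypothetical placeholders:

```python
# Hedged sketch: keep the last known job state across transient status-command
# failures, but give up after too many failures in a row so the spawner is not
# kept alive indefinitely on stale information.
MAX_CONSECUTIVE_FAILURES = 5  # assumption: tune to how long condor_q outages last

class JobStatePoller:
    def __init__(self, check_job_status):
        self._check = check_job_status  # callable returning a state string, may raise
        self._last_state = None
        self._failures = 0

    def poll(self):
        try:
            state = self._check()
        except Exception:
            self._failures += 1
            if self._failures >= MAX_CONSECUTIVE_FAILURES:
                # Too many failures in a row: report the job as lost rather
                # than trusting the stale state forever.
                return None
            return self._last_state  # transient failure: keep the previous state
        self._failures = 0
        self._last_state = state
        return state
```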