jupyterhub / batchspawner

Custom Spawner for Jupyterhub to start servers in batch scheduled systems
BSD 3-Clause "New" or "Revised" License

Introduce unknown job status to handle communication problem with resource manager #179

Closed by cmd-ntrf 4 years ago

cmd-ntrf commented 4 years ago

This is a draft PR to handle the case where the resource manager does not return a status that is valid for BatchSpawner, but the error message returned indicates a problem with the resource manager rather than with the job.

This happens quite often with Slurm, where `squeue` would return `slurm_load_jobs error: Socket timed out on send/recv`. The notebook job is still running fine; `squeue` was simply unable to query its state. Currently, batchspawner clears the state if it cannot query it, and the job is subsequently cancelled, which is inconvenient. This problem was reported in #171 and #178.

The proposed solution is to refactor the job status querying to return a `JobStatus` object instead of the `job_status` string. We introduce four statuses: NOTQUEUED, RUNNING, PENDING, and UNKNOWN. Because the method deals with job status rather than job state, `read_job_state` is renamed `query_job_status`. This also makes the method names more consistent with the documentation.
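For illustration, a minimal sketch of what such a `JobStatus` enum could look like. The four member names come from this PR; the concrete values and comments are assumptions, not the merged code:

```python
from enum import Enum

class JobStatus(Enum):
    """Outcome of querying the batch system for a job (sketch)."""
    NOTQUEUED = 0  # job is not known to the queue (finished or never submitted)
    RUNNING = 1    # job is running; the single-user server should be reachable
    PENDING = 2    # job is waiting in the queue
    UNKNOWN = 3    # the resource manager could not be queried; job may be fine
```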

Instead of calling the `job_is*` functions, `query_job_status` returns a `JobStatus` object which can be used to determine what should be done next. The unknown status is determined by the `state_isunknown` function, and for now only SlurmSpawner implements a regex to handle the cases where the query command exits with code 1.
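To make the flow concrete, here is a hedged sketch of how the unknown-state detection and the status mapping might fit together, reusing the `JobStatus` enum sketched above. The state strings, the exact regex, and the `classify` helper are illustrative assumptions; only `state_isunknown`, `query_job_status`, and the Slurm error message come from the PR itself:

```python
import re

# Assumption: squeue printing a load error and exiting with code 1 means the
# scheduler could not be reached, not that the job has disappeared.
SLURM_UNKNOWN_RE = re.compile(
    r"^slurm_load_jobs error: (Socket timed out on send/recv"
    r"|Unable to contact slurm controller)"
)

def state_isunknown(status_output: str, exit_code: int) -> bool:
    """Heuristic: did the status query fail on the scheduler side?"""
    return exit_code == 1 and SLURM_UNKNOWN_RE.search(status_output) is not None

def classify(status_output: str, exit_code: int) -> JobStatus:
    """Map raw squeue output to a JobStatus (hypothetical helper)."""
    if state_isunknown(status_output, exit_code):
        return JobStatus.UNKNOWN
    if re.match(r"(RUNNING|COMPLETING)", status_output):
        return JobStatus.RUNNING
    if re.match(r"(PENDING|CONFIGURING)", status_output):
        return JobStatus.PENDING
    return JobStatus.NOTQUEUED
```

With this shape, the polling code can keep the last known state when it sees UNKNOWN, instead of clearing it and letting the job be cancelled.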

rkdarst commented 4 years ago

Since I wasn't sure if I should edit directly here, I'm working on this in another branch and will push it as another PR. Overall it works well; it took some care to make the tests pass (including understanding what the expected JupyterHub behavior is).

mbmilligan commented 4 years ago

Since it looks like #187 supersedes this PR, I am closing this one. Someone please correct me if that's incorrect.