jupyterhub / batchspawner

Custom Spawner for Jupyterhub to start servers in batch scheduled systems

Server removed on status command error #315

Open ianhinder opened 11 hours ago


Bug description

JupyterHub periodically polls each server to check its state. BatchSpawner runs a batch-system status command (e.g. `squeue`) and matches the output against regexps to determine the job status (sketched below). Jobs in the pending, running or "unknown" state are considered not to have finished, whereas "not found" is interpreted as stopped: the poll fails, the hub removes its references to the server, and a job that is in fact still running is left orphaned.
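For illustration, here is a minimal sketch of that decision logic in plain Python. It is not the actual batchspawner implementation; the regexp arguments stand in for the spawner's configurable state_pending_re, state_running_re and state_unknown_re traits.

```python
import re

def poll_from_status(job_status, pending_re, running_re, unknown_re):
    """Return None while the job may still be alive, 1 once it is treated as stopped."""
    if re.search(pending_re, job_status) or re.search(running_re, job_status):
        return None   # job queued or running: server not stopped
    if re.search(unknown_re, job_status):
        return None   # status could not be determined: do not treat as stopped
    return 1          # anything else falls through to "Not found": server stopped
```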

Errors that occur when running the status command end up being matched against the "Unknown" regexp, which is prepopulated with some known errors for SLURM:

```
^slurm_load_jobs error: (?:Socket timed out on send/recv|Unable to contact slurm controller)
```

These errors are interpreted as "Unknown" status and the poll returns None, meaning "the server has not necessarily stopped"; anything else falls through to "Not found" and the poll returns 1, meaning "the server has stopped".

I propose that this is not correct behaviour. "Not found" has a very specific meaning: the job is not in the squeue output, so it has finished. An error while running the status command means it cannot be determined whether the job is running, and should therefore lead to the "Unknown" status, not "Not found".

In my case, intermittent errors that are not covered by the regexp cause the server to be removed from JupyterHub, leaving an orphaned SLURM job. I could add the specific errors to state_unknown_re (see the workaround sketch below), but that isn't sustainable, as there could be many possible errors: I access SLURM over ssh, so every possible ssh- or network-related error would have to be included.
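As an illustration of that workaround (not a recommended fix), one could extend the configurable state_unknown_re trait in jupyterhub_config.py. The ssh patterns below are assumed examples, and the default SLURM pattern shipped with batchspawner may differ slightly from what is shown here.

```python
# jupyterhub_config.py -- illustrative workaround only
c.SlurmSpawner.state_unknown_re = (
    r"^slurm_load_jobs error: "
    r"(?:Socket timed out on send/recv|Unable to contact slurm controller)"
    r"|^ssh: connect to host .+ port \d+: "   # assumed ssh failure message
    r"|^ssh_exchange_identification: "        # assumed ssh failure message
)
```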

One possible fix would be to adjust the logic in query_job_status so that errors produce a specific marker, such as "ERROR" at the start of self.job_status, and then to include "^ERROR" in state_unknown_re. The "Not found" status should not be the fallback when nothing else matches; it should be reported only under conditions in which the job can be proved to have stopped, e.g. when the status output is empty. This would have to be done carefully so as not to break batch systems other than SLURM, which is the one I am focusing on here.
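A rough sketch of what this could look like, assuming a hypothetical run_status_command helper rather than the real query_job_status internals:

```python
# Hypothetical sketch of the proposed behaviour; everything except
# job_status and state_unknown_re is a placeholder, not a batchspawner API.
async def query_job_status(spawner):
    try:
        # run batch_query_cmd (e.g. squeue), possibly over ssh
        spawner.job_status = await spawner.run_status_command()
    except Exception as e:
        # The status command itself failed: the job state is unknown,
        # so record a sentinel that state_unknown_re can match.
        spawner.job_status = f"ERROR: {e}"

# state_unknown_re would then gain the sentinel as an alternative, e.g.
#   r"^ERROR: |^slurm_load_jobs error: ..."
# and "Not found" would be reported only when the job can be shown to have
# stopped, e.g. when the squeue output for the job is empty.
```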