jupyterhub / batchspawner

Custom Spawner for JupyterHub to start servers in batch-scheduled systems
BSD 3-Clause "New" or "Revised" License

When the Scheduler/RM fails? #171

Closed · jbaksta closed this issue 3 years ago

jbaksta commented 4 years ago

Many schedulers and their sibling daemons are designed so that the controller can be down for a set amount of time while the compute-node daemons continue to run the job, even though the controller cannot be reached. Is there any native ability within batchspawner to retry commands that "fail" (i.e., return a non-zero exit status)?

There are at least two edge cases that I can think of:

  1. When the controller can't be reached, the proxy information gets removed and the user loses access to their notebook, even though their job may very well continue on despite the hub thinking the job doesn't exist.

  2. When the JupyterHub process tries to cancel a job but the request cannot complete, the job may very well continue to run, but with the state information removed from the database the user would lose access to it.

Both are interesting cases, and I want to configure the environment to be a little more resilient to scheduler outages.

The quick approach is to override the query/submit/cancel commands as part of the configuration, but I'm also curious whether anybody else has run into these issues or thought about them at this point.
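
For illustration, something along these lines in jupyterhub_config.py (the batch_query_cmd trait comes from batchspawner; the retry wrapper path and its flags are placeholders for whatever you put in front of squeue):

```python
# jupyterhub_config.py -- a minimal sketch, not a tested configuration.
# /opt/jhub/bin/squeue-retry is a hypothetical wrapper that re-runs
# squeue a few times before giving up on a non-zero exit status.
c = get_config()  # noqa: F821 -- provided by JupyterHub at load time

c.JupyterHub.spawner_class = "batchspawner.SlurmSpawner"

# Replace the stock query command with the retrying wrapper; {job_id}
# is substituted by batchspawner when the hub polls the job.
c.SlurmSpawner.batch_query_cmd = "/opt/jhub/bin/squeue-retry -h -j {job_id} -o '%T %B'"
```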

joschaschmiedt commented 4 years ago

I agree this would be very useful. Under heavy load, our SLURM controller sometimes becomes unresponsive, and users lose the connection to their servers.

Maybe a simple retry loop would already help: re-run a query/cancel command a number of times with an increasing delay before throwing an error.
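
A minimal sketch of that idea, using plain asyncio rather than batchspawner's actual internals:

```python
import asyncio


async def run_with_retry(cmd, attempts=4, base_delay=1.0):
    """Run a scheduler command, retrying with exponential backoff.

    Hypothetical helper, not part of batchspawner's API.
    """
    for attempt in range(attempts):
        proc = await asyncio.create_subprocess_shell(
            cmd,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        out, err = await proc.communicate()
        if proc.returncode == 0:
            return out.decode()
        if attempt < attempts - 1:
            # Wait 1 s, 2 s, 4 s, ... before the next attempt.
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise RuntimeError(
        f"{cmd!r} failed {attempts} times, last stderr: {err.decode().strip()}"
    )


# Example: poll a Slurm job, tolerating a briefly unresponsive controller.
# print(asyncio.run(run_with_retry("squeue -h -j 12345 -o '%T %B'")))
```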

Hoeze commented 4 years ago

I think I ran into this issue as well. How about only querying the job state when jupyterhub-singleuser stops notifying JupyterHub for some time?

jbaksta commented 4 years ago

I don't mind the polling scheme, but skipping the scheduler query whenever you do get a response from the single-user server seems like a nice method. I'm not sure how much I'll dive into the code base for this.

I also like the approach of retrying a query with a back-off delay before throwing an error.

This issue is getting a bit more painful for our users, so a quick and dirty wrapper around squeue may be in order soon to alleviate (not fix) the experience a bit.
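
Roughly this kind of thing, as a hypothetical stand-in for squeue (retry counts and delays are arbitrary):

```python
#!/usr/bin/env python3
"""Quick-and-dirty squeue wrapper: retry when the controller is
unresponsive so a transient failure is not reported to the hub as a
dead job. Drop-in replacement for squeue in batch_query_cmd."""
import subprocess
import sys
import time

ATTEMPTS = 5

for attempt in range(ATTEMPTS):
    # Forward all arguments to the real squeue.
    result = subprocess.run(["squeue", *sys.argv[1:]],
                            capture_output=True, text=True)
    if result.returncode == 0:
        sys.stdout.write(result.stdout)
        sys.exit(0)
    if attempt < ATTEMPTS - 1:
        time.sleep(2 ** attempt)  # back off 1, 2, 4, 8 seconds

# All attempts failed; propagate the last error.
sys.stderr.write(result.stderr)
sys.exit(result.returncode)
```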

cmd-ntrf commented 4 years ago

I have written a draft PR #179 to address the problem. Feedback welcomed.

consideRatio commented 3 years ago

Closed by #187!