jupyterhub / zero-to-jupyterhub-k8s

Helm Chart & Documentation for deploying JupyterHub on Kubernetes
https://zero-to-jupyterhub.readthedocs.io

Users get 500 error if pod is removed outside of JupyterHub control #681

Closed: jacobtomlinson closed this issue 4 years ago

jacobtomlinson commented 6 years ago

I've noticed that if a user's pod is removed outside of JupyterHub's control they get a 500 error.

If I start a notebook server and then the kubernetes node which is hosting the pod fails (this can be simulated by using kubectl to kill the pod manually) JupyterHub doesn't seem to notice the pod has gone.

If I start a notebook server and the Kubernetes node hosting the pod then fails (this can be simulated by using kubectl to delete the pod manually), JupyterHub doesn't seem to notice the pod has gone. Then, if the user tries to go back to the control panel and stop or start the notebook server again, they get 500 timeout errors.
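
For anyone who wants to reproduce this programmatically rather than with kubectl, here is a rough sketch using the official Python Kubernetes client. The pod name (taken from the log below) and the namespace are assumptions, so adjust them for your cluster:

    # Reproduction sketch: delete a user's pod behind JupyterHub's back,
    # simulating a node failure. The pod name and namespace are
    # assumptions (z2jh-style defaults); adjust for your cluster.
    from kubernetes import client, config

    config.load_kube_config()  # authenticate the same way kubectl does
    v1 = client.CoreV1Api()
    v1.delete_namespaced_pod(name="jupyter-rprudden", namespace="jhub")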

JupyterHub Logs

    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/tornado/web.py", line 1512, in _execute
        result = yield result
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/handlers/base.py", line 713, in get
        raise copy.copy(exc).with_traceback(exc.__traceback__)
      File "/usr/local/lib/python3.6/dist-packages/tornado/web.py", line 1512, in _execute
        result = yield result
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/handlers/base.py", line 720, in get
        yield gen.with_timeout(timedelta(seconds=self.slow_spawn_timeout), spawner._spawn_future)
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/handlers/base.py", line 445, in finish_user_spawn
        yield spawn_future
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 439, in spawn
        raise e
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 378, in spawn
        ip_port = yield gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
      File "/usr/local/lib/python3.6/dist-packages/kubespawner/spawner.py", line 995, in start
        timeout=self.start_timeout
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/utils.py", line 135, in exponential_backoff
        raise TimeoutError(fail_message)
    TimeoutError: pod/jupyter-rprudden did not start in 300 seconds!
consideRatio commented 6 years ago

Yes!! Thank you for clarifying this properly! This is not a great user experience and I'd love to see it improved! I, for example, use cheaper preemptible nodes that can be killed at any time, so this is very relevant.

My initial thoughts

  1. This is a general issue for all spawners.
  2. All spawners must implement start, poll, and stop (and get_state, load_state, and clear_state).
    • Question 1: How is the poll method used by JupyterHub in practice? (See the sketch after this list.)
    • Question 2: When we arrive at the 500 error, what process is returning it?
    • Question 3: If we actually know something about the state of the pod via the poll method, can we take a better action than leaving the user with a 500 error?
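
To make Question 1 concrete, here is a minimal sketch of the poll contract as I understand it: JupyterHub calls poll() periodically (every Spawner.poll_interval seconds, 30 by default) and whenever it needs to know whether a server is up; returning None means "still running", and any other value is treated as an exit status. This is only an illustration, not KubeSpawner's actual implementation, and the namespace and pod name are assumptions:

    # Illustration of the Spawner.poll contract; not KubeSpawner's real code.
    from kubernetes import client, config
    from kubernetes.client.rest import ApiException

    class PollSketch:
        namespace = "jhub"              # assumption: z2jh-style namespace
        pod_name = "jupyter-rprudden"   # assumption: naming seen in the log above

        def poll(self):
            # JupyterHub treats None as "still running" and any other
            # return value as an exit status, i.e. the server is gone.
            config.load_kube_config()
            v1 = client.CoreV1Api()
            try:
                pod = v1.read_namespaced_pod(self.pod_name, self.namespace)
            except ApiException as e:
                if e.status == 404:
                    return 1  # pod was deleted behind our back: report it as exited
                raise
            if pod.status.phase == "Succeeded":
                return 0
            if pod.status.phase == "Failed":
                return 1
            return None  # pod still exists and is running (or starting)
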
jacobtomlinson commented 6 years ago

I think there are two different errors happening on my end.

The first is that when the pod dies, JupyterHub seems to think it is still alive and continues to proxy requests to it; this results in 408 timeout errors, which I don't have logs for.

The second is the one above, which happens when going back to JupyterHub and starting the notebook server again. It is caused either by KubeSpawner failing to request the pod, or by the cluster being low on resources and not fulfilling the pod request within the 300-second timeout.
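
As a stopgap while this is investigated, the timeouts involved can be tuned in jupyterhub_config.py. This is only a sketch of a mitigation, not a fix; the values are illustrative:

    # jupyterhub_config.py: a mitigation sketch, not a fix.
    # Poll spawners more often so a dead pod is noticed sooner
    # (Spawner.poll_interval defaults to 30 seconds).
    c.Spawner.poll_interval = 10
    # Fail the spawn earlier than the 300 seconds seen in the traceback
    # above, so the user is not left waiting for a pod that will never start.
    c.Spawner.start_timeout = 120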

consideRatio commented 4 years ago

I'm closing this issue at this point, two years later. I think this has been mitigated by JupyterHub logic that is aware that servers can die and is now consistently able to notice when they do.