Closed: jacobtomlinson closed this issue 4 years ago
Yes!! Thank you for clarifying this properly! This is not a great user experience and I'd love to see it improved! I, for example, use cheaper preemptible nodes that can be killed at any time, so this is very relevant to me.
The Spawner API consists of `start`, `poll`, and `stop` (plus `get_state`, `load_state`, and `clear_state`). How is the `poll` method used by JupyterHub in practice? Is a pod that has disappeared supposed to be picked up by the `poll` method?
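For context on the `poll` question: as I understand it, JupyterHub calls `Spawner.poll` periodically and expects `None` while the single-user server is alive and an integer exit status once it is not. Below is a minimal sketch of that contract, assuming the `kubernetes` Python client and a Hub running inside the cluster; the pod name and namespace handling is hypothetical, not KubeSpawner's actual code.

```python
# A minimal sketch (not KubeSpawner's real implementation) of the contract
# JupyterHub expects from Spawner.poll: return None while the server is
# running, or an exit status once it has stopped. If poll reported the
# deleted pod, the Hub could mark the server as stopped instead of
# continuing to proxy requests to it.
from kubernetes import client, config
from kubernetes.client.rest import ApiException


def pod_exit_status(pod_name: str, namespace: str):
    """Return None if the pod is running, or an exit-status-like integer if not."""
    config.load_incluster_config()   # assumes the Hub runs inside the cluster
    v1 = client.CoreV1Api()
    try:
        pod = v1.read_namespaced_pod(pod_name, namespace)
    except ApiException as e:
        if e.status == 404:
            return 1                 # pod is gone entirely
        raise
    if pod.status.phase in ("Succeeded", "Failed"):
        return 1                     # pod finished or crashed
    return None                      # still running as far as Kubernetes knows


class PodAwareSpawner:               # stand-in for a Spawner subclass
    pod_name = "jupyter-someuser"    # placeholder; a real spawner tracks this itself
    namespace = "jhub"               # placeholder namespace

    async def poll(self):
        return pod_exit_status(self.pod_name, self.namespace)
```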
I think there are two different errors happening on my end. One is that when the pod dies, JupyterHub seems to think it is still alive and continues to proxy requests to it; this results in 408 timeout errors which I don't have logs for.
The second is the one above, which happens when going back to JupyterHub and starting the notebook again. This is caused either by KubeSpawner failing to request the pod, or by the cluster being low on resources and not fulfilling the pod request within the 300-second timeout.
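If the 300-second limit here is JupyterHub's own spawn timeout (rather than something on the cluster side), it can presumably be raised in `jupyterhub_config.py`. The option names below are the standard `Spawner` traits; the values are only illustrative.

```python
# jupyterhub_config.py -- illustrative values, assuming the limit being hit is
# JupyterHub's spawn timeout rather than a cluster-side constraint.
c.Spawner.start_timeout = 600   # seconds to wait for the server to be scheduled and start
c.Spawner.http_timeout = 120    # seconds to wait for the started server to respond over HTTP
```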
I'm closing this issue at this point, two years later. I think it has been mitigated by JupyterHub logic that is aware that servers can die and is now consistently able to detect when they do.
I've noticed that if a user's pod is removed outside of JupyterHub's control, they get a 500 error.
If I start a notebook server and then the Kubernetes node which is hosting the pod fails (this can be simulated by using `kubectl` to kill the pod manually), JupyterHub doesn't seem to notice the pod has gone. Then if the user tries to go back to the control panel and stop or start the notebook server again, they get timeout 500 errors.
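For anyone trying to reproduce this, here is a minimal sketch of removing the user's pod out-of-band with the `kubernetes` Python client (equivalent to deleting it manually with `kubectl`); the pod name and namespace are assumptions about a typical KubeSpawner deployment, so adjust them to your setup.

```python
# Reproduce the failure by deleting the user's pod outside of JupyterHub's
# control. The pod name and namespace below are assumptions.
from kubernetes import client, config

config.load_kube_config()   # or load_incluster_config() when run inside the cluster
v1 = client.CoreV1Api()
v1.delete_namespaced_pod(name="jupyter-testuser", namespace="jhub")
```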
JupyterHub Logs