jupyterhub / kubespawner

Kubernetes spawner for JupyterHub
https://jupyterhub-kubespawner.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Investigation - How to react properly to evicted pods #233

Open consideRatio opened 6 years ago

consideRatio commented 6 years ago

Intro

In #223 we see that an evicted user pod leaves the user with faulty routing and unable to log in: the spawner does not realize the user pod is in bad shape, and the situation can only be corrected by a hub restart.

I think I have found a solution to this, but first I want to share what I've learned about a pod's "status".

Theory

A pod's status

image What you see here under "STATUS", written out by kubectl get pods, is actually a ContainerStatus's reason.

status.phase

The phase gives an easy overview, but it is not what you see when you run kubectl get pods, even though you will recognize Pending and Running. image

status.containerStatuses[0].state / lastState

This is what you actually see in the STATUS field when you run kubectl get pods. There are three kinds of states: Running, Terminated, and Waiting. Both Terminated and Waiting have a reason field along with a message field. image
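To make the difference between the two fields concrete, here is a minimal sketch that reads both from a pod. It assumes the pod is available as a plain dict in the JSON shape of the Kubernetes API (for example from kubectl get pod <name> -o json); the function name summarize_pod_status and the example pod are hypothetical and not part of KubeSpawner.

```python
def summarize_pod_status(pod):
    """Return (phase, [(container_name, state_kind, reason), ...]) for a pod dict."""
    status = pod.get("status") or {}
    phase = status.get("phase")
    containers = []
    for cs in status.get("containerStatuses") or []:
        state = cs.get("state") or {}
        # A container state is one of running / terminated / waiting.
        kind = next(
            (k for k in ("running", "terminated", "waiting") if state.get(k) is not None),
            None,
        )
        # Only terminated and waiting states carry a reason; running has none.
        reason = (state.get(kind) or {}).get("reason") if kind else None
        containers.append((cs.get("name"), kind, reason))
    return phase, containers


# Hypothetical pod that is still starting up.
example_pod = {
    "status": {
        "phase": "Pending",
        "containerStatuses": [
            {"name": "notebook", "state": {"waiting": {"reason": "ContainerCreating"}}}
        ],
    }
}
print(summarize_pod_status(example_pod))
```

Here the phase is Pending while the container state is waiting with its own reason, which illustrates that the two fields carry different information.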

Issue analysis

Inspect this code

https://github.com/jupyterhub/kubespawner/blob/472a66253a3a3e0c4f07c65830feef9a273d3ec4/kubespawner/spawner.py#L1316-L1332

The code's execution logic

  1. Is the pod phase Pending? Do nothing.
  2. If not, does the notebook container lack a state? Do something!!!
  3. If not, is the notebook container in a terminated state? Do something!!!
  4. Else, do nothing.
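The four steps above can be written as a small decision function. This is a paraphrase of the list, not the KubeSpawner source: the name decide_action, its return values, and the dict-shaped pod (Kubernetes API JSON field names) are assumptions made for illustration.

```python
def decide_action(pod, container_name="notebook"):
    """Return "nothing" or "something", following the four steps in the list."""
    status = pod.get("status") or {}

    # 1. Is the pod phase Pending? Do nothing.
    if status.get("phase") == "Pending":
        return "nothing"

    # Look up the notebook container's status entry, if there is one.
    container = next(
        (cs for cs in status.get("containerStatuses") or []
         if cs.get("name") == container_name),
        None,
    )

    # 2. Does the notebook container lack a state? Do something.
    if container is None or not container.get("state"):
        return "something"

    # 3. Is the notebook container in a terminated state? Do something.
    if container["state"].get("terminated") is not None:
        return "something"

    # 4. Else, do nothing.
    return "nothing"


# Hypothetical pod whose notebook container has exited.
terminated_pod = {
    "status": {
        "phase": "Failed",
        "containerStatuses": [
            {"name": "notebook", "state": {"terminated": {"exitCode": 137}}}
        ],
    }
}
print(decide_action(terminated_pod))
```

Note that the phase is only consulted in step 1; every later decision depends on the container state alone.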

I think we can do something here to fix #223, but I'm not sure what, because I have not been able to figure out how status.phase and status.containerStatuses[<the notebook container>].state will behave if we have an Evicted pod for example.

Suggested change and action plan

Perhaps we should delete pods whose status.phase is Succeeded or Failed? That would probably ensure that routes etc. for users whose pods show a kubectl get pods "STATUS" of Completed or Evicted are deleted properly, so those users can respawn without needing the hub to restart.
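As a rough sketch of this suggestion, the check could look like the following. It assumes a dict-shaped pod using Kubernetes API JSON field names; should_delete and TERMINAL_PHASES are hypothetical names, and the actual deletion and route cleanup are left out.

```python
# Phases in which the pod's containers will not run again.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def should_delete(pod):
    """Return True if the pod's status.phase is Succeeded or Failed."""
    phase = (pod.get("status") or {}).get("phase")
    return phase in TERMINAL_PHASES


# Hypothetical evicted pod: phase is Failed, with reason Evicted.
evicted_pod = {"status": {"phase": "Failed", "reason": "Evicted"}}
running_pod = {"status": {"phase": "Running"}}

print(should_delete(evicted_pod), should_delete(running_pod))
```

A check on the phase alone would cover both Completed and Evicted pods without having to inspect the container states.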

Ping @minrk @betatim @choldgraf !

Things to learn / document

Concrete questions I'd like answered

References

By looking at the PodStatus object, you can inspect nested fields such as phase, or the containerStatuses array of ContainerStatus objects, etc.

I made a mindmap about pod.state things and events.

minrk commented 2 years ago

I just came across this because I was looking at orphaned, evicted pods on mybinder.org.

Using a pod with this state:

Status:               Failed
Reason:               Evicted
Message:              The node was low on resource: memory. Container notebook was using 1900004Ki, which exceeds its request of 471859200.

The KubeSpawner logs show that the Spawner does notice that the pod has stopped and treats it as a failure:

[W 2021-11-20 20:47:31.401 JupyterHub base:1072] User jupyterlab-jupyterlab-demo-cmqn27qt server stopped, with exit
[I 2021-11-20 20:47:31.401 JupyterHub proxy:309] Removing user jupyterlab-jupyterlab-demo-cmqn27qt from proxy (/user/jupyterlab-jupyterlab-demo-cmqn27qt/)

which means that the more severe problem that prompted this Issue may be resolved (I haven't been able to figure out the time between eviction and noticing that it stopped). But the pod is still not deleted for some reason.
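For spotting such leftover pods, a minimal sketch along these lines might help. It assumes the pod list is the parsed JSON from kubectl get pods -o json (a list object with an items array); find_evicted and the pod names in the example are hypothetical.

```python
def find_evicted(pod_list):
    """Return the names of pods with status.phase Failed and status.reason Evicted."""
    names = []
    for pod in pod_list.get("items") or []:
        status = pod.get("status") or {}
        if status.get("phase") == "Failed" and status.get("reason") == "Evicted":
            names.append(pod["metadata"]["name"])
    return names


# Hypothetical pod list; real input would come from the cluster.
pod_list = {
    "items": [
        {"metadata": {"name": "jupyter-a"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "jupyter-b"},
         "status": {"phase": "Failed", "reason": "Evicted"}},
    ]
}
print(find_evicted(pod_list))
```

The resulting names could then be compared against the hub's active servers to see which evicted pods are orphaned.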