Open consideRatio opened 6 years ago
I just came across this because I was looking at orphaned, evicted pods on mybinder.org.
Using a pod with this state:
Status: Failed
Reason: Evicted
Message: The node was low on resource: memory. Container notebook was using 1900004Ki, which exceeds its request of 471859200.
The KubeSpawner logs show that the Spawner does notice that the pod has stopped and treats it as a failure:
[W 2021-11-20 20:47:31.401 JupyterHub base:1072] User jupyterlab-jupyterlab-demo-cmqn27qt server stopped, with exit
[I 2021-11-20 20:47:31.401 JupyterHub proxy:309] Removing user jupyterlab-jupyterlab-demo-cmqn27qt from proxy (/user/jupyterlab-jupyterlab-demo-cmqn27qt/)
which means that the more severe problem that prompted this Issue may be resolved (I haven't been able to figure out the time between eviction and noticing that it stopped). But the pod is still not deleted for some reason.
Intro
In #223 we see that an evicted user pod leaves the user with faulty routing and unable to log in, because the spawner does not realize the user pod is in bad shape. Currently this can only be corrected by a hub restart.
I think I have found a solution to this, but first I want to share what I've learned about a pod's "status".
Theory
A pod's status
What you see under "STATUS" in the output of kubectl get pods is actually a ContainerStatus's reason.
.status.phase
The phase is easy to overview, but it is not what you see if you write kubectl get pods, even though you will recognize Pending and Running.
.status.containerStatuses.[0].state / lastState
This is what you actually see in the STATUS field when you write kubectl get pods. There are three kinds of states: Running, Terminated, and Waiting. Both Terminated and Waiting have a reason field along with a message field.
Issue analysis
Inspect this code
https://github.com/jupyterhub/kubespawner/blob/472a66253a3a3e0c4f07c65830feef9a273d3ec4/kubespawner/spawner.py#L1316-L1332
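To make the distinction from the Theory section concrete, here is a minimal sketch that pulls status.phase, the pod-level status.reason, and each container's state out of a pod manifest. It is not KubeSpawner's code: it works on plain dicts shaped like the Kubernetes PodStatus JSON, and the helper name summarize_pod_status and both sample pods are made up for illustration. The evicted sample reuses the values quoted at the top of this issue; leaving out containerStatuses for it is an assumption, not something verified here.

```python
def summarize_pod_status(pod):
    """Return (phase, pod_reason, container_states) for a pod manifest dict.

    container_states maps container name -> (kind, reason), where kind is
    one of "running", "waiting" or "terminated".
    """
    status = pod.get("status", {})
    phase = status.get("phase")
    # Pod-level reason, e.g. "Evicted". This lives on the pod, not on a container.
    pod_reason = status.get("reason")

    container_states = {}
    for cs in status.get("containerStatuses") or []:
        state = cs.get("state") or {}
        # A container state carries exactly one of these three keys.
        for kind in ("running", "waiting", "terminated"):
            if kind in state:
                detail = state[kind] or {}
                container_states[cs["name"]] = (kind, detail.get("reason"))
    return phase, pod_reason, container_states


# Hypothetical sample: the evicted pod described at the top of this issue.
evicted_pod = {
    "status": {
        "phase": "Failed",
        "reason": "Evicted",
        "message": "The node was low on resource: memory.",
        # Assumption: no containerStatuses are reported for this pod.
    }
}

# Hypothetical sample: a pod whose notebook container is stuck waiting.
waiting_pod = {
    "status": {
        "phase": "Pending",
        "containerStatuses": [
            {"name": "notebook", "state": {"waiting": {"reason": "ImagePullBackOff"}}}
        ],
    }
}

evicted_summary = summarize_pod_status(evicted_pod)
waiting_summary = summarize_pod_status(waiting_pod)
```

In this sketch the evicted pod is only recognizable through phase and the pod-level reason, which is why checking container states alone could miss it.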
The code's execution logic
Pending? Do nothing.
state? Do something!!!
I think we can do something here to fix #223, but I'm not sure what, because I have not been able to figure out how status.phase and status.containerStatuses[<the notebook container>].state will behave if we have an Evicted pod for example.
Suggested change and action plan
Perhaps we should delete pods that are in the Succeeded and Failed status.phase? That would probably make routes etc. for users having pods with a kubectl get pods "STATUS" of Completed or Evicted be deleted properly and be able to respawn without needing the hub to restart.
Ping @minrk @betatim @choldgraf !
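The suggestion above could be sketched roughly as follows. This is an illustration under assumptions, not a patch for KubeSpawner: pods are plain dicts shaped like Kubernetes pod manifests, the names TERMINAL_PHASES, is_finished and pods_to_delete are hypothetical, and the actual deletion call is left out because it depends on how the spawner talks to the API server.

```python
import logging

log = logging.getLogger("pod-cleanup-sketch")

# Phases in which none of the pod's containers will run again.
TERMINAL_PHASES = frozenset({"Succeeded", "Failed"})


def is_finished(pod):
    """True if the pod's status.phase is Succeeded or Failed."""
    return pod.get("status", {}).get("phase") in TERMINAL_PHASES


def pods_to_delete(pods):
    """Return the names of pods in a terminal phase, logging why each was picked."""
    names = []
    for pod in pods:
        if not is_finished(pod):
            continue
        status = pod["status"]
        name = pod["metadata"]["name"]
        log.warning(
            "Pod %s finished: phase=%s reason=%s",
            name, status["phase"], status.get("reason"),
        )
        names.append(name)
    return names


# Hypothetical sample pods; the names are invented for this example.
sample_pods = [
    {"metadata": {"name": "jupyter-a"}, "status": {"phase": "Running"}},
    {"metadata": {"name": "jupyter-b"}, "status": {"phase": "Failed", "reason": "Evicted"}},
    {"metadata": {"name": "jupyter-c"}, "status": {"phase": "Succeeded"}},
    {"metadata": {"name": "jupyter-d"}, "status": {"phase": "Pending"}},
]

finished = pods_to_delete(sample_pods)
```

Keying the decision on status.phase rather than on container states keeps Pending and Running pods untouched while still catching pods that kubectl get pods shows as Completed or Evicted, which is the behavior the suggestion is after.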
Things to learn / document
Concrete questions I'd like answered
status.phase when it happens?
status.phase when a container is found in terminated state?
We should log something about c.state.terminated.reason as well as data.status.phase when c.state.terminated is truthy.
References
By looking at the PodStatus object, you can inspect nested resources like the phase field, or the containerStatuses array of ContainerStatus objects, etc.
I made a mindmap about pod.state things and events.