jupyterhub / kubespawner

Kubernetes spawner for JupyterHub
https://jupyterhub-kubespawner.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Pod Evictions (memory, node shutdowns) - requires manual hub restart! #223

Closed consideRatio closed 1 year ago

consideRatio commented 6 years ago

About

We have an issue where pods that are removed unexpectedly seem to require a hub restart. This could happen if a node fails or is preempted, or if an admin deletes a user pod, I think. This is really bad, I'd say, as it requires the affected user to contact an administrator, and then the administrator has to restart the hub to resolve it.

I deem this vital to fix, as preemptible nodes (Google) and spot instances (Amazon) are an awesome way to reduce cost, but using them risks causing this kind of huge trouble right now!

I'm still lacking some insight into what's going on in the proxy and the user management within JupyterHub, though, so I'm hoping someone can pick this issue up from my writeup. @minrk or @yuvipanda, perhaps you know someone to /cc, or could give me a pointer on where to look to solve this?

Experience summary

  1. Pod dies
  2. "503 - Service Unavailable"
  3. hub/admin stop pod
  4. "Your server is stopping"
  5. Distracting consequence
  6. Issue still not resolved
  7. Hub restart
  8. Everything is fine

Experience log

#### 1. A preemptible node was reclaimed and my user pod was lost


#### 2. Later - I visit the hub

image


#### 3. Directly after - I visit hub/admin and press the stop server button

image


#### 4. I revisit my singleuser server

[E 2018-08-01 09:40:50.170 JupyterHub gen:974] Exception in Future <Task finished coro=<BaseHandler.stop_single_user..stop() done, defined at /usr/local/lib/python3.6/dist-packages/jupyterhub/handlers/base.py:730> exception=TimeoutError('pod/jupyter-erik-2esundell did not disappear in 300 seconds!',)> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 970, in error_callback
        future.result()
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/handlers/base.py", line 740, in stop
        await user.stop(name)
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 548, in stop
        await spawner.stop()
      File "/usr/local/lib/python3.6/dist-packages/kubespawner/spawner.py", line 1664, in stop
        timeout=self.start_timeout
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/utils.py", line 155, in exponential_backoff
        raise TimeoutError(fail_message)
    TimeoutError: pod/jupyter-erik-2esundell did not disappear in 300 seconds!


#### 5. I look again

- This is now what I see
![image](https://user-images.githubusercontent.com/3837114/43514567-c20be862-9580-11e8-8fd6-6bb2f697bc1e.png)

- I think this is because I refreshed the pending-stop page excessively, once for every log message entry in the point above.

---

#### 6. I refresh

- This is what I see
![image](https://user-images.githubusercontent.com/3837114/43514641-e62f8208-9580-11e8-94f5-cf506be05880.png)

- This is what the proxy (CHP) logs say: a route was added, but it is unreachable

09:41:36.531 - info: [ConfigProxy] Adding route /user/erik.sundell -> http://10.20.1.4:8888
09:41:36.532 - info: [ConfigProxy] 201 POST /api/routes/user/erik.sundell
09:42:36.528 - info: [ConfigProxy] 200 GET /api/routes
09:43:36.528 - info: [ConfigProxy] 200 GET /api/routes
09:44:36.529 - info: [ConfigProxy] 200 GET /api/routes
09:45:36.529 - info: [ConfigProxy] 200 GET /api/routes
09:46:36.529 - info: [ConfigProxy] 200 GET /api/routes
09:47:36.529 - info: [ConfigProxy] 200 GET /api/routes
09:48:19.959 - error: [ConfigProxy] 503 GET /user/erik.sundell/ Error: connect EHOSTUNREACH 10.20.1.4:8888
    at Object.exports._errnoException (util.js:1020:11)
    at exports._exceptionWithHostPort (util.js:1043:20)
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1086:14)
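
For reference, the proxy's routing table can be inspected directly over configurable-http-proxy's REST API. Below is a minimal debugging sketch; the API URL and the environment variables are assumptions about the deployment, not something taken from this incident.

import os
import requests

# Query configurable-http-proxy for its routing table and probe each target.
api_url = os.environ.get("PROXY_API_URL", "http://proxy-api:8001")
token = os.environ["CONFIGPROXY_AUTH_TOKEN"]

routes = requests.get(
    f"{api_url}/api/routes",
    headers={"Authorization": f"token {token}"},
).json()

for prefix, route in routes.items():
    target = route.get("target")
    try:
        requests.get(target, timeout=3)
        print(f"{prefix} -> {target}: reachable")
    except requests.RequestException as exc:
        print(f"{prefix} -> {target}: UNREACHABLE ({exc})")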


#### 7. I restart the hub

- The hub logs after restart

[I 2018-08-01 10:17:10.079 JupyterHub app:1667] Using Authenticator: builtins.GenericAuthenticator
[I 2018-08-01 10:17:10.079 JupyterHub app:1667] Using Spawner: kubespawner.spawner.KubeSpawner
[I 2018-08-01 10:17:10.080 JupyterHub app:1010] Loading cookie_secret from env[JPY_COOKIE_SECRET]
[W 2018-08-01 10:17:10.155 JupyterHub app:1129] JupyterHub.hub_connect_port is deprecated as of 0.9. Use JupyterHub.hub_connect_url to fully specify the URL for connecting to the Hub.
[I 2018-08-01 10:17:10.184 JupyterHub app:1199] Not using whitelist. Any authenticated user will be allowed.
[I 2018-08-01 10:17:10.237 JupyterHub reflector:176] watching for pods with label selector component=singleuser-server / field selector in namespace jh
[W 2018-08-01 10:17:10.263 JupyterHub app:1460] erik.sundell appears to have stopped while the Hub was down
[W 2018-08-01 10:17:10.317 JupyterHub app:1513] Deleting OAuth client jupyterhub-user-erik-sundell
[I 2018-08-01 10:17:10.329 JupyterHub app:1849] Hub API listening on http://0.0.0.0:8081/hub/
[I 2018-08-01 10:17:10.329 JupyterHub app:1851] Private Hub API connect url http://10.0.8.147:8081/hub/
[I 2018-08-01 10:17:10.330 JupyterHub app:1864] Not starting proxy
[I 2018-08-01 10:17:10.334 JupyterHub proxy:301] Checking routes
[W 2018-08-01 10:17:10.335 JupyterHub proxy:363] Deleting stale route /user/erik.sundell/
[I 2018-08-01 10:17:10.337 JupyterHub app:1906] JupyterHub is now running at http://10.0.2.156:80/
[I 2018-08-01 10:18:10.341 JupyterHub proxy:301] Checking routes


#### 8. I log in again and everything works
- The hub logs after restart and login

[I 2018-08-01 10:21:49.143 JupyterHub base:499] User logged in: erik.sundell
[I 2018-08-01 10:21:49.145 JupyterHub log:158] 302 GET /hub/oauth_callback?code=[secret]&state=[secret]&session_state=[secret] -> /user/erik.sundell/ (@10.20.0.1) 630.45ms
[I 2018-08-01 10:21:49.185 JupyterHub log:158] 302 GET /user/erik.sundell/ -> /hub/user/erik.sundell/ (@10.20.0.1) 1.56ms
[I 2018-08-01 10:21:49.322 JupyterHub spawner:1550] PVC claim-erik-2esundell already exists, so did not create new pvc.
[W 2018-08-01 10:21:59.231 JupyterHub base:679] User erik.sundell is slow to start (timeout=10)
[I 2018-08-01 10:21:59.231 JupyterHub base:1016] erik.sundell is pending spawn
[I 2018-08-01 10:21:59.238 JupyterHub log:158] 200 GET /hub/user/erik.sundell/ (erik.sundell@10.20.0.1) 10015.25ms
[I 2018-08-01 10:22:00.703 JupyterHub log:158] 200 GET /hub/api (@10.20.1.4) 0.75ms
[I 2018-08-01 10:22:03.854 JupyterHub base:628] User erik.sundell took 14.623 seconds to start
[I 2018-08-01 10:22:03.854 JupyterHub proxy:242] Adding user erik.sundell to proxy /user/erik.sundell/ => http://10.20.1.4:8888
[I 2018-08-01 10:22:03.857 JupyterHub users:510] Server erik.sundell is ready

jgerardsimcock commented 6 years ago

I can confirm that we experience this frequently and my solution is always to restart the hub. We are on GKE.

consideRatio commented 6 years ago

@minrk @betatim @choldgraf this looks suspicious to me

1. Only DELETED state

We only react to the DELETED event type; we don't look for an EVICTED state, for example. I can look into exactly what states a pod can be in and which of them overlap, because they will overlap: for example, I recall a pod is considered Pending even while its status shows ContainerCreating. Perhaps we can find one check that covers both DELETED and EVICTED at the same time? https://github.com/jupyterhub/kubespawner/blob/472a66253a3a3e0c4f07c65830feef9a273d3ec4/kubespawner/reflector.py#L217-L219
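
For illustration, here is a minimal standalone sketch, using the kubernetes Python client directly rather than kubespawner's actual reflector, of how a watch could react to more than just DELETED events. The namespace and label selector are taken from the logs above.

from kubernetes import client, config, watch

config.load_incluster_config()  # or config.load_kube_config() outside the cluster
v1 = client.CoreV1Api()

w = watch.Watch()
for event in w.stream(v1.list_namespaced_pod,
                      namespace="jh",
                      label_selector="component=singleuser-server"):
    pod = event["object"]
    if event["type"] == "DELETED":
        print(f"pod {pod.metadata.name} deleted")
    elif pod.status.phase == "Failed" and pod.status.reason == "Evicted":
        # An eviction arrives as a MODIFIED event with phase Failed / reason Evicted,
        # so watching only for DELETED events never notices it.
        print(f"pod {pod.metadata.name} evicted: {pod.status.message}")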

2. Abort in finally

The finally block has a stop and a break in it, even though the except blocks have continue etc... Hmmm... UPDATE: oops, okay, hmmm, perhaps not an issue after all; I jumped the gun. https://github.com/jupyterhub/kubespawner/blob/472a66253a3a3e0c4f07c65830feef9a273d3ec4/kubespawner/reflector.py#L244-L248

For reference, this is how control flows through a try / except / else / finally construct...

Try -> Except -> Finally

image

Try -> Else -> Finally

image
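
In case the screenshots above don't render, here is a minimal runnable demonstration of the two flows:

def flow(fail):
    try:
        if fail:
            raise ValueError("boom")
    except ValueError:
        print("except")   # Try -> Except -> Finally
    else:
        print("else")     # Try -> Else -> Finally
    finally:
        print("finally")  # runs in both cases

flow(fail=True)   # prints: except, finally
flow(fail=False)  # prints: else, finally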

consideRatio commented 6 years ago

@minrk helped me narrow the space where this needs to be fixed:

spawner.py's poll(), or the functions called from there, including spawner.py's stop().

kakal commented 5 years ago

Hi @consideRatio, is there any progress on this issue? I have the same problem and have to restart the hub manually. Would you please give me some suggestions for this defect?

consideRatio commented 5 years ago

Hi @kakal! I'm happy to see some activity for this issue as I think it is very important. I have not done any additional work to resolve this since. Hoping to get back to it.

consideRatio commented 5 years ago

From check_spawner, kubespawner's poll is invoked, and after it returns, check_spawner will decide to kill the evicted pod. https://github.com/jupyterhub/kubespawner/blob/01373369044f8fd9754956bb6a3b3cfeecffb939/kubespawner/spawner.py#L1458-L1494

What needs to happen is that kubespawner or the hub realizes that pods have become evicted.
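
To illustrate the idea, here is a sketch with the kubernetes Python client, not kubespawner's actual poll implementation; the namespace is an assumption. A poll-style check that treats an evicted pod as stopped could look like this:

from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_incluster_config()
v1 = client.CoreV1Api()

def poll_pod(pod_name, namespace="jh"):
    """Return None if the pod counts as running, otherwise a nonzero exit status."""
    try:
        pod = v1.read_namespaced_pod(pod_name, namespace)
    except ApiException as exc:
        if exc.status == 404:
            return 1  # pod is gone entirely
        raise
    # An evicted pod still exists, but its phase is Failed (reason Evicted),
    # so it should be reported as stopped rather than running.
    if pod.status.phase in ("Failed", "Succeeded"):
        return 1
    return None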

Questions raised

consideRatio commented 5 years ago

The pod reflector - code base investigation

The pod reflector will watch and update itself, and it will remove things from its list if it has

consideRatio commented 5 years ago

A Python 3 memory bomb for debugging

No... this failed: the process was killed instead. Hmmm... I ran into the "out of memory killer" (OOM killer).

#!/usr/bin/env python
# Allocate roughly <number-of-megabytes> MB of memory, then idle so the
# allocation stays alive and the pod's memory limit can be exceeded.

import sys
import time

if len(sys.argv) != 2:
    print("usage: fillmem <number-of-megabytes>")
    sys.exit(1)

count = int(sys.argv[1])

# A tuple of 128 * 1024 references is roughly 1 MB on 64-bit CPython.
megabyte = (0,) * (1024 * 128)

# Replicate it <count> times to hold roughly <count> MB.
data = megabyte * count

# Sleep forever so the allocation is never released.
while True:
    time.sleep(1)

image

rkdarst commented 5 years ago

For reference, I'm quite sure I'm seeing this too. I don't have anything extra to help with debugging, other than that the database still showed pod_name in the state dict when the pod wasn't running anymore. Next week I'll be able to investigate more; I can't do more this week.

consideRatio commented 5 years ago

I think what happens is that we consider the pod terminated, but it remains in a Failed state.

https://github.com/kubernetes/kubernetes/issues/54525#issuecomment-340035375

In the poll function there is this section; note the check for ctr_status. I think that is missing for pods that have entered a Failed or Succeeded state, where all containers have terminated according to the specifications of these pod phases.

https://github.com/jupyterhub/kubespawner/blob/01373369044f8fd9754956bb6a3b3cfeecffb939/kubespawner/spawner.py#L1476-L1481

If a pod is evicted due to memory pressure etc., the pod will enter a Failed state. The key difference from being culled into a Succeeded (?) state is that a pod in the Failed state will remain and not be cleaned up quickly. It will have a pod phase reason of "Evicted", I think. So if we want to solve this surgically, I think we should look for that specific scenario and decide what to do based on it. If we delete the pod we may lose some debugging information, but if we don't, we will fail to spawn new pods, as it blocks the pod's name, I think.

Perhaps we should look for this case, log stuff of relevance, and then delete it?
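
A sketch of that idea, as a hypothetical helper using the kubernetes Python client rather than existing kubespawner code; the pod name and namespace are assumptions:

from kubernetes import client, config

config.load_incluster_config()
v1 = client.CoreV1Api()

def reap_if_evicted(pod_name, namespace="jh"):
    """Log the eviction details of a leftover pod and then delete it."""
    pod = v1.read_namespaced_pod(pod_name, namespace)
    if pod.status.phase == "Failed" and pod.status.reason == "Evicted":
        # Capture the debugging information before it is lost with the pod.
        print(f"{pod_name} was evicted: {pod.status.message}")
        # Delete the pod so a new one with the same name can be spawned.
        v1.delete_namespaced_pod(pod_name, namespace)
        return True
    return False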

consideRatio commented 5 years ago

Below are three pods that remain after they have stopped. I imagine they could cause trouble by not being deleted: they remain listed in kubectl get pods, for example, and they also block creation of new user pods, since a new user pod would have the same name as the old one.

Observations

A pod quits - completed (pod phase: Succeeded)

# Created by: kubectl run ubuntu-success --image=ubuntu --restart=Never -- /bin/bash -c 'sleep 10'
#
# Inspect by: kubectl describe pod ubuntu-success
#
Name:         ubuntu-success
Namespace:    jh-dev
Priority:     0
Node:         jh-dev-control-plane/172.19.0.1
Start Time:   Thu, 10 Oct 2019 09:50:27 +0200
Labels:       run=ubuntu-success
Annotations:  <none>
Status:       Succeeded
IP:           192.168.194.144
Containers:
  ubuntu-success:
    Container ID:  containerd://f2b1ae47a6ea4a848f9e433a4ff7eafdd991c7c15a2f15724fba461c4d044959
    Image:         ubuntu
    Image ID:      docker.io/library/ubuntu@sha256:b88f8848e9a1a4e4558ba7cfc4acc5879e1d0e7ac06401409062ad2627e6fb58
    Port:          <none>
    Host Port:     <none>
    Args:
      /bin/bash
      -c
      sleep 10
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 10 Oct 2019 09:50:30 +0200
      Finished:     Thu, 10 Oct 2019 09:50:40 +0200
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-4bc55 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  default-token-4bc55:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-4bc55
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age   From                           Message
  ----    ------     ----  ----                           -------
  Normal  Scheduled  90s   default-scheduler              Successfully assigned jh-dev/ubuntu-success to jh-dev-control-plane
  Normal  Pulling    89s   kubelet, jh-dev-control-plane  Pulling image "ubuntu"
  Normal  Pulled     87s   kubelet, jh-dev-control-plane  Successfully pulled image "ubuntu"
  Normal  Created    87s   kubelet, jh-dev-control-plane  Created container ubuntu-success
  Normal  Started    87s   kubelet, jh-dev-control-plane  Started container ubuntu-success

A pod quits - errored (pod phase: Failed)

# Created by: kubectl run ubuntu --image=ubuntu --restart=Never -- /bin/bash -c 'sleep 30; time-to-error-this-is-a-bogus-command-not-to-be-found'
#
# Inspect by: kubectl describe pods ubuntu
#
Name:         ubuntu
Namespace:    jh-dev
Priority:     0
Node:         jh-dev-control-plane/172.19.0.1
Start Time:   Thu, 10 Oct 2019 09:40:54 +0200
Labels:       run=ubuntu
Annotations:  <none>
Status:       Failed
IP:           192.168.194.143
Containers:
  ubuntu:
    Container ID:  containerd://2432fa15a30413c2ca24bb7a39e9a94c542d2c09a67f71caba3d45867acd05e0
    Image:         ubuntu
    Image ID:      docker.io/library/ubuntu@sha256:b88f8848e9a1a4e4558ba7cfc4acc5879e1d0e7ac06401409062ad2627e6fb58
    Port:          <none>
    Host Port:     <none>
    Args:
      /bin/bash
      -c
      sleep 30; eeasldfj
    State:          Terminated
      Reason:       Error
      Exit Code:    127
      Started:      Thu, 10 Oct 2019 09:41:00 +0200
      Finished:     Thu, 10 Oct 2019 09:41:30 +0200
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-4bc55 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  default-token-4bc55:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-4bc55
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age    From                           Message
  ----    ------     ----   ----                           -------
  Normal  Scheduled  6m41s  default-scheduler              Successfully assigned jh-dev/ubuntu to jh-dev-control-plane
  Normal  Pulling    6m40s  kubelet, jh-dev-control-plane  Pulling image "ubuntu"
  Normal  Pulled     6m36s  kubelet, jh-dev-control-plane  Successfully pulled image "ubuntu"
  Normal  Created    6m35s  kubelet, jh-dev-control-plane  Created container ubuntu
  Normal  Started    6m35s  kubelet, jh-dev-control-plane  Started container ubuntu

A pod quits - evicted (pod phase: Failed, reason: Evicted)

Name:               proprietary-crawler-preemptive-79f7cbc6cf-rzck9
Namespace:          proprietary
Priority:           0
PriorityClassName:  <none>
Node:               gke-general-preemptible-40c94gb-pool-35dd7b4d-mv7w/
Start Time:         Fri, 26 Apr 2019 03:55:55 -0300
Labels:             app.kubernetes.io/instance=proprietary-crawler
                    app.kubernetes.io/name=proprietary-crawler
                    date=1544806882
                    pod-template-hash=3593767279
Annotations:        <none>
Status:             Failed
Reason:             Evicted
Message:            The node was low on resource: memory. Container proprietary-crawler was using 9402820Ki, which exceeds its request of 7G. 
IP:                 
Controlled By:      ReplicaSet/proprietary-crawler-preemptive-79f7cbc6cf
Containers:
  proprietary-crawler:
    Image:      <redacted>
    Port:       3333/TCP
    Host Port:  0/TCP
    Limits:
      cpu:     8
      memory:  10G
    Requests:
      cpu:      4
      memory:   7G
    Liveness:   http-get http://:3333/health/proprietary-crawler/health delay=60s timeout=1s period=3s #success=1 #failure=5
    Readiness:  http-get http://:3333/health/proprietary-crawler/health delay=60s timeout=1s period=3s #success=1 #failure=3
    Environment:
      s_env:              prod
      JAVA_TOOL_OPTIONS:  -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -Xms4g -Xmx8g
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-cwcqv (ro)
Volumes:
  default-token-cwcqv:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-cwcqv
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  environment=production
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>

kubectl get -o yaml pods

Succeeded

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2019-10-10T07:50:27Z"
  labels:
    run: ubuntu-success
  name: ubuntu-success
  namespace: jh-dev
  resourceVersion: "38440"
  selfLink: /api/v1/namespaces/jh-dev/pods/ubuntu-success
  uid: 0bc3db6d-68e4-42b2-af68-029d5108f81c
spec:
  containers:
  - args:
    - /bin/bash
    - -c
    - sleep 10
    image: ubuntu
    imagePullPolicy: Always
    name: ubuntu-success
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-4bc55
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: jh-dev-control-plane
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: default-token-4bc55
    secret:
      defaultMode: 420
      secretName: default-token-4bc55
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2019-10-10T07:50:27Z"
    reason: PodCompleted
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2019-10-10T07:50:41Z"
    reason: PodCompleted
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2019-10-10T07:50:41Z"
    reason: PodCompleted
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2019-10-10T07:50:27Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://f2b1ae47a6ea4a848f9e433a4ff7eafdd991c7c15a2f15724fba461c4d044959
    image: docker.io/library/ubuntu:latest
    imageID: docker.io/library/ubuntu@sha256:b88f8848e9a1a4e4558ba7cfc4acc5879e1d0e7ac06401409062ad2627e6fb58
    lastState: {}
    name: ubuntu-success
    ready: false
    restartCount: 0
    state:
      terminated:
        containerID: containerd://f2b1ae47a6ea4a848f9e433a4ff7eafdd991c7c15a2f15724fba461c4d044959
        exitCode: 0
        finishedAt: "2019-10-10T07:50:40Z"
        reason: Completed
        startedAt: "2019-10-10T07:50:30Z"
  hostIP: 172.19.0.1
  phase: Succeeded
  podIP: 192.168.194.144
  qosClass: BestEffort
  startTime: "2019-10-10T07:50:27Z"

Errored

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2019-10-10T07:40:54Z"
  labels:
    run: ubuntu
  name: ubuntu
  namespace: jh-dev
  resourceVersion: "37485"
  selfLink: /api/v1/namespaces/jh-dev/pods/ubuntu
  uid: 2a0cb50c-24a5-4461-9384-ace2852092f1
spec:
  containers:
  - args:
    - /bin/bash
    - -c
    - sleep 30; eeasldfj
    image: ubuntu
    imagePullPolicy: Always
    name: ubuntu
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-4bc55
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: jh-dev-control-plane
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: default-token-4bc55
    secret:
      defaultMode: 420
      secretName: default-token-4bc55
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2019-10-10T07:40:54Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2019-10-10T07:41:31Z"
    message: 'containers with unready status: [ubuntu]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2019-10-10T07:41:31Z"
    message: 'containers with unready status: [ubuntu]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2019-10-10T07:40:54Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://2432fa15a30413c2ca24bb7a39e9a94c542d2c09a67f71caba3d45867acd05e0
    image: docker.io/library/ubuntu:latest
    imageID: docker.io/library/ubuntu@sha256:b88f8848e9a1a4e4558ba7cfc4acc5879e1d0e7ac06401409062ad2627e6fb58
    lastState: {}
    name: ubuntu
    ready: false
    restartCount: 0
    state:
      terminated:
        containerID: containerd://2432fa15a30413c2ca24bb7a39e9a94c542d2c09a67f71caba3d45867acd05e0
        exitCode: 127
        finishedAt: "2019-10-10T07:41:30Z"
        reason: Error
        startedAt: "2019-10-10T07:41:00Z"
  hostIP: 172.19.0.1
  phase: Failed
  podIP: 192.168.194.143
  qosClass: BestEffort
  startTime: "2019-10-10T07:40:54Z"

Evicted

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/created-by: |
      {"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"my-pod-2887686308","uid":"2169cfab-74f5-11e6-be44-0800277ad4a8","apiVersion":"extensions","resourceVersion":"188007"}}
    kubernetes.io/limit-ranger: 'LimitRanger plugin set: cpu request for container
      my-pod'
  creationTimestamp: 2016-09-07T12:18:32Z
  generateName: my-pod-2887686308-
  labels:
    pod-template-hash: "2887686308"
    run: my-pod
  name: my-pod-2887686308-3x99a
  namespace: default
  ownerReferences:
  - apiVersion: extensions/v1beta1
    controller: true
    kind: ReplicaSet
    name: my-pod-2887686308
    uid: 2169cfab-74f5-11e6-be44-0800277ad4a8
  resourceVersion: "188038"
  selfLink: /api/v1/namespaces/default/pods/my-pod-2887686308-3x99a
  uid: 31277ddf-74f5-11e6-be44-0800277ad4a8
spec:
  containers:
  - image: nginx
    imagePullPolicy: Always
    name: my-pod
    resources:
      requests:
        cpu: 100m
    terminationMessagePath: /dev/termination-log
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-akryk
      readOnly: true
  dnsPolicy: ClusterFirst
  nodeName: kubernetes-node-1
  restartPolicy: Always
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  volumes:
  - name: default-token-akryk
    secret:
      defaultMode: 420
      secretName: default-token-akryk
status:
  message: The node was low on memory.
  phase: Failed
  reason: Evicted
  startTime: 2016-09-07T12:18:32Z

https://github.com/microsoft/pai/blob/71ad3d63ff6b37301947c05663bc61b44d71032e/prometheus/test/data/no_condtion_pod.json#L177-L182

consideRatio commented 5 years ago

Hmmm... But in the initial description of the issue, things remained in...

[I 2018-08-01 09:37:04.053 JupyterHub base:978] erik.sundell is pending stop

Which is what happens while the spawner is waiting to see the pod disappear from its resource list...

https://github.com/jupyterhub/kubespawner/blob/01373369044f8fd9754956bb6a3b3cfeecffb939/kubespawner/spawner.py#L1798-L1807

But that only happens when...

https://github.com/jupyterhub/kubespawner/blob/01373369044f8fd9754956bb6a3b3cfeecffb939/kubespawner/reflector.py#L229-L235

But hmm... Wasn't a delete call made just before this wait started?

https://github.com/jupyterhub/kubespawner/blob/01373369044f8fd9754956bb6a3b3cfeecffb939/kubespawner/spawner.py#L1780-L1797

Yes, but it returned 404 as indicated by this log trace:

[W 2018-08-01 09:36:05.688 JupyterHub spawner:1657] No pod jupyter-erik-2esundell to delete. Assuming already deleted.

Why could we end up with a 404 on a pod delete request while the pod remains in the list? Has it been deleted already, but without triggering a DELETE V1WatchEvent at any time?
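
A small debugging sketch of that scenario, as a hypothetical helper using the kubernetes Python client (names are assumptions): delete the pod, treat a 404 as "already gone", and then cross-check with a direct read to see whether a cached list would be stale.

from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_incluster_config()
v1 = client.CoreV1Api()

def delete_and_verify(pod_name, namespace="jh"):
    try:
        v1.delete_namespaced_pod(pod_name, namespace)
        print("delete accepted by the API server")
    except ApiException as exc:
        if exc.status != 404:
            raise
        print("delete returned 404 - the API server no longer knows this pod")
    # If a direct read also 404s, the pod really is gone, and any local cache
    # that still lists it is stale (it missed or never got the DELETED event).
    try:
        v1.read_namespaced_pod(pod_name, namespace)
        print("a direct read still finds the pod")
    except ApiException as exc:
        if exc.status == 404:
            print("a direct read also returns 404 - a cached list would be stale")
        else:
            raise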

consideRatio commented 5 years ago

Ah hmmm I realize that we can watch for k8s events about pods being Evicted and take action based on that... https://www.bluematador.com/blog/kubernetes-events-explained
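
A minimal sketch of that idea with the kubernetes Python client (the namespace and selector are assumptions): list the Events whose reason is Evicted and react to them.

from kubernetes import client, config

config.load_incluster_config()
v1 = client.CoreV1Api()

# Events carry reason/message and a reference to the pod they concern.
events = v1.list_namespaced_event(namespace="jh", field_selector="reason=Evicted")
for ev in events.items:
    print(ev.involved_object.kind, ev.involved_object.name, ev.message)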

tritab commented 4 years ago

We've been experiencing this issue when a user pod exceeds its resource limit (e.g. consumes >8 GB of RAM). In this scenario, k8s restarted the user's pod with an OOMKilled message, and the hub was unable to find the new user pod. This required the user to contact the admin team to manually delete their pod (we didn't try restarting the hub instead).

We have implemented a couple of updates that don't resolve the underlying issue but do improve the user experience.

  1. Set c.KubeSpawner.extra_pod_config = { "restartPolicy": "Never" }. This prevents the pod from being restarted in place, and the user is sent to the spawner screen instead. While this isn't optimal, it does remove the admin team from the resolution. (A config sketch follows after this list.)
  2. Installed https://github.com/yuvipanda/nbresuse into the notebook image(s). Initially, we had hoped to help users avoid exceeding the pod memory limit by showing a meter of their current memory usage. The extension does this well and has a nice warning feature where the meter turns red above a certain percentage (e.g. 90%). The plugin has an added bonus: it proactively kills the offending process before the pod exceeds the memory limit. We weren't expecting this, but it seems to be reducing the frequency of these incidents.
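
A sketch of item 1 as it could look in jupyterhub_config.py; the memory limit line is an added assumption based on the 8 GB figure above, not part of the original workaround.

c = get_config()  # provided by JupyterHub when it loads this file

# 1. Never restart a user pod in place; let the user respawn from the hub instead.
c.KubeSpawner.extra_pod_config = {"restartPolicy": "Never"}

# Assumed companion setting: cap pod memory so overuse is bounded (adjust to taste).
c.KubeSpawner.mem_limit = "8G"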

The above items certainly don't resolve the primary issue, but they have helped us reduce user impact.

As far as the root cause goes, would it be possible to modify the hub to capture a unique label for the pod rather than its IP address? A unique label has the upside that it doesn't change, even if the IP address changes when the pod is rescheduled by k8s.

manics commented 4 years ago

> As far as the root cause goes, would it be possible to modify the hub to capture a unique label for the pod rather than its IP address?

@tritab A similar issue was brought up in https://github.com/jupyterhub/kubespawner/pull/386 which requires a persistent pod identifier for SSL certificates. The solution in that PR was to create a service. Presumably that should also solve your problem?
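
For illustration, here is a minimal sketch (kubernetes Python client, not kubespawner's actual implementation of #386) of fronting a user pod with a Service so that traffic targets a stable DNS name instead of a pod IP. The names, label, and namespace are assumptions.

from kubernetes import client, config

config.load_incluster_config()
v1 = client.CoreV1Api()

service = client.V1Service(
    metadata=client.V1ObjectMeta(name="jupyter-erik-2esundell"),
    spec=client.V1ServiceSpec(
        # Select the user pod by a stable label rather than its IP.
        selector={"hub.jupyter.org/username": "erik.sundell"},
        ports=[client.V1ServicePort(port=8888, target_port=8888)],
    ),
)
v1.create_namespaced_service(namespace="jh", body=service)
# The proxy route could then target http://jupyter-erik-2esundell.jh:8888,
# which keeps working even if the pod is rescheduled to another node.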

meeseeksmachine commented 4 years ago

This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/weve-been-seeing-503s-occasionally-for-user-servers/3824/2

cameronraysmith commented 4 years ago

> This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:
>
> https://discourse.jupyter.org/t/weve-been-seeing-503s-occasionally-for-user-servers/3824/2

Yes, in particular, the contents of the attached log.

jamesbond-inactive-to-503-hub-interface.txt

consideRatio commented 4 years ago

I think the current status is that the hub won't require a manual restart, but it will take some time before its check_routes function has concluded that a pod is no longer functional. It would be good to have an update on the situation, or a concrete plan to make it more robust.

consideRatio commented 1 year ago

> I think the current status is that the hub won't require a manual restart, but it will take some time before its check_routes function has concluded that a pod is no longer functional.

This is still the situation, and this issue is specifically about the manual restarts that are no longer required, so let's mark it as closed.