JupyterHub/KubeSpawner lost track of started user server, pod kept running

darabos commented 1 year ago

Bug description

Our Zero-to-JupyterHub 2.0.0 instance running on GKE (1.24.5-gke.600) has failed to start an instance with this error message:

Multi-Attach error for volume "pvc-27f9f2c0-f18d-449f-a9c3-b0d909bd0ea0"
Volume is already exclusively attached to one node and can't be attached to another

The JupyterHub admin interface does not show the instance as running. But looking in GKE the pod is actually running, days after the incident. So the pod was leaked. (We actually have 4 cases of this.)

In the JupyterHub logs I see the following for this user:

[I 2022-12-12 09:15:45.005 JupyterHub roles:238] Adding role user for User: XXXXXX@mail.tau.ac.il
[I 2022-12-12 09:15:45.021 JupyterHub base:810] User logged in: XXXXXX@mail.tau.ac.il
[I 2022-12-12 09:15:45.192 JupyterHub log:186] 302 GET /hub/ -> /hub/spawn (XXXXXX@mail.tau.ac.il@10.128.0.69) 13.12ms
[I 2022-12-12 09:15:45.367 JupyterHub provider:651] Creating oauth client jupyterhub-user-XXXXXX%40mail.tau.ac.il
[I 2022-12-12 09:15:45.387 JupyterHub spawner:2509] Attempting to create pvc claim-XXXXXX-40mail-2etau-2eac-2eil, with timeout 3
[I 2022-12-12 09:15:45.389 JupyterHub log:186] 302 GET /hub/spawn -> /hub/spawn-pending/XXXXXX@mail.tau.ac.il (XXXXXX@mail.tau.ac.il@10.128.0.69) 38.38ms
[I 2022-12-12 09:15:45.423 JupyterHub spawner:2469] Attempting to create pod jupyter-XXXXXX-40mail-2etau-2eac-2eil, with timeout 3
[I 2022-12-12 09:15:45.550 JupyterHub pages:394] XXXXXX@mail.tau.ac.il is pending spawn
[I 2022-12-12 09:15:45.552 JupyterHub log:186] 200 GET /hub/spawn-pending/XXXXXX@mail.tau.ac.il (XXXXXX@mail.tau.ac.il@10.128.0.69) 4.76ms
[I 2022-12-12 09:16:24.943 JupyterHub base:963] User XXXXXX@mail.tau.ac.il took 39.588 seconds to start
[I 2022-12-12 09:16:24.943 JupyterHub proxy:333] Adding user XXXXXX@mail.tau.ac.il to proxy /user/XXXXXX@mail.tau.ac.il/ => http://10.108.1.31:8888
[I 2022-12-12 09:16:24.947 JupyterHub users:749] Server XXXXXX@mail.tau.ac.il is ready
[I 2022-12-12 09:16:24.948 JupyterHub log:186] 200 GET /hub/api/users/XXXXXX@mail.tau.ac.il/server/progress (XXXXXX@mail.tau.ac.il@10.128.0.69) 38939.51ms
[I 2022-12-12 09:16:25.114 JupyterHub log:186] 302 GET /hub/spawn-pending/XXXXXX@mail.tau.ac.il -> /user/XXXXXX@mail.tau.ac.il/ (XXXXXX@mail.tau.ac.il@10.128.0.69) 4.88ms
[I 221212 10:16:40 __init__:236] Culling server XXXXXX@mail.tau.ac.il (inactive for 01:00:14)
[I 2022-12-12 10:16:40.036 JupyterHub proxy:359] Removing user XXXXXX@mail.tau.ac.il from proxy (/user/XXXXXX@mail.tau.ac.il/)
[I 2022-12-12 10:16:40.039 JupyterHub spawner:2780] Deleting pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil
[I 2022-12-12 10:16:43.062 JupyterHub base:1170] User XXXXXX@mail.tau.ac.il server took 3.025 seconds to stop
[I 2022-12-12 10:16:43.063 JupyterHub log:186] 204 DELETE /hub/api/users/XXXXXX%40mail.tau.ac.il/server (jupyterhub-idle-culler@127.0.0.1) 3029.02ms
[I 2022-12-12 15:28:20.375 JupyterHub log:186] 302 GET /user/XXXXXX@mail.tau.ac.il/lab -> /hub/user/XXXXXX@mail.tau.ac.il/lab (@10.128.0.69) 1.01ms
[W 2022-12-12 15:28:20.593 JupyterHub log:186] 424 GET /hub/user/XXXXXX@mail.tau.ac.il/lab (XXXXXX@mail.tau.ac.il@10.128.0.69) 14.60ms
[I 2022-12-12 15:28:23.322 JupyterHub provider:651] Creating oauth client jupyterhub-user-XXXXXX%40mail.tau.ac.il
[I 2022-12-12 15:28:23.346 JupyterHub spawner:2509] Attempting to create pvc claim-XXXXXX-40mail-2etau-2eac-2eil, with timeout 3
[I 2022-12-12 15:28:23.349 JupyterHub log:186] 302 GET /hub/spawn/XXXXXX@mail.tau.ac.il?next=%2Fhub%2Fuser%2FXXXXXX%40mail.tau.ac.il%2Flab -> /hub/spawn-pending/XXXXXX@mail.tau.ac.il?next=%2Fhub%2Fuser%2FXXXXXX%40mail.tau.ac.il%2Flab (XXXXXX@mail.tau.ac.il@10.128.0.69) 46.38ms
[I 2022-12-12 15:28:23.363 JupyterHub spawner:2525] PVC claim-XXXXXX-40mail-2etau-2eac-2eil already exists, so did not create new pvc.
[I 2022-12-12 15:28:23.370 JupyterHub spawner:2469] Attempting to create pod jupyter-XXXXXX-40mail-2etau-2eac-2eil, with timeout 3
[I 2022-12-12 15:28:23.553 JupyterHub pages:394] XXXXXX@mail.tau.ac.il is pending spawn
[I 2022-12-12 15:28:23.556 JupyterHub log:186] 200 GET /hub/spawn-pending/XXXXXX@mail.tau.ac.il?next=%2Fhub%2Fuser%2FXXXXXX%40mail.tau.ac.il%2Flab (XXXXXX@mail.tau.ac.il@10.128.0.69) 9.76ms
[I 2022-12-12 15:31:36.825 JupyterHub log:186] 302 GET /hub/ -> /user/XXXXXX@mail.tau.ac.il/ (XXXXXX@mail.tau.ac.il@10.128.0.69) 8.11ms
[I 2022-12-12 15:31:39.064 JupyterHub log:186] 302 GET /user/XXXXXX@mail.tau.ac.il/files/Go%20to%20LynxKite.html -> /hub/user/XXXXXX@mail.tau.ac.il/files/Go%20to%20LynxKite.html (@10.128.0.69) 1.08ms
[I 2022-12-12 15:31:39.279 JupyterHub log:186] 303 GET /hub/user/XXXXXX@mail.tau.ac.il/files/Go%20to%20LynxKite.html (XXXXXX@mail.tau.ac.il@10.128.0.69) 4.71ms
[I 2022-12-12 15:31:39.502 JupyterHub pages:394] XXXXXX@mail.tau.ac.il is pending spawn
[I 2022-12-12 15:31:39.505 JupyterHub log:186] 200 GET /hub/spawn-pending/XXXXXX@mail.tau.ac.il?next=%2Fhub%2Fuser%2FXXXXXX%40mail.tau.ac.il%2Ffiles%2FGo%2520to%2520LynxKite.html (XXXXXX@mail.tau.ac.il@10.128.0.69) 7.97ms
[W 2022-12-12 15:33:23.346 JupyterHub user:824] XXXXXX@mail.tau.ac.il's server failed to start in 300 seconds, giving up.
[I 2022-12-12 15:33:23.347 JupyterHub spawner:2780] Deleting pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil
[I 2022-12-12 15:33:26.518 JupyterHub spawner:2780] Deleting pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil
[I 2022-12-12 15:33:29.865 JupyterHub spawner:2780] Deleting pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil
[I 2022-12-12 15:33:33.638 JupyterHub spawner:2780] Deleting pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil
[I 2022-12-12 15:33:37.538 JupyterHub spawner:2780] Deleting pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil
[I 2022-12-12 15:33:41.187 JupyterHub spawner:2780] Deleting pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil
[E 2022-12-12 15:33:41.229 JupyterHub gen:630] Exception in Future <Task finished name='Task-6089322' coro=<KubeSpawner._start() done, defined at /usr/local/lib/python3.9/site-packages/kubespawner/spawner.py:2621> exception=TimeoutError('pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil did not start in 300 seconds!')> after timeout
    asyncio.exceptions.TimeoutError: pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil did not start in 300 seconds!
[I 2022-12-12 15:33:44.551 JupyterHub spawner:2780] Deleting pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil
[I 2022-12-12 15:33:44.568 JupyterHub pages:394] XXXXXX@mail.tau.ac.il is pending stop
[I 2022-12-12 15:33:44.570 JupyterHub log:186] 200 GET /hub/spawn-pending/XXXXXX@mail.tau.ac.il?next=%2Fhub%2Fuser%2FXXXXXX%40mail.tau.ac.il%2Flab (XXXXXX@mail.tau.ac.il@10.128.0.69) 11.14ms
[E 2022-12-12 15:34:50.986 JupyterHub app:2461] Failed to poll spawner for XXXXXX@mail.tau.ac.il, assuming the spawner is not running.
[W 2022-12-12 15:34:50.987 JupyterHub app:2512] XXXXXX@mail.tau.ac.il appears to have stopped while the Hub was down
[W 2022-12-12 15:41:49.964 JupyterHub user:881] XXXXXX@mail.tau.ac.il's server never showed up at http://hub-65fd484dc-cbx7j:60937/user/XXXXXX@mail.tau.ac.il/ after 300 seconds. Giving up.
[W 2022-12-12 15:41:49.965 JupyterHub spawner:2258] Pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil url changed! hub-65fd484dc-cbx7j:60937 -> 10.108.20.3:8888
[I 2022-12-12 15:41:49.974 JupyterHub spawner:2780] Deleting pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil
[I 2022-12-12 15:41:53.017 JupyterHub spawner:2780] Deleting pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil
[I 2022-12-12 15:41:56.367 JupyterHub spawner:2780] Deleting pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil
[I 2022-12-12 15:42:00.034 JupyterHub spawner:2780] Deleting pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil
[I 2022-12-12 15:42:03.956 JupyterHub spawner:2780] Deleting pod jh/jupyter-XXXXXX-40mail-2etau-2eac-2eil

(I've replaced the user name with XXXXXX. Sorry for the long block.)

The user's instance first started normally at 9:15. It was shut down for inactivity at 10:16. At 15:28 the user came back and JupyterHub tried to start up the instance again. Startup failed and JupyterHub tried to delete the instance at 15:33. It keeps trying for 10 minutes, but halfway decides that it "appears to have stopped while the Hub was down".

The Hub was down? Yes, looks like it went down at 15:33:

[C 2022-12-12 15:33:43.937 JupyterHub base:1037] Aborting due to 5 consecutive spawn failures

My guess is that restart glitched the cleanup process and the instance was left running. The next time the user tried it they got the "Volume is already exclusively attached" because the volume was attached to the leaked instance.

Expected behaviour

The startup failure may be my fault or GKE had a slow day. But JupyterHub shouldn't leave pods running.

How to reproduce

Seems very difficult. I have saved the JupyterHub logs from the incident. It includes some personal email addresses but I can share it privately.

Your personal set up

It's a normal Zero-to-JupyterHub on GKE.

# config.yaml
proxy:
  secretToken: ...
  https:
    enabled: true
    hosts: ...
    letsencrypt: ...
hub:
  config:
    JupyterHub:
      authenticator_class: google
    GoogleOAuthenticator: ...
  extraConfig:
    extra: |
      c.KubeSpawner.http_timeout = 300
singleuser: (a custom image)

Thanks for Zero-to-JupyterHub! It's the best!

welcome[bot] commented 1 year ago

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other other community members to contribute more effectively. welcome You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

welcome[bot] commented 1 year ago

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other other community members to contribute more effectively. welcome You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

jabbera commented 7 months ago

This was fixed here: https://github.com/jupyterhub/kubespawner/pull/742