jupyterhub / zero-to-jupyterhub-k8s

Helm Chart & Documentation for deploying JupyterHub on Kubernetes
https://zero-to-jupyterhub.readthedocs.io

Helm chart 4.0.0-beta.2 cannot find existing pvc for user home dir. #3536

Closed dragz closed 1 month ago

dragz commented 1 month ago

Bug description

Just ran an upgrade from version 3.3.7 and found that spawning fails because it cannot find the user's home directory PVC.

How to reproduce

  1. Set up a cluster with 3.3.7 and get some users into it.
  2. Upgrade to Helm chart 4.0.0-beta.2
  3. Log in as a user
  4. Spawn fails with this message:
Spawn failed: pod jupyter/jupyter-abc000-uit-no---1ea1e26d did not start in 300 seconds!

Event log
Server requested
2024-10-03T10:50:38.416759Z [Warning] 0/1 nodes are available: persistentvolumeclaim "claim-abc000-uit-no---1ea1e26d" not found. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
2024-10-03T10:50:47Z [Normal] pod didn't trigger scale-up: 2 persistentvolumeclaim "claim-abc000-uit-no---1ea1e26d" not found
Spawn failed: pod jupyter/jupyter-abc000-uit-no---1ea1e26d did not start in 300 seconds!

Expected behaviour

It should look for a PVC named claim-abc000-40uit-2eno, which was created before the upgrade.

Actual behaviour

It tries to find a PVC named claim-abc000-uit-no---1ea1e26d, which does not exist.
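
For context on the two names: the pre-upgrade PVC name comes from the old kubespawner escaping, which hex-encodes every character outside lowercase letters and digits, while kubespawner 7 derives a readable slug plus a short hash suffix. Below is a minimal sketch of the legacy escaping only, written for illustration under that assumption; it is not kubespawner's actual code.

```python
# Minimal sketch of the legacy PVC-name escaping (illustrative assumption,
# not kubespawner's actual implementation): keep [a-z0-9], hex-escape the rest.
import string

SAFE_CHARS = set(string.ascii_lowercase + string.digits)

def legacy_escape(name: str) -> str:
    """Escape a username the way the pre-upgrade PVC names in this report look."""
    return "".join(c if c in SAFE_CHARS else f"-{ord(c):x}" for c in name)

print("claim-" + legacy_escape("abc000@uit.no"))
# -> claim-abc000-40uit-2eno   (the PVC that actually exists in the cluster)
# kubespawner 7 instead builds a readable slug with a hash suffix,
# e.g. claim-abc000-uit-no---1ea1e26d, which is the name the spawn looked for.
```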

Your personal set up

A rather straightforward setup on Azure Kubernetes Service using zero-to-jupyterhub with autoscaling and Entra ID for authentication.

Configuration

```python
# jupyterhub_config.py
```
Logs

```
Loading /usr/local/etc/jupyterhub/secret/values.yaml
No config at /usr/local/etc/jupyterhub/existing-secret/values.yaml
Loading extra config: checkvolumes
[I 2024-10-03 10:50:13.022 JupyterHub app:3352] Running JupyterHub version 5.2.0
[I 2024-10-03 10:50:13.023 JupyterHub app:3382] Using Authenticator: oauthenticator.azuread.AzureAdOAuthenticator-17.0.0
[I 2024-10-03 10:50:13.023 JupyterHub app:3382] Using Spawner: kubespawner.spawner.KubeSpawner-7.0.0b2
[I 2024-10-03 10:50:13.023 JupyterHub app:3382] Using Proxy: jupyterhub.proxy.ConfigurableHTTPProxy-5.2.0
[I 2024-10-03 10:50:13.052 JupyterHub dbutil:129] Upgrading sqlite:///jupyterhub.sqlite
[I 2024-10-03 10:50:13.052 JupyterHub dbutil:100] Backing up jupyterhub.sqlite => jupyterhub.sqlite.2024-10-03-105013
[I 2024-10-03 10:50:14.117 alembic.runtime.migration migration:215] Context impl SQLiteImpl.
[I 2024-10-03 10:50:14.117 alembic.runtime.migration migration:218] Will assume non-transactional DDL.
[I 2024-10-03 10:50:14.127 alembic.runtime.migration migration:623] Running upgrade 0eee8c825d24 -> 3c2384c5aae1, Add from_config column to the services table
[I 2024-10-03 10:50:14.174 alembic.runtime.migration migration:623] Running upgrade 3c2384c5aae1 -> 4621fec11365, manage roles
[I 2024-10-03 10:50:14.586 JupyterHub roles:210] Role attribute admin.scopes has been changed
[I 2024-10-03 10:50:14.696 JupyterHub app:2925] Creating service jupyterhub-idle-culler without oauth.
[I 2024-10-03 10:50:14.738 JupyterHub roles:281] Adding role admin for User: abc000@uit.no
[I 2024-10-03 10:50:14.786 JupyterHub app:3422] Initialized 0 spawners in 0.014 seconds
[I 2024-10-03 10:50:14.793 JupyterHub metrics:373] Found 0 active users in the last ActiveUserPeriods.twenty_four_hours
[I 2024-10-03 10:50:14.794 JupyterHub metrics:373] Found 0 active users in the last ActiveUserPeriods.seven_days
[I 2024-10-03 10:50:14.795 JupyterHub metrics:373] Found 1 active users in the last ActiveUserPeriods.thirty_days
[I 2024-10-03 10:50:14.795 JupyterHub app:3702] Not starting proxy
[I 2024-10-03 10:50:14.804 JupyterHub app:3738] Hub API listening on http://:8081/hub/
[I 2024-10-03 10:50:14.804 JupyterHub app:3740] Private Hub API connect url http://hub:8081/hub/
[I 2024-10-03 10:50:14.805 JupyterHub app:3621] Starting managed service jupyterhub-idle-culler
[I 2024-10-03 10:50:14.805 JupyterHub service:423] Starting service 'jupyterhub-idle-culler': ['python3', '-m', 'jupyterhub_idle_culler', '--url=http://localhost:8081/hub/api', '--timeout=3600', '--cull-every=600', '--concurrency=10']
[I 2024-10-03 10:50:14.807 JupyterHub service:136] Spawning python3 -m jupyterhub_idle_culler --url=http://localhost:8081/hub/api --timeout=3600 --cull-every=600 --concurrency=10
[I 2024-10-03 10:50:14.810 JupyterHub proxy:477] Adding route for Hub: / => http://hub:8081
[I 2024-10-03 10:50:14.817 JupyterHub app:3781] JupyterHub is now running, internal Hub API at http://hub:8081/hub/
[I 2024-10-03 10:50:15.166 JupyterHub log:192] 200 GET /hub/api/ (jupyterhub-idle-culler@::1) 27.87ms
[I 2024-10-03 10:50:15.188 JupyterHub log:192] 200 GET /hub/api/users?state=[secret] (jupyterhub-idle-culler@::1) 18.79ms
[I 2024-10-03 10:50:26.307 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.87ms
[I 2024-10-03 10:50:28.890 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.81ms
[I 2024-10-03 10:50:30.204 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 1.00ms
[W 2024-10-03 10:50:30.305 JupyterHub spawner:2546] Loading state for abc000@uit.no/ from unknown prior version of kubespawner (likely 6.x), will attempt to upgrade.
[I 2024-10-03 10:50:30.306 JupyterHub log:192] 302 GET /hub/ -> /hub/spawn (abc000@uit.no@::ffff:10.244.1.1) 40.00ms
[I 2024-10-03 10:50:30.413 JupyterHub _xsrf_utils:125] Setting new xsrf cookie for b'KVMkVzPaApnZAJdX5_Z_TUdiGiiyYJiT3q211Kuclgw=:cf609962c22f43ed8cb6ad6bafc27d0c' {'path': '/hub/'}
[I 2024-10-03 10:50:30.469 JupyterHub log:192] 200 GET /hub/spawn (abc000@uit.no@::ffff:10.244.1.1) 87.21ms
[I 2024-10-03 10:50:38.131 JupyterHub provider:661] Creating oauth client jupyterhub-user-abc000%40uit.no
[I 2024-10-03 10:50:38.196 JupyterHub :7] check_pvcs: Doing a PVC check...
[I 2024-10-03 10:50:38.198 JupyterHub log:192] 302 POST /hub/spawn?_xsrf=[secret] -> /hub/spawn-pending/abc000@uit.no?_xsrf=[secret] (abc000@uit.no@::ffff:10.244.1.1) 254.35ms
[I 2024-10-03 10:50:38.248 JupyterHub :11] check_pvcs: existing_pvc-names ['claim-bto004-40uit-2eno', 'claim-rba004-40uit-2eno', 'claim-abc000-40uit-2eno', 'hub-db-dir', 'private-datadir-abc000-40uit-2eno']
[I 2024-10-03 10:50:38.249 JupyterHub :16] check_pvcs: claimname claim-abc000-uit-no---1ea1e26d
[I 2024-10-03 10:50:38.249 JupyterHub :17] check_pvcs: volume {'name': 'volume-{username}{servername}', 'persistentVolumeClaim': {'claimName': 'claim-{username}{servername}'}}
[I 2024-10-03 10:50:38.250 JupyterHub :16] check_pvcs: claimname private-datadir-abc000-uit-no---1ea1e26d
[I 2024-10-03 10:50:38.250 JupyterHub :17] check_pvcs: volume {'name': 'user-datadir', 'persistentVolumeClaim': {'claimName': 'private-datadir-{username}'}}
[I 2024-10-03 10:50:38.250 JupyterHub :20] check_pvcs: Removing nonexistant PVC {'name': 'user-datadir', 'persistentVolumeClaim': {'claimName': 'private-datadir-{username}'}}
[I 2024-10-03 10:50:38.251 JupyterHub :29] check_pvcs: Nonexisting PVC, removing mountpath {'mountPath': '/home/datadir', 'name': 'user-datadir'}
[I 2024-10-03 10:50:38.263 JupyterHub pages:397] abc000@uit.no is pending spawn
[I 2024-10-03 10:50:38.268 JupyterHub log:192] 200 GET /hub/spawn-pending/abc000@uit.no?_xsrf=[secret] (abc000@uit.no@::ffff:10.244.1.1) 11.19ms
[I 2024-10-03 10:50:38.270 JupyterHub reflector:297] watching for pods with label selector='component=singleuser-server' in namespace jupyter
[I 2024-10-03 10:50:38.304 JupyterHub reflector:297] watching for events with field selector='involvedObject.kind=Pod' in namespace jupyter
[W 2024-10-03 10:50:38.346 JupyterHub spawner:3138] Using legacy pvc claim-abc000-40uit-2eno for abc000@uit.no
[I 2024-10-03 10:50:38.347 JupyterHub spawner:2940] Attempting to create pvc claim-abc000-40uit-2eno, with timeout 3
[I 2024-10-03 10:50:38.367 JupyterHub spawner:2956] PVC claim-abc000-40uit-2eno already exists, so did not create new pvc.
[I 2024-10-03 10:50:38.370 JupyterHub spawner:2899] Attempting to create pod jupyter-abc000-uit-no---1ea1e26d, with timeout 3
[I 2024-10-03 10:50:56.307 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.88ms
[I 2024-10-03 10:50:58.892 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.87ms
[I 2024-10-03 10:51:26.306 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.70ms
[I 2024-10-03 10:51:28.891 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.75ms
[I 2024-10-03 10:51:56.313 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.80ms
[I 2024-10-03 10:51:58.892 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 1.12ms
[I 2024-10-03 10:52:26.311 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.72ms
[I 2024-10-03 10:52:28.890 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.71ms
[I 2024-10-03 10:52:56.311 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.91ms
[I 2024-10-03 10:52:58.897 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.85ms
[I 2024-10-03 10:53:26.314 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.91ms
[I 2024-10-03 10:53:28.891 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.81ms
[I 2024-10-03 10:53:56.316 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.84ms
[I 2024-10-03 10:53:58.893 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.75ms
[I 2024-10-03 10:54:26.319 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.95ms
[I 2024-10-03 10:54:28.894 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.75ms
[I 2024-10-03 10:54:56.318 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.74ms
[I 2024-10-03 10:54:58.903 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.82ms
[W 2024-10-03 10:55:25.628 JupyterHub user:999] abc000@uit.no's server failed to start in 300 seconds, giving up.
    Common causes of this timeout, and debugging tips:
    1. Everything is working, but it took too long.
       To fix: increase `Spawner.start_timeout` configuration to a number of seconds that is enough for spawners to finish starting.
    2. The server didn't finish starting, or it crashed due to a configuration issue.
       Check the single-user server's logs for hints at what needs fixing.
[I 2024-10-03 10:55:25.628 JupyterHub spawner:3291] Deleting pod jupyter/jupyter-abc000-uit-no---1ea1e26d
[E 2024-10-03 10:55:25.768 JupyterHub gen:629] Exception in Future .finish_user_spawn() done, defined at /usr/local/lib/python3.12/site-packages/jupyterhub/handlers/base.py:1115> exception=TimeoutError('pod jupyter/jupyter-abc000-uit-no---1ea1e26d did not start in 300 seconds!')> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.12/site-packages/tornado/gen.py", line 624, in error_callback
        future.result()
      File "/usr/local/lib/python3.12/site-packages/jupyterhub/handlers/base.py", line 1122, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.12/site-packages/jupyterhub/user.py", line 1021, in spawn
        raise e
      File "/usr/local/lib/python3.12/site-packages/jupyterhub/user.py", line 920, in spawn
        await asyncio.wait_for(f, timeout=spawner.start_timeout)
      File "/usr/local/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
        return await fut
               ^^^^^^^^^
      File "/usr/local/lib/python3.12/site-packages/kubespawner/spawner.py", line 3243, in _start
        await exponential_backoff(
      File "/usr/local/lib/python3.12/site-packages/jupyterhub/utils.py", line 265, in exponential_backoff
        raise asyncio.TimeoutError(fail_message)
    TimeoutError: pod jupyter/jupyter-abc000-uit-no---1ea1e26d did not start in 300 seconds!
[I 2024-10-03 10:55:25.775 JupyterHub log:192] 200 GET /hub/api/users/abc000@uit.no/server/progress?_xsrf=[secret] (abc000@uit.no@::ffff:10.244.1.1) 287317.52ms
[I 2024-10-03 10:55:26.320 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.76ms
[I 2024-10-03 10:55:28.906 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.72ms
[I 2024-10-03 10:55:56.328 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.77ms
```
consideRatio commented 1 month ago

Thank you @dragz for trying out the beta release and reporting this back!!!

The key parts of this report are that kubespawner 7.0.0b2 in z2jh 4.0.0-beta.2 resulted in...

```
[W 2024-10-03 10:50:38.346 JupyterHub spawner:3138] Using legacy pvc claim-abc000-40uit-2eno for abc000@uit.no
[I 2024-10-03 10:50:38.347 JupyterHub spawner:2940] Attempting to create pvc claim-abc000-40uit-2eno, with timeout 3
[I 2024-10-03 10:50:38.367 JupyterHub spawner:2956] PVC claim-abc000-40uit-2eno already exists, so did not create new pvc.
2024-10-03T10:50:38.416759Z [Warning] 0/1 nodes are available: persistentvolumeclaim "claim-abc000-uit-no---1ea1e26d" not found.
```

@minrk, it seems from this that kubespawner was smart enough to conclude it should re-use the old PVC, but at the same time it failed to actually use it.
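
In other words, the spawner appears to prefer the new-style name but fall back to an already existing legacy-named PVC. A hedged illustration of that fallback idea, not kubespawner's actual code:

```python
# Hedged illustration of the legacy-PVC fallback idea (not kubespawner's code):
# prefer the new-style name, but re-use an existing legacy-named PVC so that
# pre-upgrade home directories keep working.
def pick_pvc_name(new_name: str, legacy_name: str, existing_pvcs: set[str]) -> str:
    if new_name not in existing_pvcs and legacy_name in existing_pvcs:
        return legacy_name
    return new_name

existing = {"claim-abc000-40uit-2eno", "hub-db-dir"}
print(pick_pvc_name("claim-abc000-uit-no---1ea1e26d", "claim-abc000-40uit-2eno", existing))
# -> claim-abc000-40uit-2eno, matching the "Using legacy pvc" log line above
```

The scheduling warning above shows the pod's volume still referenced the new-style claim name, which is pinned down below.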

consideRatio commented 1 month ago

Note, btw @minrk, that the default values of z2jh include the following, so we have explicit chart defaults overriding the kubespawner defaults.

```yaml
  storage:
    type: dynamic
    extraLabels: {}
    extraVolumes: []
    extraVolumeMounts: []
    static:
      pvcName:
      subPath: "{username}"
    capacity: 10Gi
    homeMountPath: /home/jovyan
    dynamic:
      storageClass:
      pvcNameTemplate: claim-{username}{servername}
      volumeNameTemplate: volume-{username}{servername}
      storageAccessModes: [ReadWriteOnce]
      subPath:
```
minrk commented 1 month ago

pvcNameTemplate should probably be removed, since that's already the default. volumeNameTemplate doesn't match the kubespawner default, so it should still be defined. But where we have {username}{servername}, we should replace it with {user_server} (which is identical in escape scheme).
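
Expressed as the KubeSpawner configuration the chart ultimately renders (a hedged sketch; the chart actually sets this through its singleuser.storage values, and the exact wiring may differ):

```python
# Hedged sketch of the suggested template change, phrased as KubeSpawner config
# in a jupyterhub_config.py context.
c = get_config()  # provided by JupyterHub when loading jupyterhub_config.py

# "claim-{user_server}" is the kubespawner 7 default, so pvcNameTemplate
# could simply be dropped from the chart's default values.
c.KubeSpawner.pvc_name_template = "claim-{user_server}"

# volumeNameTemplate does not match a kubespawner default, so it stays,
# but with {user_server} in place of {username}{servername}.
```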

I'll investigate the tests to see why they don't cover this case; I thought they did.

minrk commented 1 month ago

(unless pvcNameTemplate is consumed in the chart, which I suspect it actually is. In that case, it should stay and be updated to {user_server})

minrk commented 1 month ago

I think I understand it now: the chart defines the volume-to-PVC mount itself, so even though the PVC is correct, the volume definition points to a PVC name that doesn't exist. Fortunately, the updated templates address this, because there is a {pvc_name} field that resolves directly to spawner.pvc_name, however that may be resolved.
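
A hedged sketch of what that fix amounts to, again phrased as jupyterhub_config.py-style KubeSpawner config (the chart's real template may structure this differently): with {pvc_name} as the claimName, the pod's volume always follows whatever spawner.pvc_name resolved to, including a re-used legacy PVC.

```python
# Hedged sketch, not the chart's actual template: {pvc_name} expands to
# spawner.pvc_name, so the volume stays in sync even when a legacy PVC
# such as claim-abc000-40uit-2eno is re-used.
c = get_config()  # provided by JupyterHub when loading jupyterhub_config.py

c.KubeSpawner.volumes = [
    {
        "name": "volume-{user_server}",
        "persistentVolumeClaim": {"claimName": "{pvc_name}"},
    }
]
c.KubeSpawner.volume_mounts = [
    {"name": "volume-{user_server}", "mountPath": "/home/jovyan"},
]
```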

consideRatio commented 1 month ago

Thank you @minrk!! @dragz, could you check whether this resolves things for you by using the latest development release of the chart?

It will be listed soon at https://hub.jupyter.org/helm-chart/#development-releases-jupyterhub

It is version 4.0.0-beta.2.git.6792.hc3e818e9

dragz commented 1 month ago

Yes, this fixed the issue. Home dir is now found and mounted correctly.

(I'm impressed with the speed here)

consideRatio commented 1 month ago

Thank you for testing, reporting, and verifying the attempted fix @dragz!!

4.0.0-beta.3 has been released with this fix, btw.