jupyterhub / zero-to-jupyterhub-k8s

Helm Chart & Documentation for deploying JupyterHub on Kubernetes
https://zero-to-jupyterhub.readthedocs.io
Other
1.56k stars 799 forks source link

Z2JH 4.0.0 fails to detect old named server PVCs due to an overriden template in 3.3.8 #3574

Open manics opened 6 days ago

manics commented 6 days ago

Bug description

Reported in https://discourse.jupyter.org/t/jupyterhub-helm-chart-4-0-0/29887/2

How to reproduce

  1. Deploy Z2JH 3.3.8 with persistent user storage
  2. Start and stop a named server
  3. Upgrade Z2JH to 4.0.0
  4. Start the named server, a new PVC is created

Expected behaviour

Z2JH should use the existing named PVC

Actual behaviour

3.3.8 set pvcNameTemplate: claim-{username}{servername}, overriding the KubeSpawner 6.2.0 default of claim-{username}--{servername} https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/3.3.8/jupyterhub/values.yaml#L395 https://github.com/jupyterhub/kubespawner/blob/6.2.0/kubespawner/spawner.py#L569

4.0.0 uses the default from KubeSpawner https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/4.0.0/jupyterhub/values.yaml#L416

As a result the automatic legacy PVC detection in KubeSpawner fails, since it assumes the 6.2.0 template is used https://github.com/jupyterhub/kubespawner/blob/a8f28439078e42e8012de8f141b51bd6fa96d9c7/kubespawner/spawner.py#L3118

[D 2024-11-15 14:05:10.037 JupyterHub spawner:3117] Checking for legacy-named pvc claim-test-40example-2eorg--named for test@example.org

Instead of claim-test-40example-2eorgnamed

minrk commented 2 days ago

Short-term, a deployment can work around this by

  1. explicitly setting the old template in config
  2. starting and stopping all servers, which persists the pvc name in spawner state
  3. removing the old template config; since PVC names are persisted now, new users will use the new template while existing users will keep their PVCs

alternatively, we could write a bulk-rename script to rename PVCs.

aychang95 commented 1 day ago

@minrk A bulk-rename script would work for our case. Could also support downgrading if that's within the scope. May have to deal with servers that now have two PVCs associated with them

minrk commented 11 hours ago

yes, servers that have upgraded and associated with new pvcs are going to be the hardest because we can't know if there's any data in the new pvcs that needs to be preserved, so the only safe option is to merge their contents, which we can't do automatically. But we could at least:

  1. safely rename every old pvc, and
  2. if any pvc was created under the new name would result in a collision, rename the new one identifiably so that they can be manually inspected for merge

or perhaps the simplest is a script that iterates over existing PVCs and servers and pushes the pvc name into Spawner.state (i.e. effectively the same thing as preserving the old template, starting/stopping servers, then updating the template).

minrk commented 11 hours ago

Here is a script that does the last suggestion: https://gist.github.com/minrk/7ad21b18f7a5b3908d74f108dc564cd5

it collects all pvcs and spawners, then associates spawners with their pvcs.

  1. if there's only one match, persist the link in spawner.state (avoids the above bug, as if past versions of kubespawners implemented pvc_name persistence)
  2. if there's more than one, it picks the oldest match (this is what will happen for deployments that have already been upgraded and the template mismatch caused the creation of a new, empty pvc)

it doesn't touch the pvcs themselves, just the association to the servers. Any dealing with removing new pvcs / merging content is something that has to be taken on a case-by-case basis.

manics commented 10 hours ago

Do you think we should put your script into an optional initContainer in this chart? Perhaps with a flag to control how to resolve multiple matching PVCs (oldest, newest, or return an error to force the admin to sort things out before starting JupyterHub).

minrk commented 9 hours ago

Yes, possibly. I'm also investigating whether this belongs in the legacy pvc check in kubespawner, because I think we can change that to consider labels and annotations as well, so it's not sensitive to template changes across the upgrade.

Either way, if we run this automatically for users, we will need to be careful to consider valid use cases of deliberate sharing of PVCs across servers, e.g.

  1. one pvc per user for all servers: pvc_name_template = '{username}'
  2. shared pvc across users for a given named server: pvc_name_template = '{servername}'
  3. one pvc for everyone (?!): pvc_name_template = 'shared'
  4. manually created PVCs out-of-band, may be missing labels and/or annotations which were not checked previously

whereas it is appropriate for the script above to only handle the default template case, since it is only run by hand, so users have a chance to make a decision whether it is appropriate before it runs.