1kastner opened this issue 4 years ago
Together with https://github.com/jupyterhub/jupyterhub-idle-culler/issues/4 it almost sounds like we want a jupyter-admin cron utility. Perhaps with plugins for each function?
For me that sounds like a reasonable approach. General plugins could be maintained in separate repositories to keep this clean. I hope such a plugin could still somehow reach the JupyterHub configuration, so that it can look up details such as which user "owns" which Docker volume that should be deleted.
Regarding #4, in some cases it could make sense to link the plugin with the data source of the authenticator, since that is where e.g. email addresses are also stored. But again, this might be very configuration-specific. An alternative would be a separate plugin configuration file. From what I have seen so far, I suspect that would violate the design principle of one centralized JupyterHub configuration, though? Here I lack experience with the JupyterHub design philosophy.
This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:
https://discourse.jupyter.org/t/a-cull-idle-user-service-that-deletes-pvs/4742/10
At https://jupyterhub.readthedocs.io/en/stable/reference/services.html I checked that a service cannot access the loaded JupyterHub configuration. I see three options, and I don't like any of them:
1. Re-load the JupyterHub configuration inside this service to get the information regarding user-specific persisted data (rough sketch below).
2. Maintain separate cleaning scripts, which leads to duplicated configuration (what is the user's prefix? ...).
3. Refactor JupyterHub so that services can access the configuration, removing the previously constructed isolation.
Any ideas on this?
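For illustration, option 1 could look roughly like this; the config path is an assumption, and loading the full configuration file may pull in imports that the service environment does not have:

```python
# Rough sketch of option 1: re-load the hub's own configuration file inside
# the cleaning service to recover the volume naming scheme.
from traitlets.config.loader import PyFileConfigLoader

# Path to the hub configuration is deployment-specific (assumption).
loader = PyFileConfigLoader("jupyterhub_config.py", path="/srv/jupyterhub")
config = loader.load_config()

# e.g. {'jupyterhub-user-{username}': '/home/jovyan/'}
volume_template = config.DockerSpawner.volumes
print(volume_template)
```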
The JupyterHub API lets you obtain, for each of a user's servers, a state object: "Arbitrary internal state from this server's spawner. Only available on the hub's users list or get-user-by-name method, and only if a hub admin. None otherwise."
https://jupyterhub.readthedocs.io/en/stable/_static/rest-api/index.html#/definitions/Server
There's also a discussion about making auth_state from the Authenticator accessible too:
https://github.com/jupyterhub/jupyterhub/issues/1704
Do you think the combination of these would allow a service to request the necessary information?
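As a rough sketch (the hub API URL and token below are placeholders), a service holding an admin token could read that state like this:

```python
# Sketch of a service querying the hub's REST API for per-server spawner state.
import requests

api_url = "http://127.0.0.1:8081/hub/api"   # hub API URL (placeholder)
token = "admin-service-token"                # the service's JUPYTERHUB_API_TOKEN

r = requests.get(f"{api_url}/users",
                 headers={"Authorization": f"token {token}"})
r.raise_for_status()

for user in r.json():
    for server in user.get("servers", {}).values():
        # 'state' is only included when the request is made with admin rights
        print(user["name"], server.get("state"))
```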
In my example configuration, the important part regarding the Docker volumes looks like this:
```python
notebook_dir = '/home/jovyan/'
c.DockerSpawner.notebook_dir = notebook_dir
c.DockerSpawner.volumes = {'jupyterhub-user-{username}': notebook_dir}
```
The data stored in c.DockerSpawner.volumes would be sufficient for my purpose of deleting Docker volumes now, but I can't speak for k8s configurations. For that I would need a partner on that side.
So let's see if I got you right here: you suggest that the DockerSpawner reports via the server state which Docker volume belongs to it (in my case jupyterhub-user-{username}), and this (already serialized) information is added to the auth_state but is only visible to admins (which the JupyterHub service is). Once the JupyterHub service has the information jupyterhub-user-user0, it can also run docker volume rm jupyterhub-user-user0 and everything is fine and solved! So yes, that is a viable solution.
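In code, that removal step might look roughly like this (untested sketch using the Docker SDK for Python; the volume name is taken from the server state as discussed above):

```python
# Sketch: remove a user's Docker volume once the service knows its name.
import docker

def remove_user_volume(volume_name):
    """Delete the named Docker volume, e.g. 'jupyterhub-user-user0'."""
    client = docker.from_env()
    try:
        volume = client.volumes.get(volume_name)
    except docker.errors.NotFound:
        return  # nothing left to clean up
    volume.remove()
```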
Something like that! But this is outside my knowledge of JupyterHub, @minrk will know better.
I do something similar to this (if I understand correctly) to set unique cull times:
https://github.com/AaltoSciComp/jupyterhub-aalto/blob/8bb8c3f0d538641141c5024c272245f943747fd2/scripts/cull_idle_servers.py#L180 https://github.com/AaltoSciComp/jupyterhub-aalto/blob/8bb8c3f0d538641141c5024c272245f943747fd2/jupyterhub_config.py#L772
... requires a bit of care but in principle not too hard, and I think the idea works quite well.
@rkdarst thank you so much for sharing that!
This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:
https://discourse.jupyter.org/t/a-cull-idle-user-service-that-deletes-pvs/4742/13
Actually, the server state is already published, see this code - thanks to all participants of this conversation for getting there! Since the generic Spawner API does not prescribe any spawner-specific content and the DockerSpawner only adds a little information, this only needs some additional information regarding the names of the Docker volumes as presented before (see top).
Based on this input and the discussion in the forum, I would suggest that I first create my personal variation of the DockerSpawner that shares the information I need (i.e. the Docker volume name), and second create a copy of this service under a different name. That service would only remove Docker volumes if they have been unused for a long time (we might want different times for culling idle Jupyter Notebooks and for deleting whatever data the user believed we would persist for them).
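Such a DockerSpawner variation might look roughly like this (untested sketch; the volume_binds attribute and the state key name are assumptions on my side):

```python
# Sketch of a DockerSpawner subclass that publishes its Docker volume names
# in the server state so an admin service can delete them later.
from dockerspawner import DockerSpawner

class VolumeReportingSpawner(DockerSpawner):

    def get_state(self):
        state = super().get_state()
        # volume_binds maps resolved volume names to their bind configuration;
        # the keys are enough for a later 'docker volume rm'.
        state["volume_names"] = list(self.volume_binds.keys())
        return state
```

In jupyterhub_config.py one would then point c.JupyterHub.spawner_class at this subclass.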
For the long term, a plugin architecture as mentioned by manics sounds great to me too!
A plugin system for this somehow seems like a lot of work, since to me JupyterHub can already take plugins. But the difficulty of making a new service is too much; there is a lot of boilerplate. Imagine if there was just...
python my_service.py
...without needing to figure out an extra layer of a separate cron utility ("plugin" makes me think of entrypoints, which is a whole other layer to think about, and then does it have to be separately installed?). Of course I am biased since I have a working system... and don't have time to implement what I am suggesting. But I will work to use and debug whatever is implemented, if it doesn't add too many extra layers.
That sounds fine to me as long as I get the mentioned work done!
I think a service library also works, and even if we moved to a plugin architecture I expect the plugins would want this library anyway. What does everyone think about developing the library in this repo, then perhaps moving it to its own repo or JupyterHub core after it's had some production use?
From your quick explanation I still have many conceptual questions about the details; some visualizations might help, for example.
Regarding your development plan I am not sure. Why do you think it should move back to the core? I did not mean to split the community; I would rather see the code in this repository evolve step by step while maintaining backwards compatibility. Do you think that is too difficult, either in technical or project-administrative terms?
Ignore me! I misunderstood "A JupyterHub service library, that had the core event loop and periodic polling" as wanting a library in core JupyterHub. I'm completely happy for it to remain separate 🙂
Also would it be OK to keep the design discussions on one issue? Either this or https://github.com/jupyterhub/jupyterhub-idle-culler/issues/9, we can rename the issue title as necessary to make it clearer.
I am fine with that: at #9 we can discuss some general design issues, and here we can discuss how this can be used for the purpose I mentioned at the beginning of this issue; likewise, #4 can pick up the results from #9 when implementing it. Therefore, I guess this issue might get less attention for a while until a common conceptual approach is found.
Hi,
Is there any fix or update for this issue?
We are also facing the issue discussed here. We have JupyterHub deployed in a k8s cluster and we are using EFS (Elastic File System) as the PV (Persistent Volume). When we delete a JupyterHub user using the admin panel, the user is deleted but the PVC (Persistent Volume Claim) associated with that user is not deleted. If we create a new user with the same name, the PVC gets attached to the new user.
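Until something like this is built in, the PVC would have to be removed manually; a rough sketch with the Kubernetes Python client follows (the claim name template and namespace are assumptions based on zero-to-jupyterhub defaults):

```python
# Sketch: manually delete the PVC left behind after a user was removed.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

username = "user0"
v1.delete_namespaced_persistent_volume_claim(
    name=f"claim-{username}",  # default z2jh PVC name template (assumption)
    namespace="jhub",          # hub namespace (assumption)
)
```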
This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:
https://discourse.jupyter.org/t/a-cull-idle-user-service-that-deletes-pvs/4742/16
Hi,
I do not want the admin user in JupyterHub to be removed by the culler. I have deployed JupyterHub using the Helm chart.
Proposed change
At https://discourse.jupyter.org/t/a-cull-idle-user-service-that-deletes-pvs/4742/ it was recently discussed that in some settings the persisted data of a user should eventually be removed as well. This could be integrated into this service. I am really not sure whether it should be, because how to delete the persisted user data is spawner-specific or even configuration-specific.
Alternative options
We could say that these concerns should be addressed separately and that a second service should be created. The chances are pretty high, though, that this would duplicate code for identifying user accounts that exceed a certain age and have not been used for a while.
Who would use this feature?
This is reasonable for settings with temporary users, e.g. mybinder or weekend seminars, where you are sure you want to delete their data after some point.