jupyterhub / jupyterhub-idle-culler

JupyterHub service to cull idle servers and users
Other
102 stars 37 forks source link

Adding the option to delete persisted user data as well. #8

Open 1kastner opened 4 years ago

1kastner commented 4 years ago

Proposed change

At https://discourse.jupyter.org/t/a-cull-idle-user-service-that-deletes-pvs/4742/ recently it was discussed that in some settings the persisted data of a user should also eventually be removed. This could be integrated into this service. I am really not sure whether it should because it is spawner-specific or even configuration-specific how to delete the persisted user data.

Alternative options

We could say that these concerns should be addressed separately and a second service could be created. The chances are pretty high that there would be more code duplication though to identify user accounts that exceed a certain age and haven't been used for a while.

Who would use this feature?

This is reasonable for settings with temporary users, e.g. mybinder or weekend seminars. You are sure you want to delete their data after some point.

manics commented 4 years ago

Together with https://github.com/jupyterhub/jupyterhub-idle-culler/issues/4 it almost sounds like we want a jupyter-admin cron utility. Perhaps with plugins for each function?

1kastner commented 4 years ago

For me that sounds like a reasonable approach. General plugins could be maintained in separate repositories to keep this clean. I hope from such a plugin I could still somehow reach the JupyterHub configuration so that the plugin can look up the details, such as which user "owns" which docker volume that should be deleted.

Regarding #4, in some cases it could make sense to link the plugin with the data source of the authenticator since that is a place e.g. email addresses are also stored. But again this might be very configration-specific. An alternative would be to have a separate plugin configuration file. From what I have seen until now I guess it violates the design principles of one centralized JupyterHub configuration though? Here I have a lack of experience with the JuypterHub design philosophy.

meeseeksmachine commented 4 years ago

This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/a-cull-idle-user-service-that-deletes-pvs/4742/10

1kastner commented 4 years ago

At https://jupyterhub.readthedocs.io/en/stable/reference/services.html I checked that a service can not access the loaded JupyterHub configuration. I see three options out there and I don't like any of them:

1) Re-load the JupyterHub configuration inside this service to get the information regarding user-specific persisted data 2) Maintain a separate cleaning scripts which leads to duplicated configuration (what is the user's prefix? ...) 3) Refactor JupyterHub so that they can access the configuration and remove the previously constructed isolation

Any ideas on this?

manics commented 4 years ago

The JupyterHub API lets you obtain state: object. Arbitrary internal state from this server's spawner. Only available on the hub's users list or get-user-by-name method, and only if a hub admin. None otherwise. for each of a user's servers: https://jupyterhub.readthedocs.io/en/stable/_static/rest-api/index.html#/definitions/Server

There's also a discussion about making auth_state from the Authenticator accessible too: https://github.com/jupyterhub/jupyterhub/issues/1704

Do you think the combination of these would allow a service to request the necesary information?

1kastner commented 4 years ago

In my example configuration here the important part regarding the Docker volumes is listed as such:

notebook_dir = '/home/jovyan/'
c.DockerSpawner.notebook_dir = notebook_dir
c.DockerSpawner.volumes = { 'jupyterhub-user-{username}': notebook_dir }

The data stored in c.DockerSpawner.volumes would be sufficient for my purpose of deleting docker volumes now but I can't speak for k8s configurations. For that I would need a partner on that side.

1kastner commented 4 years ago

So let's see if I got you right here: You suggest the DockerSpawner tells via the server state which docker volume belongs to it (in my case jupyterhub-user-{username}) and this (already serliaized) information is added to the auth_state but is only visible if you are admin (which the JupyterHub service is).

Once the JupyterHub service has the information jupyterhub-user-user0, it can also run docker volume rm jupyterhub-user-user0 and everything is fine and solved! So yes, that is a viable solution.

manics commented 4 years ago

Something like that! But this is outside my knowledge of JupyterHub, @minrk will know better.

rkdarst commented 4 years ago

I do something similar to this (if I understand correctly) to set unique cull times:

https://github.com/AaltoSciComp/jupyterhub-aalto/blob/8bb8c3f0d538641141c5024c272245f943747fd2/scripts/cull_idle_servers.py#L180 https://github.com/AaltoSciComp/jupyterhub-aalto/blob/8bb8c3f0d538641141c5024c272245f943747fd2/jupyterhub_config.py#L772

... requires a bit of care but in principle not too hard, and I think the idea works quite well.

1kastner commented 4 years ago

@rkdarst thank you so much for sharing that!

meeseeksmachine commented 4 years ago

This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/a-cull-idle-user-service-that-deletes-pvs/4742/13

1kastner commented 4 years ago

Actually the server state is alreay published, see this code - thanks to all participants of this conversation to arrive there! Since the generic spawner api does not prescribe any spawner-specific content and the docker spawner does only add a little information, this only needs some additional information regarding the name of the docker volumes as presented before (see top).

Due to this input and also due to the discussion in the forum, I would suggest that I first create my personal variation of the dockerspawner that shares the information I need (i.e. the docker volume name) and second create a copy of this service using a different name. That service would only remove docker volumes if they are expired for a long time (we might want different times for culling idle Jupyter Notebooks and deleting whatever data the user believed we would persist for them).

For the long term, a plugin architecture as mentioned by manics sounds great to me too!

rkdarst commented 4 years ago

A plugin system for this somehow seems like a lot of work, since to me JupyterHub can already take plugins. But the difficulty of making a new service is too much, there is a lot of boilerplate. Imagine if there was...

Of course I am biased since I have a working system... and don't have time implement what I am suggesting. But, I will work to use and debug whatever is implemented, if it doesn't add too many extra layers.

1kastner commented 4 years ago

That sounds fine to me as long as I get the mentioned work done!

manics commented 4 years ago

I think a service library also works, and even if we moved to a plugin architecture I expect the plugins would want this library anyway. What does everyone think about developing the library in this repo, then perhaps moving it to its own repo or JupyterHub core after it's had some production use?

1kastner commented 4 years ago

From your quick explanation I have many detail-related conceptual questions, e.g. some visualizations might help etc.

Regarding your development plan I am not sure. Why do you think it should move back to the core? I did not mean to split the community - I would rather like to see that the code from this repository evolves step-by-step including backwards-compability. Do you think that is too difficult either in technical or project-administrative terms?

manics commented 4 years ago

Ignore me! I misunderstood "A JupyterHub service library, that had the core event loop and periodic polling" as wanting a library in core JupyterHub. I'm completely happy for it to remain separate 🙂

manics commented 4 years ago

Also would it be OK to keep the design discussions on one issue? Either this or https://github.com/jupyterhub/jupyterhub-idle-culler/issues/9, we can rename the issue title as necessary to make it clearer.

1kastner commented 4 years ago

I am fine with that at #9 we can discuss some general design issues and here we discuss how this can be used for the purpose I have mentioned in the beginning of this issue, likewise #4 can pick up the results from #9 when implementing it. Therefore, i guess this issue might get less attention for a while until a common conceptual approach is found..

AtulSinghBankoti commented 4 years ago

Hi,

Is their any fix update for this issue?

We are also facing similar issue discussed here. We have jupyterhub deployed in k8s cluster & we are using EFS(Elastic File System) as PV(Persistent Volume). When we delete a jupyterhub user using admin panel. User is deleted but the PVC(Persistent Volume Claims) associated with that user are not getting deleted. If we create a new user with same name, the PVC are getting attached to new user.

meeseeksmachine commented 4 years ago

This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/a-cull-idle-user-service-that-deletes-pvs/4742/16

ruchirkakkad commented 2 years ago

Hi,

I do not want to remove admin user in Jupyterhub by cull. I have implemented Jupyterhub using helm chart.