jupyterhub / binderhub

Run your code in the cloud, with technology so advanced, it feels like magic!
https://binderhub.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Persistent storage #377

Open choldgraf opened 6 years ago

choldgraf commented 6 years ago

I've had a number of people (especially at universities) ask me if it'd be possible to enable persistent storage with BinderHub. The most common use case seems to be using a BinderHub server to let teachers create repositories/environments for their classes, bootcamps, etc., while giving students their own "space" where things persist over time.

I anticipate these requests will only increase with time, but it's also a bit unclear to me exactly how this functionality would be combined with BinderHub. Maybe it could behave the same way that nbgitpuller does.

Just opening this issue since I noticed we don't have another place in this repo where we discuss the topic. I'm curious if people have thoughts on a path forward for something like this.

Current status

https://github.com/jupyterhub/binderhub/issues/377#issuecomment-353501247

The primary thing preventing us from using persistent storage with binder is Authentication. We have no idea which disk should be mounted for which user. #323 should help fix that

betatim commented 6 years ago

You could use this to provide access to data or longer term storage. Providing a mechanism/example config for people to setup read-only storage that gets mounted to say ~/data and a read-write storage mounted to ~/personal would be useful I think. One thing that is a bit tricky is that a repo could contain a directory with the same name as personal/ or data/. Maybe mount the repo at ~/repo instead of directly at ~ (https://github.com/jupyter/repo2docker/pull/134)
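A minimal sketch of what such an example config might look like, assuming a KubeSpawner-based hub. The claim names `shared-data` and `claim-{username}` and the helper `build_storage_config` are hypothetical, and a per-user claim only makes sense once the hub knows who the user is.

```python
# Sketch only: builds the dictionaries that KubeSpawner's `volumes` and
# `volume_mounts` options accept. Claim names below are hypothetical.

def build_storage_config(
    shared_claim="shared-data",            # hypothetical pre-existing PVC
    user_claim="claim-{username}",         # assumes KubeSpawner expands {username}
    home="/home/jovyan",
):
    """Return (volumes, volume_mounts) for a read-only ~/data and a read-write ~/personal."""
    volumes = [
        {
            "name": "shared-data",
            "persistentVolumeClaim": {"claimName": shared_claim, "readOnly": True},
        },
        {
            "name": "personal",
            "persistentVolumeClaim": {"claimName": user_claim},
        },
    ]
    volume_mounts = [
        # Shared data: visible to everyone, not writable.
        {"name": "shared-data", "mountPath": f"{home}/data", "readOnly": True},
        # Per-user space: writable and kept between sessions.
        {"name": "personal", "mountPath": f"{home}/personal"},
    ]
    return volumes, volume_mounts


volumes, volume_mounts = build_storage_config()

# In jupyterhub_config.py this would be wired up roughly as:
#   c.KubeSpawner.volumes = volumes
#   c.KubeSpawner.volume_mounts = volume_mounts
```

A repository that ships its own `data/` or `personal/` directory would be hidden by these mounts, which is the naming clash described above.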

yuvipanda commented 6 years ago

The primary thing preventing us from using persistent storage with binder is Authentication. We have no idea which disk should be mounted for which user. #323 should help fix that.

@betatim Mounting the repo in a path under $HOME won't help, since mounting a persistent volume at $HOME will just hide everything underneath it. What you need is the ability to mount the persistent volume somewhere else, which is already supported. We could also consider a postStart hook that copies files over. We aren't dependent on that repo2docker PR for this particular feature, IMO.
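A rough sketch of what a copying postStart hook could look like with KubeSpawner's `lifecycle_hooks` option. The direction of the copy is an assumption: here the repository contents are assumed to live at a hypothetical `/repo` inside the image and are copied into a persistent home directory without overwriting files already there. The helper name and both paths are illustrative only.

```python
import shlex

def copy_on_start_hook(src="/repo", dest="/home/jovyan/my-repo"):
    """Build a Kubernetes lifecycle hook that copies src into dest on container start.

    Both paths are hypothetical. `cp -n` skips files that already exist,
    so a user's earlier edits in dest are left alone.
    """
    script = (
        f"mkdir -p {shlex.quote(dest)} && "
        f"cp -rn {shlex.quote(src)}/. {shlex.quote(dest)}/"
    )
    return {"postStart": {"exec": {"command": ["/bin/sh", "-c", script]}}}


hook = copy_on_start_hook()

# In jupyterhub_config.py this would be assigned roughly as:
#   c.KubeSpawner.lifecycle_hooks = hook
```

Kubernetes does not guarantee that a postStart hook finishes before the container's main process starts, so the copy may still be running when the notebook server first comes up.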

betatim commented 6 years ago

You could mount the repository at ~/repo and a read only volume at ~/data and they'd happily coexist no?

You could also mount the shared/persistent volumes somewhere like /data but then you can't navigate there with the jupyter tree view because that is rooted in ~/.

yuvipanda commented 6 years ago

I agree! It's not necessary for supporting persistent volumes, but very nice to have! Authentication is a blocker though.

choldgraf commented 6 years ago

ok cool - I've updated the top-level comment w/ the current state of this issue so we can keep track of what needs to be done

ctb commented 6 years ago

Curious - this is handled by JupyterHub, right? The difference is that JupyterHub doesn't build new images or use multiple images...?

choldgraf commented 6 years ago

@ctb yep, JupyterHub can serve a pre-existing docker image that's in a registry, but it doesn't have the machinery to automatically build/register images from git repositories. Hopefully as @yuvipanda says, we can eventually merge these so they don't have to be two totally separate things, then Binder is more of a service, rather than a service and a specific piece of technology that's custom built for it.

arnim commented 5 years ago

Authentication, the primary thing that was preventing us from using persistent storage with binder, is now in place (https://github.com/jupyterhub/binderhub/pull/666), and we would like to go ahead with persistent storage. However:

"in a vanilla jupyterhub we mount the persistent disk to /home/jovyan now we combine them and ... mount both to /home/jovyan?" (@betatim, on Gitter)

A number of proposals have been discussed, which seem to fall roughly into these categories:

1. repository at ~/repo and persistent storage at ~/data
2. repository at ~ (as is the case now) and persistent storage at ~/data
3. persistent storage at ~ and repository at ~/<repo> or ~/<repo-name>

Each user would then have his own persistent storage that is shared across sessions.

I think it would be desirable to have persistency at ~ (as in jupyterhub) and having it at ~/data seems already to be possible with something like jupyterlab-google-drive. This makes option 3 look currently the most useful to me. What are your thoughts?

We would like to implement persistency, and while there have already been numerous discussions in different directions (@yuvipanda here or @nthiery here), it would be good to have a better understanding of the consequences of the different options (e.g. are there repositories that assume they are mounted at ~, and what could be the role of nbgitpuller?).

betatim commented 5 years ago

I am working on setting up a hub with auth and persistent storage (exploring options for @nthiery) over the next few weeks -> we should coordinate.

My short term plan to get something working and used by people is to mount the repo to ~/repo and the persistent volume to ~/home.

The next iteration would be to explore having /home/jovyan be a persistent volume, with repo2docker copying the contents of the repository to /repo and nbgitpuller copying/pulling stuff over to /home/jovyan/<repo-name> when the container launches. This means you'd get the semantics of nbgitpuller for keeping changes to the repo (or not).
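As a hedged illustration of the launch-time step, the sketch below builds a postStart hook that calls nbgitpuller's `gitpuller` command line tool. It assumes the tool is installed in the image and takes a source, a branch and a target directory as positional arguments. The helper name, the branch and the repository name are hypothetical, and whether `/repo` can serve as a git source depends on how repo2docker lays out the image.

```python
def pull_on_launch_hook(source, branch, repo_name, home="/home/jovyan"):
    """Build a lifecycle hook that pulls `source` into ~/<repo-name> at launch.

    `source` may be a remote git URL or, as an untested assumption, a local
    path such as /repo if that directory is itself a git repository.
    """
    target = f"{home}/{repo_name}"
    return {
        "postStart": {
            "exec": {
                # Assumed CLI shape: gitpuller <source> <branch> <target-dir>
                "command": ["gitpuller", source, branch, target],
            }
        }
    }


# Hypothetical values, for illustration only.
hook = pull_on_launch_hook("/repo", "master", "my-repo")

# In jupyterhub_config.py this would be assigned roughly as:
#   c.KubeSpawner.lifecycle_hooks = hook
```

Compared with a plain copy, pulling through nbgitpuller is what would provide the merge behaviour mentioned above, where a user's local changes are kept when the upstream repository is updated.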

Both require some work on repo2docker, BinderHub and how the hub is deployed. What do you think of this kind of two stage approach? I'm not sure I am 100% convinced of the second phase yet as the "perfect" solution (and it will require a bit of work in repo2docker) hence going for something simpler first to gain some more ideas and experience.

Some things I am pondering:

arnim commented 5 years ago

I am working on setting up a hub with auth and persistent storage (exploring options for @nthiery) over the next few weeks -> we should coordinate.

Sure

My short term plan to get something working and used by people is to mount the repo to ~/repo and the persistent volume to ~/home.

I think we had already something like this running. @bitnik is that correct?

The next iteration would be to explore having /home/jovyan be a persistent volume, with repo2docker copying the contents of the repository to /repo and nbgitpuller copying/pulling stuff over to /home/jovyan/<repo-name> when the container launches.

This is what would imo allow the users to keep their expectations on how JHub behaves and is what we are currently aiming at. Yet, we are likewise not 100% convinced that this is the final "perfect" solution.

bitnik commented 5 years ago

I am working on setting up a hub with auth and persistent storage (exploring options for @nthiery) over the next few weeks -> we should coordinate.

Sure

so how to coordinate best?

I think we had already something like this running. @bitnik is that correct?

Yes, we once tried it using KubeSpawner.lifecycle_hooks.postStart, which runs some cp, rm and ln commands, but I am not sure it was a good implementation.