a binder deployment with authentication and persistent storage

bitnik commented 5 years ago

Currently we are working on a binder deployment with authentication and persistent storage enabled and with a user interface in JupyterHub home page, where users can manage their repositories/projects.

For this purpose we have now a deployment running on https://notebooks-test.gesis.org/jupyter/. When you first login, you will see the JupyterHub home page (https://notebooks-test.gesis.org/jupyter/hub/home) with 2 parts: "Your projects" table and the classical binder form with some parts hidden:

ss1

Binder is running under https://notebooks-test.gesis.org/jupyter/services/binder/ and you can also use it but in this deployment the idea is that you don't need to use it directly.

How it works

Firstly some preliminary information:

Each user can start 1 server at a time (named servers are not activated)
Each user gets 1 persistent volume and it is mounted on /home/jovyan

Binder form

It is the classical form with 'share url' and 'badge url' parts are hidden. And it has 1 limitation: branch/tag/commit field is readonly and always "master". When user launches a repo via form:

always the latest version of the repo is built (last commit in master branch) and server is started with this image
nbgitpuller is used to pull the code under a sub directory /home/jovyan/{repo_dir}. repo_dir is generated by using provider name, user/org name and repo name. And server is started on that sub directory (you can start a new terminal and there you can list all directories of projects). nbgitpuller is not executed for the default repo (gesiscss/data_science_image).
each new launched repo is added into "Your Projects" table. This list is saved in state field of Spawners table and only last 10 launched repos are saved.

In short, binder form is used to create a new project and update it from remote.

Your Projects

When first login, user has there only the default repo (gesiscss/data_science_image). Each repo which is built and launched via binder form is added in this table and user can re-start that repository by using the start buttons on each row. When user clicks on a start button in the table:

A server started by using the image (commit) that user last time worked with
Right now it is not working but we want to skip nbgitpuller command execution on server start when server is started from projects table, so that user can continue working on where they left. We can do this by passing an option to spawner (I think this is very related to https://github.com/jupyterhub/binderhub/issues/712)
We are also thinking about having a delete button in the actions of table which removes the repository from the table and deletes the folder of the repo in user's persistent volume. Right now we have the button in the actions column but it doesn't do anything.

In short, "Your Projects" table is used to continue working on a repo (when you don't want o update the image or code base from remote).

Limitations and missing parts summary

nbgitpuller must be installed in user images, right now we use appendix to ensure its installation (maybe it can be added into repo2docker defaults)
Users can start a new project only from master (by using the binder form), they can't start to work on a repo from previous version/commit of it
Server start from table also executes nbgitpuller
Delete button doesn't do anything
Name generation of sub directories of each repo/project can be done better

Where to find helm config and custom templates

Helm config file for this deployment: https://github.com/gesiscss/orc/blob/binderinjhubgh/jupyterhub/config_test.yaml
And you can find the customised KubeSpawner here: https://github.com/gesiscss/orc/blob/binderinjhubgh/jupyterhub/config_test.yaml#L170-L227
Templates for JupyterHub (home.html is jupyterhub home page): https://github.com/gesiscss/orc/tree/binderinjhubgh/jupyterhub/docker/k8s_hub/templates

https://notebooks-test.gesis.org/jupyter/ uses github authenticator and everybody is welcome to login and try it out (it is just a test instance and will be deleted again). We really would like to get your feedback about what we have done so far. Probably most important question is if we are on the right track to accomplish what we want. And finally we are aware that there are a lot to improve for user interface.

betatim commented 5 years ago

This is pretty cool!

rabernat commented 5 years ago

This is kind of blowing my mind. It's exactly what we need for Pangeo.

bitnik commented 5 years ago

I deleted that deployment yesterday and did a new one today without any GESIS related parts in helm config and templates. I hope it helps people who had interest.

It works exactly same as previous one, just base url is changed from /jupyter/ to /: https://notebooks.gesis.org/. But I don't know how long this one will stay alive, because these day we are trying out many different things.

Config files:

persistent storage related helm config: https://github.com/gesiscss/example-binderhub-deployments/blob/master/persistent_storage/config.yaml
auth related helm config: https://github.com/gesiscss/example-binderhub-deployments/blob/master/auth.yaml
base helm config: https://github.com/gesiscss/example-binderhub-deployments/blob/master/config.yaml

Dockerfile for JupyterHub with custom templates (home template with binder form): https://github.com/gesiscss/example-binderhub-deployments/tree/master/persistent_storage/jupyterhub

betatim commented 5 years ago

One thing that I think would be nice is to have a symbolic link in each repository called "data" or "home" or something that points to a directory that is a sibling to the various repositories.

A directory structure like

/home/jovyan/repositoryA
/home/jovyan/repositoryA/data -> ../data
/home/jovyan/repositoryB
/home/jovyan/repositoryB/data -> ../data
/home/jovyan/data

to make it easier for people to navigate from the Jupyter file browser view to a place outside the current repository. Maybe data isn't unique enough a name, so MyData or PermanentStorage or $DeploymentData (so GesisData in this case) or something which is weird enough that it will hardly ever shadow something from the repository?

One idea floating around --target-repo-dir was to use /somewhere/else as the place to clone the directory to in repo2docker and use nbgitpuller (or a tool with similar semantics) to "move" the contents of the repo into /home/jovyan/repoA from there on launch. Would go someway towards solving the question of "the source repo has been updated, what should we do with the user's directory now?"

bitnik commented 5 years ago

@betatim thanks! I did your 1st suggestion. For now I named it persistent_storage (if a repo contains a file/folder with same name, symbolic link is not created).

But your 2nd suggestion is not clear to me. Persistent volume is already mounted on /home/jovyan, so it overwrites repo folder (cloned by repo2docker) and we already use nbgitpuller to clone and update the repo.

betatim commented 5 years ago

But your 2nd suggestion is not clear to me.

Didn't realise that you were already doing this.

ltetrel commented 5 years ago

This work is really interesting !

We are trying to do the same, the user can upload data from the web https://github.com/SIMEXP/Repo2Data (discussed here jupyter/repo2docker#460) into our server.

@bitnik We have a binder running on our server and were wondering how to "mount" the data in the user's notebook, and how to launch repo2data every time a user upload a new repository. Could you explain more in details how we could do that (for now we are not wondering about authentification) (using https://github.com/gesiscss/example-binderhub-deployments/blob/master/persistent_storage/config.yaml) ? Here is our config file :

jupyterhub:
  ingress:
    enabled: true
    hosts:
      - conp7.calculquebec.cloud
    annotations:
      ingress.kubernetes.io/proxy-body-size: 64m
      kubernetes.io/ingress.class: nginx
      kubernetes.io/tls-acme: 'true'

  hub:
    baseUrl: /jupyter/
  proxy:
    service:
      type: NodePort
  singleuser:
    memory:
       guarantee: 4G
    cpu:
       guarantee: 2

# BinderHub config
config:
  BinderHub:
    hub_url: https://conp7.calculquebec.cloud/jupyter
    use_registry: true
    image_prefix: cmdntrf/conp7.calculquebec.cloud-

service:
  type: NodePort

storage:
  capacity: 2G

ingress:
  enabled: true
  hosts:
    - conp7.calculquebec.cloud
  annotations:
    kubernetes.io/ingress.class: nginx
  https:
    enabled: true
    type: kube-lego
  config:
    # Allow POSTs of upto 64MB, for large notebook support.
    proxy-body-size: 64m

@betatim we are also thinking on showing the /data folder but in our case we don't really want to show all the details (medical data), so using just headers (with datalad for example) could be a solution.

Thanks to you both,

bitnik commented 5 years ago

@ltetrel as I understand, you don't want to have authentication but you want to mount a data volume into each anonymous user's pod. Then I assume this volume is already filled with some data and it is to be shared with all user pods as readonly, am I right? (I didn't try to do that so far)

And you also want to run repo2data and download requested data (data_requirement.json) every time a user launches a new repo if launched repo contains data_requirement.json. But where do you want to download that data? If into user's home directory, why don't you let user to do that with postBuild (as it is done here https://github.com/bitnik/binder_repor2data)?

I would like to help :) but I am a bit confused. Could you elaborate your goal and maybe we can continue discussing this in another issue?

ltetrel commented 5 years ago

Hi @bitnik and thank for your help :) We can continue the discussion here : https://discourse.jupyter.org/t/mounting-server-data-on-each-users-pod/641

arnim commented 5 years ago

Complementary issue #1003 with some nice ideas

ltetrel commented 5 years ago

Thanks @arnim But in our case we want persistent storage. We got it working by using these ideas here : https://discourse.jupyter.org/t/mounting-server-data-on-each-users-pod/641/4 We have a nfs storage mounted on each node to centralize the data administration and avoid duplication : https://github.com/neurolibre/neurolibre-binderhub/issues/18 We were also thinking to use an initContainer instead of putting repo2data into the config file. This has the advantage of making the process of downloading the data (if needed) more independent (running in a separate container instead).

bitnik commented 4 years ago

I am closing this issue. We can continue discussing this on https://discourse.jupyter.org/t/a-persistent-binderhub-deployment/2865.

jupyterhub / binderhub