jupyterhub / mybinder.org-deploy

Deployment config files for mybinder.org
https://mybinder-sre.readthedocs.io/en/latest/index.html
BSD 3-Clause "New" or "Revised" License

Some steps to explore supporting a "default environment" #1474

Open choldgraf opened 4 years ago

choldgraf commented 4 years ago

In a recent conversation on gitter @betatim and I brainstormed some ideas about how we could support "default environments" better. E.g., an environment that users don't have control over, but that they can use in combination with their files.

Here are a few steps that we could take - sharing them here in case others have thoughts/comments, and so we don't forget :-)

  1. ~Create a repository that we think has an environment that covers 90% of "I just want it to work, and fast" use-cases~
    • Decided this probably wasn't a good idea from a maintenance perspective; we should probably instead use a docker-stacks image, or something another org maintains, like the Kaggle image.
  2. Document nbgitpuller functionality more cleanly, recommending that people use this repository as a default (the URL pattern involved is sketched after this list)
  3. Add a form to do this at the nbgitpuller docs (or the Binder docs?) semi-automatically and advertise it / document it (see https://github.com/jupyterhub/nbgitpuller/issues/125 for reference)
  4. Wait and see how people use these steps, collect some data about it
  5. Explore adding a "default environment" to BinderHub (or to mybinder.org), decide if it's a good idea
  6. Explore adding a short-hand for the nbgitpuller pattern we documented in 2, decide if it's a good idea
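For reference, the nbgitpuller pattern from step 2 combines a fixed environment repo with a content repo in a single launch URL. A rough sketch of its shape (placeholder names; the nested query is percent-encoded, so the content-repo URL ends up double-encoded):

```
https://mybinder.org/v2/gh/ENV_ORG/ENV_REPO/BRANCH?urlpath=git-pull%3Frepo%3Dhttps%253A%252F%252Fgithub.com%252FORG%252FCONTENT_REPO
```

Binder builds the image from the environment repo; nbgitpuller then pulls the content repo into the running session.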
jhamman commented 4 years ago

cc @scottyhq and @rabernat who have been pushing on this on the Pangeo side for a while now.

choldgraf commented 4 years ago

would love some thoughts from y'all 👍

rabernat commented 4 years ago

It's great and necessary to be able to support arbitrary environments. But I believe that the scipy ecosystem would be well served by standardizing around a smaller set of common environments, maintained and updated via CI, and released on roughly a monthly frequency. These could then be reused in cloud-based hubs, binders, and local environments.
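For a sense of what "maintained and updated via CI, released roughly monthly" could look like in practice, here is a hypothetical GitHub Actions workflow (the repo, image name, and secret are all placeholders, not Pangeo's actual setup):

```yaml
# .github/workflows/monthly-release.yml (hypothetical)
name: monthly-image-release
on:
  schedule:
    - cron: "0 0 1 * *"   # first day of every month
  workflow_dispatch: {}    # allow manual out-of-band releases
jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Build the environment image with a dated tag
        run: docker build -t example-org/science-env:$(date +%Y.%m) .
      - name: Push to the registry
        run: |
          echo "${{ secrets.REGISTRY_TOKEN }}" | docker login -u example-org --password-stdin
          docker push example-org/science-env:$(date +%Y.%m)
```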

This is what we are haltingly moving towards in Pangeo. See for example,

choldgraf commented 4 years ago

Just a note that there's prior art here from the R community too:

https://www.rocker-project.org/ runs "community images for the R community" that many sub-communities then build off of. In fact, the holepunch project basically replicates the functionality we're describing here (though in that case, specifically for the R community).

also re: standardizing on a set of images, I agree w/ that - though I don't think the Binder team wants to be the ones in charge of that curation, just from a maintenance and organizational perspective

choldgraf commented 4 years ago

I took a look at jupyter/docker-stacks, and they seem to have some pretty nice images across a few languages already:

(screenshot of the available docker-stacks images)

https://github.com/jupyter/docker-stacks

I wonder what the steps are to make those images "binder ready". Maybe @minrk or @parente have ideas?

scottyhq commented 4 years ago

Hi @choldgraf - apologies for not doing a search beforehand, so maybe this is already in a separate issue or has come up in the past.

Is it possible for BinderHub to bypass the linked image registry and pull directly from DockerHub?

I think this would be really useful for "default environments" for a couple of reasons. For example, we want to use pangeo/pangeo-notebook:latest as the default binder image, so we put that single line in a repo Dockerfile and generate a binder link with nbgitpuller:
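A minimal sketch of that Dockerfile (assuming the upstream image is binder-compatible; the actual Pangeo file may differ):

```dockerfile
# The entire repo Dockerfile: inherit everything from the upstream image
FROM pangeo/pangeo-notebook:latest
```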

1) This works as expected the first time, but takes longer than you might expect because the image is pulled to a binder build pod and then pushed back to the linked registry with a corresponding repo sha tag (if I'm not mistaken).

2) A current problem/inconvenience is that if the 'latest' image is updated in a separate repo like pangeo-docker-images or jupyter/docker-stacks, the binder image remains outdated because the binder-enabled repo does not have a new commit.

Of course, there is an issue with using "latest" images for reproducibility. But, it is very convenient for specifying the most up-to-date image without constantly updating explicit tags.
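For what it's worth, the usual mitigation is to pin a dated tag or an image digest instead of latest (the tag and digest below are placeholders):

```dockerfile
# Pin a dated tag for approximate reproducibility (tag is a placeholder)
FROM pangeo/pangeo-notebook:2020.05.01
# Or pin the exact digest for full reproducibility:
# FROM pangeo/pangeo-notebook@sha256:<digest>
```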

betatim commented 4 years ago

@scottyhq I think it would be better to start a forum thread for your question about images that inherit from foobar:latest and how to update them when foobar is updated. I think the answer(s) to that question are too big to also have in this thread :D

I don't think this thread is a good place to discuss if/which/if not/where the data science community should maintain a unified image. That is such a big question with a lot of trade-offs that even if we decided "yes, let's do it" I'd vote for mybinder.org to wait on the order of 6 months before adopting it for the "default env" use case. This is because we want to follow, not lead. I think it would be a good topic to do some research into via a forum thread: what would be in such an image, who maintains it, has someone tried this before, is it already happening, etc.

are the docker-stacks images "binder ready"?

https://mybinder.org/v2/gh/jupyter/docker-stacks/master?filepath=README.ipynb makes me think the answer is yes. This uses a Dockerfile with nearly no content that inherits from one of the simplest docker-stacks images. I'd expect the images with more packages that inherit from this image to also work (or need minimal customisation).

My vote would be to use one of the docker stacks images. Ideally one with some "data science" stuff already installed, not one of the base ones. Concretely I'd vote for the datascience-notebook image.

minrk commented 4 years ago

Some background on docker-stacks: maintaining those stacks has proven to be a massive maintenance challenge and we haven't been able to keep up, in no small part because we have to frequently make arbitrary decisions about "does X package belong in Y stack?" and "should we support X use case?" Plus, the layering and inheritance of those as a family of images instead of independent stacks, on top of the growing variety of contexts they support (setting uid at runtime, user install permissions or not, etc.), makes them super complicated and huge. This has proven unsustainable, and we have been pushing for most docker-stacks users to switch to repo2docker instead, since it's vastly simpler and easier to maintain and control.

At the very least, I think we should be switching the docker-stacks maintenance to repo2docker builds of specific environment.yml files rather than custom Dockerfiles, which in turn suggests that we should be maintaining repo2docker 'stack' repos (or one repo with subdirectories, after adding subdirectory support to repo2docker) and deprecating docker-stacks as it is.

A warning though: if we start maintaining our own "binder" stacks where each stack is a single environment.yml, we will give ourselves the exact same "who gets to decide what's in a stack?" problem that plagues docker-stacks.

parente commented 4 years ago

I believe maintaining the docker-stacks got significantly better once we outlined the scope of the project and how the community could contribute (e.g., recipes and how to contribute them, community stacks and a place to list them, selection criteria for new features). That scoping happened relatively late in the 5 year project lifespan and so there are a significant number of packages, startup scripts, docker args, etc. that we now maintain in the name of stability in a wide variety of environments (e.g., local use, k8s, jupyterhub). I believe the images in the docker-stacks project are sustainable as-defined for the use cases they currently support, but see any extension of that scope as untenable (e.g., new images, new container runtime environment support, new startup hooks, new permission models).

To that end, I agree with @minrk on three points:

cc: @romainx who has been helping maintain the stacks in the past 9 months and may have other insights to share

choldgraf commented 4 years ago

Note - we've now got Binder support in the nbgitpuller.link page:

https://discourse.jupyter.org/t/how-to-reduce-mybinder-org-repository-startup-time/4956/16

For example, here's an nbgitpuller form link with the repository already filled out:

nbgitpuller.link/?tab=binder&repo=https://github.com/binder-examples/requirements

manics commented 4 years ago

In case you weren't aware, there's currently a problem with nbgitpuller.link: https://github.com/jupyterhub/nbgitpuller/issues/130

choldgraf commented 4 years ago

Thanks for the heads up, should be fixed by https://github.com/jupyterhub/nbgitpuller/pull/134

minrk commented 4 years ago

Nice, with all that, a rough plan could be:

  1. figure out what belongs in such a stack, and clearly scope what it's for and what it's not for, so we can comfortably maintain it
  2. use a standard repo2docker repo (i.e. an environment.yml with the agreed-upon stuff; a sketch follows this list) and nbgitpuller, and document this use case. This alone should improve the typical experience if the repo becomes popular enough: it will at least eliminate build time from the user experience, since the repo will ~never rebuild except when the stack is updated. Using r2d+nbgitpuller also means there is a relatively small cost for folks who want to maintain their own, different stack - no pool optimization for them, but the rest works the same
  3. put a link to nbgitpuller.link on the main mybinder.org page to promote the use case?
  4. investigate pooling for this one repo
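To make step 2 concrete: the stack repo would be little more than a single environment.yml. A hypothetical sketch (the actual package list is exactly the step-1 decision):

```yaml
# environment.yml for the shared "default environment" repo (hypothetical)
name: binder-default
channels:
  - conda-forge
dependencies:
  - python=3.8
  - numpy
  - pandas
  - matplotlib
  - scikit-learn
  - nbgitpuller  # needed so content repos can be pulled in at launch
```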

I see three main strategies to implementing pooling for Binder repos:

  1. It could basically be tmpnb exactly as it was, but using the BinderHub API instead of the JupyterHub API. That is, a separate application backed by BinderHub rather than a feature inside BinderHub. The advantage of this one is that it could be prototyped right now as a standalone service, without any changes to the existing service.*
  2. Internal pooling support in BinderHub, so it's transparent to all APIs and everything, just an optimization of how launches behave for this one (or few?) repos. The challenge here is syncing pool state for what has been a thus-far stateless BinderHub application (we have two replicas, and they need to share the pool somehow without any race conditions or leaks). This was nontrivial for a single-process tmpnb, so might be a bit of a challenge.
  3. Internal pooling support at the Spawner level, i.e. the pods are running, but JupyterHub is not aware of them, and Spawner.start() pulls one from the pool very quickly. The challenge here would be figuring out the URLs and such that aren't typically known before start.

Option two seems like the best balance, but it has its challenges.

* Something we'll have to figure out is idle-culling. If we are spawning servers that JupyterHub is aware of and leaving them idle, we'll need to make sure the culler doesn't shut them down while they are waiting in the pool, but does shut them down when they become idle 'for real'.
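To make option 1 a bit more concrete, here is a rough sketch of the launch half of such a standalone pooler, driven by BinderHub's public /build event-stream endpoint (the pool bookkeeping and the culling questions above are left out):

```python
import json
import requests

def launch_one(repo_spec="gh/binder-examples/requirements/master"):
    """Launch a server via BinderHub's event-stream API; return (url, token).

    A pool service would call this ahead of demand and hand the resulting
    servers out to users, tmpnb-style.
    """
    resp = requests.get(f"https://mybinder.org/build/{repo_spec}", stream=True)
    for line in resp.iter_lines():
        if not line.startswith(b"data:"):
            continue  # skip keepalives and blank lines
        event = json.loads(line[len(b"data:"):])
        if event.get("phase") == "failed":
            raise RuntimeError(event.get("message", "launch failed"))
        if event.get("phase") == "ready":
            return event["url"], event["token"]
    raise RuntimeError("event stream ended before the server was ready")
```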

betatim commented 4 years ago

Thanks a lot for the "experience report" on docker stacks. I agree we should make our own with a very clear scope (no binder-stacks, just one binder-stack :D).

I like the strategy you proposed, Min. One tweak/variation, an option 2.5 maybe: what about providing a simple webservice that takes a URI like /gh/org/repo/blob/master/ (basically the first part of a github.com URL), generates the appropriate mybinder.org/v2/gh/org/repo/...?gitpuller... link from it, and redirects the user there with a 302 (a sketch follows at the end of this comment)?

The motivation for it is:

  1. nicer URLs and one day maybe full support for also adding a filepath/urlpath to the generated URL so that documents are directly opened
  2. doesn't create the expectation that the environment can be selected by the creator of the URL

If this sees a lot of use, people would get a lot of speed improvements just from the fact that they are already using a shared image. Then, in a second step, we could add the pooling from (2) as an optimisation that makes launch times even faster. But we wouldn't have to solve the hard problem of pooling while staying stateless first; we can solve it later.
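A minimal sketch of such a redirect service (the fixed environment repo and port are placeholders; note the nbgitpuller query has to be percent-encoded inside the urlpath parameter):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import quote

# Hypothetical fixed environment repo that provides the shared image
ENV_REPO = "binder-examples/jupyter-stacks-datascience"

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expect paths like /gh/<org>/<repo> (the first part of a github.com URL)
        parts = self.path.strip("/").split("/")
        if len(parts) < 3 or parts[0] != "gh":
            self.send_error(404, "expected /gh/<org>/<repo>")
            return
        content_repo = f"https://github.com/{parts[1]}/{parts[2]}"
        # The git-pull query lives inside urlpath, so it gets encoded once more
        pull = quote(f"git-pull?repo={quote(content_repo, safe='')}", safe="")
        target = f"https://mybinder.org/v2/gh/{ENV_REPO}/master?urlpath={pull}"
        self.send_response(302)
        self.send_header("Location", target)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8000), RedirectHandler).serve_forever()
```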

choldgraf commented 4 years ago

I just put together a little prototype to see how this feels with our current docker stacks. I made this repo:

https://github.com/binder-examples/jupyter-stacks-datascience

It simply pulls the datascience notebook and then pip installs nbgitpuller.
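A reconstruction of what that repo's Dockerfile presumably looks like (not copied from the repo; the real file may pin a specific image tag):

```dockerfile
FROM jupyter/datascience-notebook

RUN pip install --no-cache-dir nbgitpuller
```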

Now we can create partially-filled nbgitpuller links where users only need to add their content repo and then their mybinder.org link is ready:

nbgitpuller.link?tab=binder&repo=https://github.com/binder-examples/jupyter-stacks-datascience

re: @minrk's comments, I think those all sound great. Definitely +1 on using a "regular" repository that is compatible with Binder instead of a Dockerfile. I'd also recommend being fairly strict about "we will not add your special library just because you want it", because we don't want to create a whole new maintenance burden for ourselves and replicate the challenges that docker-stacks has already had.

betatim commented 4 years ago

👍 on marking the default env as "might change without notice; no contributions or issues, please. If you need control over your environment, please ship your own." That is the way to go. We might open it up at a later point or devise a regular schedule for compatibility/maintenance/etc., but I'd start with something that is clearly marked as "beta" (in the original sense, not the Google sense).

I think naming of the "product" or feature can help set expectations. So for example I'd not call it "default env" and instead maybe something like "scratchpad". Something like "default env" makes me think it is "recommended" or "where you should start" or "what you should use if you don't know better". I'd want my packages in the "default env" because it is what is used by default, etc. A bit like "native support for X" is (somehow) better than "X is not natively supported". A scratchpad sounds more like a temporary thing, that you might use to try something out and then bin it. It is temporary and for trying stuff.

As input for pondering what could/should be installed, this is what pip freeze says when you run it in the current Colab environment.

choldgraf commented 4 years ago

I'm also +1 on justifying the things that go into the environment by simply referring to some other popular environment (Kaggle and Colab seem like the obvious ones here)

I like scratchpad. I actually use Binder for this fairly often, funnily enough. If I just wanna quickly try out a vanilla API or something, it's faster for me to go to a mybinder.org link since it's cached in my browser history.

jgwerner commented 4 years ago

I thought I would chime in since my PR was linked to this issue and share why and how we combine repo2docker and jupyter/docker-stacks for our use-case.

jupyter/docker-stacks

repo2docker

We decided to fuse the best of both projects and build our docker-stacks images using the following general steps (a condensed Dockerfile sketch follows the list):

  1. Build a standard repo2docker image
  2. Configure a multi-stage build to:
    • Build using the repo2docker image as the base image
    • Copy all start-* files, fix-permissions, and jupyter_notebook_config.py from the jupyter/base-notebook image into the repo2docker-based image.
    • Create the folders required for hooks and set permissions accordingly.
    • Update env vars based on the jupyter/docker-stacks definitions, such as CONDA_DIR, NB_GID, etc.
    • Replace the repo2docker script with tini as the entry point. (Not strictly necessary, but we did it to stay in line with docker-stacks.)
    • Set default image command to start-notebook.sh.
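A condensed sketch of that second stage (image names are hypothetical and the paths follow the docker-stacks layout as I understand it; exact values will differ):

```dockerfile
# Stage 1 is the repo2docker build itself; this stage starts from its output
FROM example-org/r2d-built-image:latest

USER root

# Copy the docker-stacks startup machinery from the upstream base image
COPY --from=jupyter/base-notebook:latest \
     /usr/local/bin/start.sh \
     /usr/local/bin/start-notebook.sh \
     /usr/local/bin/start-singleuser.sh \
     /usr/local/bin/fix-permissions /usr/local/bin/
COPY --from=jupyter/base-notebook:latest \
     /etc/jupyter/jupyter_notebook_config.py /etc/jupyter/

# Align env vars with docker-stacks conventions (values are examples)
ENV CONDA_DIR=/opt/conda \
    NB_GID=100

# Use tini as PID 1, as docker-stacks does (assumes tini is installed)
ENTRYPOINT ["tini", "-g", "--"]
CMD ["start-notebook.sh"]

USER 1000
```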

The advantage of this solution is that we have two images available: one compatible with BinderHub out of the box via repo2docker, and another compatible with the upstream jupyter/docker-stacks images thanks to a smallish layer added on top of the repo2docker-based image. This way the end user can decide what packages and kernels to add to their base image without a lot of fuss, and additionally has the option of taking advantage of the jupyter/docker-stacks features for local docker runs or with JupyterHub.

meeseeksmachine commented 3 years ago

This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/variable-startup-times-with-a-rstudio-based-binder-example/9172/4

meeseeksmachine commented 3 years ago

This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/use-published-docker-image-for-binder/10333/3