One-stop location for Jupyter stacks

parente commented 9 years ago

I'm writing this up as an issue because I'm not sure if there's a more appropriate place to post it. If there is, please say the word and I'll move it there.

Proposal

I propose that the Jupyter project have a place to host pre-configured Jupyter "stacks". These stacks would include Dockerfiles and associated resources that facilitate getting started with Jupyter projects. For example, consider the following non-exhaustive list of possibilities:

Jupyter Notebook pre-configured with a package manager and common packages for scientific computing in Python
Jupyter Notebook pre-configured with a package manager and common packages for scientific computing in R
Jupyter Notebook pre-configured with a package manager and common packages for scientific computing in [insert language name here]
Jupyter Notebook configured with everything plus the "kitchen sink" as a demonstration of its capabilities (i.e., the current Docker jupyter/demo repository, but without the tmpnb.org configuration)
Jupyter Notebook configured with the "kitchen sink" plus the few additional files and configurations for running on try.jupyter.org
Jupyter Notebook pre-configured with Spark in local mode and the languages it supports
Jupyter console like the above
Jupyter Hub pre-configured with OAuth
etc.

Rationale

The Docker community is making great strides in ease of running complicated software stacks both locally and in the cloud. The Jupyter community can leverage Docker to facilitate trying out various Jupyter stacks, especially if the community offers:

Docker images that the community hosts in the cloud (e.g., try.jupyter.org)
Docker images that users can host themselves, either locally (e.g., docker-machine create -d virtualbox, docker run jupyter/), or in the cloud (e.g., docker-machine create -d , docker run jupyter/)

Possible Approaches
1. Restructure this git repository (github.com/jupyter/docker-demo-images) so that it may contain stacks other than the one for try.jupyter.org.
Pros: It already exists and already holds the initial image for try.jupyter.org. Cons: It needs a re-org and may never have been intended to host more than try.jupyter.org. The name jupyter/docker-demo-images is also misleading if these images can be used for real work, not just experimentation.
2. Create a new git repository github.com/jupyter/docker-stacks to host the proposed stack definitions.
Pros: It's a fresh start and can be structured around this proposal. Cons: If the source for the Docker jupyter/demo repository already lives in github.com/jupyter/docker-demo-images, it's immediately out of place if the new github.com/jupyter/docker-stacks exists (unless the former is somehow special and belongs in its own GitHub repo.)

In either case, the goal would be to have Docker Hub automatically build and host images for all of the Dockerfiles contained within the GitHub repository.

Restructure Proposal for Option 1

If the community decides to go with option 1, a strawman for reorganizing the github.com/jupyter/docker-demo-images repository could be:

# definition of a common base image for all images that can be built by this repo (does this even make sense?)
common/
    profile_default/
    Dockerfile

# based on common with the notebooks, datasets, branding, etc. specific to try.jupyter.org
tmpnb-demo/
    datasets/
    notebooks/
    templates/
    profile_default/
    Dockerfile

# other example stacks that might be based on common as well
pyspark-notebook/
scala-spark-notebook/
scipy-notebook/
scipy-console/
jupyterhub-github-oauth/
...

Proposal for Option 2 Git Repo Structure

Create github.com/jupyter/docker-stacks. As the set of Docker repositories in this project grows, perhaps a common base will emerge. Until that time, it's not clear whether there is a base that could/should be reused across stacks. So in this proposal, the git structure would simply start off as a set of folders, one per stack, with Docker Hub pointing to the Dockerfile in each subfolder to build a Docker repository on each commit to master.

# no required common base, each folder hosts the Dockerfile and assets for the stack
tmpnb-demo/
pyspark-notebook/
scala-spark-notebook/
scipy-notebook/
scipy-console/
jupyterhub-github-oauth/
...
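
The one-folder-per-stack layout above could also be built locally with a short loop. This is a minimal sketch, not the project's build tooling: the function name `list_stack_builds` and the `jupyter/<folder>` tag scheme are assumptions made for illustration. It only prints the `docker build` commands, so nothing is built until the output is piped to a shell.

```shell
#!/bin/sh
# Print one `docker build` command per stack folder that contains a Dockerfile.
# Assumption: each image is tagged jupyter/<folder name>.
list_stack_builds() {
    repo_root=$1
    for dockerfile in "$repo_root"/*/Dockerfile; do
        # Skip the unexpanded glob when no folder has a Dockerfile.
        [ -f "$dockerfile" ] || continue
        stack_dir=$(dirname "$dockerfile")
        stack_name=$(basename "$stack_dir")
        echo "docker build -t jupyter/$stack_name $stack_dir"
    done
}
```

Running `list_stack_builds . | sh` from the repository root would then build every stack in turn.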

Of course, I'm open to other options than what I pitched here. These are just to kick start a conversation. I'm also happy to help out with the work in any case.

Carreau commented 9 years ago

I think one of the problems is that people want stacks to be composable, and the bigger the stack is, the longer it takes to rebuild.

It would be nice to find a way for the stack of each kernel to exist separately, and then to install all the stacks and merge them (NFS mount?) on the final system.

parente commented 9 years ago

Composable stacks would be cool, but having tried composition-not-inheritance with Docker in other contexts, I'm wary. When OS-level packages are in the mix, it's hard to get a clean separation for mount point tricks. For example, if IRuby needs gsl-config but IPython does not, ideally apt-get install libgsl0-dev only resides in the IRuby container. But libgsl0-dev installs stuff down in /usr/lib which, of course, is also the target for other dependencies from IPython, R, Julia, etc. which all need to be mixed in.

I have seen some trickery played with composing Docker images by mixing and matching the filesystem layers across Docker repositories, but it's hacky and definitely not something that Docker Hub is going to automatically do if automated builds are a requirement.

Failing some new container magic that enables composition, the robust option that works today is to grow an ecosystem of container options. If someone contributes a pure Python stack and someone else contributes a pure R stack, but you really want a Python + R stack, yes, you have to join the two Dockerfiles in a sane way. But after the initial build to make sure things are functioning for a PR, all the rebuild time problems should fall on Docker Hub if the automation is configured correctly.

All that said, I still think it would give potential Jupyter users a leg up to be able to easily docker run some simple stacks to start and see where demand leads from there.

rgbkrk commented 9 years ago

First off, I'm a big fan of breaking these out like this. People have come to expect reasonable collections and we've outgrown what we put in place as experiments in the past.

For some background history, "common" (which turns into jupyter/minimal) was put together to save (re)build time on just the base notebook layer for tmpnb.org / try.jupyter.org. The name is a misnomer more broadly but makes sense in the context of just the try jupyter demo images. The jupyter/minimal name needs to go, but the structure of this repo should stay the same. It could undergo a name change and doc update if that helps alleviate confusion, though.

Originally, our stacks were in ipython/docker-notebook (still are). Most of what held me back from migrating was letting the big split happen.

/cc @minad who has an IRuby demo

minad commented 9 years ago

@rgbkrk We have the iruby installation in a dockerfile here: https://github.com/SciRuby/sciruby-notebooks/blob/master/Dockerfile

I think it would make sense to provide different Dockerfiles for the different environments. But for try.jupyter.org I would try to keep a single image.

Does docker provide something like includes? This would allow splitting the Dockerfiles without duplicating code and would lead to a nice composition, as @parente mentioned.

parente commented 9 years ago

@minad Docker doesn't support the concept of includes in Dockerfiles, unfortunately. You can preprocess Dockerfiles to expand includes into regular commands, but then you can't easily take advantage of the build automation provided by Docker Hub, which only works with pure Dockerfiles.
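
To make the preprocessing idea concrete, here is a minimal sketch of an include expander. The `#include <path>` directive and the `expand_includes` function are hypothetical and not part of Docker. Paths are resolved relative to the template, and nested includes are not handled.

```shell
#!/bin/sh
# Expand hypothetical "#include <path>" lines in a Dockerfile template.
# Docker itself has no such directive; this is preprocessing only.
expand_includes() {
    template=$1
    base_dir=$(dirname "$template")
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            "#include "*)
                # Replace the directive with the fragment's contents.
                cat "$base_dir/${line#"#include "}"
                ;;
            *)
                printf '%s\n' "$line"
                ;;
        esac
    done < "$template"
}
```

For example, `expand_includes Dockerfile.in > Dockerfile` would produce a pure Dockerfile. It would have to be regenerated and committed for Docker Hub to build it, which is the drawback described above.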

To get the ball rolling, I'm suggesting we just define one place for "official" Dockerfiles for stacks to live. I don't think they even need to have a common base to start. We can refactor over time as the set grows.

@rgbkrk When you say migrating, do you mean folding the contents of ipython/docker-notebook into this repo? Or renaming that repo to something like jupyter/docker-stacks and migrating the kitchen sink trynb image there? (These are what I was driving at with option 1 and option 2 in the proposal.)

rgbkrk commented 9 years ago

I'll go ahead and create the jupyter/docker-stacks repo for submitting PRs to.

rgbkrk commented 9 years ago

All set. For now I'd leave this repo the same and make the docker-stacks be separate.

parente commented 9 years ago

That's fine by me. I'll move over to submitting an initial set of PRs for that repo. I agree that it doesn't make sense to disrupt this repo until we see where stacks goes (if anywhere).

ellisonbg commented 9 years ago

With all of the different kernels and language packages, it really does seem like it would be very helpful to have some level of composability for containers. We have been using ansible for deployment (no docker) on our cluster at Cal Poly. What about building a set of discrete ansible scripts (base jupyter + one for each kernel) and then writing a command line tool that consumes yaml config files to generate docker build files? The other benefit is that these same ansible scripts would still work on non-docker deployments.
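
A rough sketch of what such a generator might emit follows. Several things here are assumptions for illustration: the function `generate_dockerfile`, the `playbooks/` folder with one `<name>.yml` playbook per kernel, and a base image that already has ansible installed. To stay short, it takes the kernel list as arguments instead of parsing a yaml file.

```shell
#!/bin/sh
# Print a Dockerfile that applies one ansible playbook per requested kernel.
# Assumptions: the base image ships ansible, and playbooks/<name>.yml exists
# for "base" and for every kernel name passed in.
generate_dockerfile() {
    base_image=$1
    shift
    echo "FROM $base_image"
    echo "COPY playbooks/ /playbooks/"
    # Apply base.yml first, then one playbook per kernel.
    for role in base "$@"; do
        echo "RUN ansible-playbook -i localhost, -c local /playbooks/$role.yml"
    done
}
```

Calling `generate_dockerfile some/ansible-base python r > Dockerfile` would write a Dockerfile with three `RUN` steps, while the same playbooks remain usable on non-docker hosts.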

Other ideas?

parente commented 9 years ago

I was picturing this GitHub repository as being the source for a set of easy-to-consume Docker images that would-be Jupyter users didn't have to build themselves. Rather, users would set up or have ready a Docker host (e.g., with Kitematic or docker-machine) and then run commands like:

docker run -d -P jupyter/scipy-notebook
# or 
docker run -d -P jupyter/r-notebook
# or 
docker run -d -P jupyter/pyspark-notebook
# or 
docker run -d -P jupyter/all-spark-notebook

Or use Kitematic or another UI to accomplish the same. My assumption, of course, is that Docker is becoming more and more ubiquitous.
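
Because `-P` publishes the container's exposed ports on random host ports, a user still has to look up where the notebook landed. A minimal sketch of that lookup follows. It assumes the image exposes the notebook server on container port 8888, and the helper names `host_port_from_mapping` and `run_stack` are made up for this example.

```shell
#!/bin/sh
# Extract the host port from a `docker port` mapping such as "0.0.0.0:32768".
host_port_from_mapping() {
    printf '%s\n' "${1##*:}"
}

# Start a stack image in the background and print a URL for its notebook server.
# Assumption: the image exposes the notebook on container port 8888.
run_stack() {
    image=$1
    container_id=$(docker run -d -P "$image") || return 1
    mapping=$(docker port "$container_id" 8888 | head -n 1) || return 1
    echo "http://localhost:$(host_port_from_mapping "$mapping")"
}
```

With a docker-machine host, `localhost` would need to be replaced by the address that `docker-machine ip` reports for that machine.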

I understand the tradeoff for this "just docker run" simplicity is that if a user wants a R+Python environment, one has to exist or the user has to compose it him/herself. At that juncture, it makes sense to have some tooling to help the user do that composition. I'm just wary of requiring the user to run scripts to generate Dockerfiles to build images him/herself only to then start using Jupyter. It feels like too much work for users initially.

Maybe we're driving at two different offerings?

1. A set of simple pre-built images for common stacks. (walk-up-and-use for beginners)
2. A level of automation to help users compose their own stacks from piece-parts

Taking 2 to the extreme, a web service like http://spark-notebook.io/ but for Jupyter would be cool.

rgbkrk commented 9 years ago

There's a lot less ideology and setup to pitch with docker. It makes sense to give out opinionated stacks, and people have been using the ones we currently have.

Ansible roles and playbooks can exist, but they're a wholly separate entity to me.

ellisonbg commented 9 years ago

Peter, I'm very much in sync with your vision here.

One of the ongoing pain points for Jupyter users is getting non-python kernels installed and running. I am working on a getting started page for the main jupyter docs and I think it would be great to have an image we can recommend to users that has Python, R, Julia, Scala/Spark. I also agree that I don't want users to have to muck around with that stuff. It should just work.

For prebundled container stacks that users can "just run", I think we do want to simply have a repo of those (officially blessed) with good discoverability.

The question of assembling docker images with the right components is something that is more relevant to those administering jupyterhub deployment or wanting to create custom stacks (I find myself in that group). I talked with @rgbkrk a bit today about that and I see two main directions for that:

We create our own simple tool chain that can munge multiple docker build files together. Kyle was -1 on that, I am probably +0.5 because we could do this very quickly.
We bundle each kernel in a separate container and write Jupyter kernelspecs that can start kernels in those containers. I think this is a very solid direction.

@minrk - now that kernelspecs are just a command line program that is run, shouldn't it be pretty simple to write a kernelspec that calls docker run to start a kernel in a container? Any ideas on how to get the connection file back from the inside?
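
One untested way to picture this is sketched below. It sidesteps the question of getting the connection file back out: the host-side connection file is bind-mounted into the container at the same path, and the container shares the host network so the ports listed in that file are reachable. The function name `write_docker_kernelspec` is hypothetical, and the sketch assumes a Linux Docker host, an image with ipykernel installed, and a container user that can read the mounted file.

```shell
#!/bin/sh
# Write a kernelspec whose argv starts the kernel through `docker run`.
# Assumptions: Linux host (--net=host), ipykernel installed in the image,
# and {connection_file} substituted by Jupyter inside each argv entry.
write_docker_kernelspec() {
    spec_dir=$1
    image=$2
    mkdir -p "$spec_dir"
    cat > "$spec_dir/kernel.json" <<EOF
{
  "argv": [
    "docker", "run", "--rm", "--net=host",
    "-v", "{connection_file}:{connection_file}",
    "$image",
    "python", "-m", "ipykernel_launcher", "-f", "{connection_file}"
  ],
  "display_name": "Python (Docker: $image)",
  "language": "python"
}
EOF
}
```

Interrupt signals and cleanup of stopped containers are not handled here, which is one reason a purpose-built launcher may be the better home for this logic.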

minrk commented 9 years ago

I don't think kernelspecs are the level where docker or not should be done, but it is possible to cram it in there. I think a custom KernelManager at the Python level is a better way to implement that (it's the same as Spawner in JupyterHub, Launcher in IPython parallel, etc.). I was working with the Brookhaven folks to rework the remotekernel stuff to stop being custom kernelspecs, and be a custom KernelManager instead, which is a lot cleaner.

parente commented 9 years ago

PR strawman: https://github.com/jupyter/docker-stacks/pull/1

mattwg commented 9 years ago

Adding a "user" perspective here - it was relatively painless to get jupyterhub up and running to use docker-spawner here on our openstack cloud at eBay (Jess was a big help!). However I am now wondering how I can allow users to select a docker container with the specific stack they need - rather than a kitchen sink model with everything thrown in. The awesome flow for me would be to allow users to select a notebook flavor when they sign into the hub. Admins should be able to add more flavors by simply pointing at an image on docker hub which gets installed. I would happily invest time figuring out how to make a docker image if I knew it could be easily deployed to my jupyter-hub and that users would have choices. I like what the http://mybinder.org/ project is doing.

rgbkrk commented 9 years ago

Guess we can now close this since jupyter/docker-stacks now exists.
