Make it possible to configure the base image

choldgraf commented 5 years ago

I feel like we've discussed this a few times but I can't find a specific issue, so:

What if we exposed the ability for CLI users of repo2docker to specify a base image to use instead of the Ubuntu base image? I could see this being useful for:

People who want to use a Rocker- or Neurodocker-style image w/ a really nasty stack, and then build on top of it
People who want to build a more use-case-specific BinderHub deployment (e.g. for pangeo, everybody starts w/ an image that has many basic earth analytics packages and then they can specify on top of this).

@craig-willis if you have thoughts on the above that'd be helpful!

I think the trick here is that we'd need to lay down some clear rules for what would need to be in the image. It'd have to run the same Ubuntu version, and would need to have jupyterhub / server stuff ready to go. Perhaps it could be treated as an "advanced use case, you should know what you're doing" kinda thing.

betatim commented 5 years ago

To support a given base image we'd need a set of adjustments/adaptations. For example it could be that instead of apt-get you need to use rpm. Or that the names of the packages to install change slightly.

How about having different hierarchies of buildpack classes? You'd have a NeuroDockerPythonBuildPack which works with the NeuroDocker base image, etc.

I think having a second, third, n-th well defined stack of build packs for a particular base image has a higher likelihood of working than allowing people to use arbitrary base images by just switching the image named in the FROM statement. If you really want to use arbitrary bases you can already do this by using a Dockerfile. I am not sure how many users of this functionality are out there.

choldgraf commented 5 years ago

interesting idea - one thought: do we currently have a way for people to define a buildpack without merging it directly into repo2docker? I wonder if we could let people define new buildpacks and point to them as a part of a repo2docker build. Neurodocker, for example, is basically doing the same thing as repo2docker (it's a command line tool where you say "I want these packages installed" and it builds a Dockerfile with the relevant lines in there)

betatim commented 5 years ago

We could explore how to use entrypoints to allow for plugins/extensions. I think that would be interesting both for build packs and content providers!
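A minimal sketch of what entrypoint-based discovery could look like, using only the standard library's importlib.metadata. The group name "repo2docker.buildpacks" is an assumption made up for illustration; nothing in this thread says repo2docker defines such a group.

```python
from importlib.metadata import entry_points

# Hypothetical entry point group; not an existing repo2docker API.
BUILDPACK_GROUP = "repo2docker.buildpacks"


def discover_buildpacks(group=BUILDPACK_GROUP):
    """Return {name: entry point} for every plugin registered under `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: the result supports .select(group=...)
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: a plain dict keyed by group name
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


if __name__ == "__main__":
    # Each entry point can be imported lazily with ep.load().
    for name, ep in discover_buildpacks().items():
        print(name, "->", ep.value)
```

An external package (say, one shipping a Rocker or Pangeo buildpack) would register its class under that group in its packaging metadata, and the core tool would only import it via ep.load() when needed.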

craig-willis commented 5 years ago

I came across a couple of discussions about supporting different distros in https://github.com/jupyter/repo2docker/issues/166 and pinning the distro/base image in https://github.com/jupyter/repo2docker/issues/170.

I think I need to clarify what we're talking about when we say "base image", because there are at least two cases.

Base image is a child of Ubuntu but with some domain/use-case specific packages
Root image is another distro (e.g., Debian, CentOS)

One possible addition/expansion is supporting other versions (i.e., a use case that must have trusty).

It'd have to run the same Ubuntu version, and would need to have jupyterhub / server stuff ready to go.

Does repo2docker require the JupyterHub/server packages, or is this a requirement for BinderHub specifically? Must this be a hard requirement for every Dockerfile produced by repo2docker? Or rather, must every Dockerfile produced by repo2docker be potentially runnable on BinderHub (I think "yes").

If you really want to use arbitrary bases you can already do this by using a Dockerfile

Yes, but repo2docker provides a powerful and simple way for users to create Dockerfiles without needing to know how to. If I need to support a community with a hard CentOS requirement, an alternative is to fork repo2docker or write my own.

I think having a second, third, n-th well defined stack of build packs for a particular base image has a higher likelihood of working than allowing people to use arbitrary base images.

I agree -- and the communities that need them can help define the requirements for the stack. The constraints that are in place today in repo2docker have still provided plenty of flexibility.

choldgraf commented 5 years ago

The buildpacks idea is nice to me - I suppose that's the primary extension mechanism of r2d already anyway.

re: other distributions of linux, I think it'd be tricky from a testing perspective but can definitely see the potential benefit to other groups that can't choose their OS.

re: jupyterhub, I believe that repo2docker just ensures that there's a jupyter server default command at the end https://github.com/jupyter/repo2docker/blob/694e728ffd33ef589417e82bd1988e1f8a099fa8/repo2docker/buildpacks/base.py#L145. This doesn't mean jupyterhub per se (though it installs jupyterhub by default so that it could work w/ jupyterhub if needed). (somebody correct me if I'm wrong here)

yuvipanda commented 5 years ago

I like https://github.com/binder-examples/rocker as a pattern we can emulate. @choldgraf does, say, neurodebian already have a set of docker images it maintains? If so we can maybe work to add a binder base image there

choldgraf commented 5 years ago

@yuvipanda I think that in the short term this is a good solution - treat it as a "sort of advanced" use-case but provide docs to show how it's done. Then if it's done often enough and in a repeatable way, consider how to build it into a non-Dockerfile-based pattern. WDYT?

craig-willis commented 5 years ago

This is a great example to illustrate one of the Whole Tale project primary use cases. For context, in Whole Tale, we'd like to use repo2docker but we aren't running JH or BH. Today, we support running both Jupyter and Rstudio images directly.

Ideally, we'd be able to have a base RStudio image such as rocker/geospatial:3.5.0 (or rocker/binder) and allow users to add OS and R packages via the standard buildpacks (e.g., apt.txt, install.R, etc). A reason for us to adopt repo2docker is specifically to avoid users needing to create Dockerfiles.

The rocker example is great, and in essence we want to support every RStudio user this way with standard buildpack support.

craig-willis commented 5 years ago

As discussed in https://github.com/whole-tale/whole-tale/issues/52, I've started integration of repo2docker into the Whole Tale system as a primary image build mechanism. In doing so, I now have a clearer idea of how the Rocker images fit into this discussion and provide a good example of the potential for this capability.

I've written up some notes in a Google doc for comment based largely on this thread: https://docs.google.com/document/d/14VaD5Z-M_sRdIZvuWsuOWpavR_lRv2BxWkZNS7gFEgY/edit#

I've hacked together a template-based proof-of-concept for discussion, if interested:

https://github.com/craig-willis/repo2docker/pull/1

/cc @cboettig @karthik

jhamman commented 5 years ago

Chiming in here from the Pangeo perspective. We've recently found ourselves working around a few repo2docker challenges where configuring the base image would be really helpful. A few examples of what we want to do:

We want to build large conda-based images. Across the pangeo project, most of our environments are nearly identical, just adding a few domain specific packages here and there (read: lots of duplication). Building these images generally takes longer than we'd like (mostly conda's fault but we may be able to skip some of the slow steps).
We want to share environments between Binder and Jupyterhub. It would be convenient to store an image on dockerhub (or similar) and use it in both binder/jupyterhub. For complex images, this would really help clarify the differences (or lack thereof) between services.

We have recently been trying out two approaches that touch on these points:

pangeo-stacks is a repository of curated docker images for use in the pangeo project. We use repo2docker to build the images but at this point, there isn't any hierarchical relationship between the images.
We then use published images from pangeo-stacks in binder and in jupyterhub. We're using dockerfiles for this. This solves one problem but means extending the images significantly requires working in dockerfiles directly.

I'll throw out a concept for how repo2docker could handle these use cases better.

repo2docker creates/maintains some sort of base image for each build pack. This may actually be a nice way of separating some of the idiosyncrasies of each build pack from the core functions of the tool.
repo2docker adds a config file (#166) that allows setting the base image (among other things).
repo2docker would set default values for these base images to that of the official repo2docker image but anyone could override this value with any image that met some reasonably strict criteria. For example, you could enforce that the base image needs to inherit from the base image for that buildpack.

cc @rabernat, @betatim, and @fmaussion who joined in on the gitter chat this morning.

betatim commented 5 years ago

One fundamental question is the one of who should get to choose which base image to use: the repository (via a config file) or the entity invoking repo2docker (via a command-line argument). It seems like the latter would be less useful than the former. Maybe it is time to give up the resistance to inventing a new config file called repo2docker.yaml :-/

I like the idea of restricting which base images you can use. The motivation for this is to allow base images to be configurable and to keep the convenience of repo2docker based image building when using them, instead of requiring users to write (short) Dockerfiles to do so.

If we allow arbitrary base images we'd not gain much IMHO as users who choose (say) an alpine linux base image would be back to square one in terms of complexity to understand why that base image doesn't work. I like the idea of restricting the set of possible base images to images built by repo2docker. It would alleviate a lot of worries I have about how allowing this would not make users' lives easier because of hard to debug incompatibilities.

How would we determine if an image was built by repo2docker? Not sure, but maybe we can check the LABELs applied to see if there is a repo2docker label. I like the idea of checking for labels as this means how the image is actually created doesn't matter. You could create it without ever invoking repo2docker and apply the label to "certify" that this image is a valid repo2docker base image.
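As a rough sketch of the label check, assuming the image's labels have already been read into a dict (for instance from the Config.Labels field that docker image inspect reports). The label key used here is an assumption chosen for illustration, not an agreed "certification" label.

```python
# Hypothetical label key used to "certify" a base image; an assumption for illustration.
CERTIFYING_LABEL = "repo2docker.version"


def is_certified_base(labels, required=CERTIFYING_LABEL):
    """Return True if the label mapping carries a non-empty certifying label."""
    if not labels:
        # Images without any labels may report None or an empty mapping.
        return False
    return bool(labels.get(required))


if __name__ == "__main__":
    print(is_certified_base({"repo2docker.version": "0.8.0"}))
    print(is_certified_base({"maintainer": "someone"}))
```

Because only the label is inspected, how the image was produced doesn't matter, which matches the "apply the label to certify" idea above.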

Overall I like the idea.

I think that this proposal doesn't let WholeTale do what they want to do which is start from arbitrary base images that were explicitly not constructed by repo2docker (e.g. the rocker images). At the minimum we'd have to add labels to the rocker image to certify it as repo2docker-base-image-compatible.

craig-willis commented 5 years ago

Great discussion!

Based on the earlier discussions, the approach I'm taking with a Whole Tale proof-of-concept is to add a RockerBuildPack that changes some of the base template to work with the rocker community images (FROM and some minor debianisms). I think the notion of "arbitrary base images" is something we can let go of, preferring to support "community curated images" (Rocker, NeuroDebian, Pangeo?) -- but these won't necessarily have been built by repo2docker. One thing I liked about @betatim's earlier idea of requiring a buildpack or buildpack hierarchy is that it forces some level of commitment to implement. It sounds like this proposal adds more flexibility -- a user can select any compatible base image.

At the minimum we'd have to add labels to the rocker image to certify it as repo2docker-base-image-compatible.

I can't imagine this would be a problem, but it would require the maintainers to buy in. Although very minor, there are differences in the base template required by Debian that I'd also need to address somehow.

One fundamental question is the one of who should get to choose which base image to use

For our proof-of-concept, the user selects the default "environment" (WT terms) which equates to selecting the default buildpack for their repo. I did initially implement it as a flag on repo2docker because we have a way of storing the additional configuration outside of the repo (or "workspace" in WT land), but a repo2docker.yml would work just as well.

I'm actively working on the WT side of things now and will return to repo2docker soon, if there's interest in collaborating.

jhamman commented 5 years ago

Would people like to see a prototype conda-buildpack that implements some of these ideas? I think we can knock that out in the next few weeks and be ready to discuss by the next monthly jupyter team meeting.

choldgraf commented 5 years ago

@jhamman just chiming in a bit late, but I'd love to see people playing around with these ideas :-)

I like the idea of restricting the set of possible base images to images built by repo2docker.

I agree this would make the whole thing a lot simpler.

yuvipanda commented 5 years ago

I personally love what @craig-willis is doing with RockerBuildPack, and think extensions to repo2docker are the way forward. My blocking concern with making it configurable from inside repo2docker is that this will cause extremely hard to debug issues that will make maintaining repo2docker very hard.

If we implement this using the extension mechanism instead, the workflow admins would follow is:

pip install repo2docker
pip install repo2docker-pangeo

Since this will add a buildpack, it can detect it should use the pangeo buildpack, and do whatever it needs to do - even if all it does is change the base image. But if you want a common conda install, you probably aren't going to just change the base image, since that means we'll re-install everything! You would probably set up something a lot more custom...

I am going to spend a couple hours today trying to prototype this with the PANGEO stacks images, and report back.

yuvipanda commented 5 years ago

Alright, I've a fully working prototype based on the pangeo stack! There's a functional README in https://github.com/yuvipanda/repo2docker-pangeo. Try it out and let me know what you think. It currently requires a repo2docker_config.py file that's 2 lines long, but we can probably build a discovery mechanism that removes the need for that. The entire code for implementing this is 77 lines long as well.

This is one approach to having specialized plugins - for PANGEO, Rocker, etc. I like this because it gives the power to maintain the plugin directly to the people who are maintaining the specific base images. It also gives them the responsibility, thus reducing burden on core repo2docker itself - both from a maintainer and code complexity perspective.

This keeps the power of which base images can be used (without a Dockerfile) with the people who are running repo2docker. I'm experimenting with a different approach that gives that power to the people who are making the repositories, using ONBUILD. I'll play with it a bit more and put up a prototype.

betatim commented 5 years ago

A comment from a discussion on discourse: Peter pointed us to https://buildpacks.io/

I think repo2docker already has a lot of the ideas that are in pack and there are worse options than copying something else that is popular :) (In my maintainer mind I am already wondering how we can retire repo2docker or make it a thin layer on top of pack ... because having to maintain less code is always better)

One thing I like about the pack tool is that a buildpack decides which base image it uses. This goes along the lines of @craig-willis's example of an extra build pack that chooses a different base image.

We lose composability, or at least it needs careful thinking when creating a new build pack about whether it can still be composed with others or not. This gives rise to the idea of "stacks of buildpacks".

TL;DR Right now I am in favour of "build packs choose their base image", "build packs decide which stack they are in" and "use entrypoints to allow external packages to contribute buildpacks".

Question (after a quick browse of your code @yuvipanda): My impression is that you implement what I wrote in my TL;DR except for using entrypoints. Instead you insert yourself at the top of the build pack search path via some config magic.

betatim commented 5 years ago

This keeps the power of which base images can be used (without a Dockerfile) with the people who are running repo2docker. I'm experimenting with a different approach that gives that power to the people who are making the repositories, using ONBUILD. I'll play with it a bit more and put up a prototype.

Doesn't your prototype already let creators of repos choose the base image via what they write in npangeo-stack?

yuvipanda commented 5 years ago

@betatim:

Question (after a quick browse of your code @yuvipanda): My impression is that you implement what I wrote in my TL;DR except for using entrypoints. Instead you insert yourself at the top of the build pack search path via some config magic.

Oh absolutely, this isn't a new idea at all. I think a bunch of us also talked about it in a team meeting a few months ago when @craig-willis was there. Just new code. Entrypoints is the next step, but this already works with released repo2docker so makes for a nice demo.

<3 to everyone in this thread for hashing out and moving towards a good set of solutions to a very complex problem!

yuvipanda commented 5 years ago

@betatim

Doesn't your prototype already let creators of repos choose the base image via what they write in npangeo-stack?

Nope it does not. It constrains them to only choosing from PANGEO images. This lets the buildpack make assumptions, such as:

conda is already installed
REPO_DIR is set to something reasonable
And a lot more as we add more features there. For example, we can see which environment variables are available. The base image can also already be pre-composed to have Julia and R, and then we can do different things based on which base image is specified.

yuvipanda commented 5 years ago

TL;DR Right now I am in favour of "build packs choose their base image", "build packs decide which stack they are in" and "use entrypoints to allow external packages to contribute buildpacks".

+1. We need to figure out a way to deal with ordering when inserted via entrypoints, but that's doable.

yuvipanda commented 5 years ago

@betatim I <3 buildpacks.io. A lot of it is straight from s2i, which was what the very first versions of repo2docker were based off of. I wrote https://github.com/yuvipanda/words/blob/master/content/post/why-not-s2i.md at that time when we switched away. TLDR is composability.

yuvipanda commented 5 years ago

https://github.com/yuvipanda/pangeo-stack-onbuild is the other prototype, where stack authors make -onbuild variants of their images. This lets users directly specify which (supported) image they wanna use, and empowers stack authors to support whatever files they wanna support.

This works today on mybinder.org, once I wait for my push of this onbuild image to complete...

yuvipanda commented 5 years ago

https://mybinder.org/v2/gh/yuvipanda/pangeo-stack-onbuild/master works!

It is based off the base-notebook image from PANGEO stack, but lets users customize it simply with an environment.yml file in the repo directory. It also works with all binders right now, without any customization needed on the part of the operators.

betatim commented 5 years ago

Oh absolutely, this isn't a new idea at all.

My comment wasn't in the spirit of "how lame, this ain't a new idea", I wanted to double check that I hadn't missed anything and that my impression was correct.

Nope it does not. It constrains them to only choosing from PANGEO images.

Ah yes. I don't think this is a drawback, more a feature because you said: buildpacks make assumptions about the base image.

We need to figure out a way to deal with ordering when inserted via entrypoints, but that's doable.

Hopefully we can find a simple way for this and construct extra buildpacks so that they play nice with each other (keep the triggers separate so that order doesn't matter so much) and definitely play nice with the base buildpacks. Seems like something we should write down as part of the "entrypoints contract". Something like "you should follow these guidelines and if you don't we can't offer any support to you or users of your buildpack". A bit like we do with Dockerfiles right now: if you use one you are on your own.
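To illustrate why separate triggers make ordering matter less, here is a small sketch of first-match detection over an ordered list of buildpacks. The classes and the trigger file name stack.cfg are hypothetical stand-ins, not repo2docker's real buildpack classes.

```python
class CondaPack:
    """Stand-in buildpack triggered by environment.yml."""
    name = "conda"

    def detect(self, files):
        return "environment.yml" in files


class StackPack:
    """Stand-in external buildpack with its own (hypothetical) trigger file."""
    name = "stack"

    def detect(self, files):
        return "stack.cfg" in files


def pick_buildpack(buildpacks, files):
    """Return the first buildpack in list order whose detect() matches, else None."""
    for bp in buildpacks:
        if bp.detect(files):
            return bp
    return None


if __name__ == "__main__":
    packs = [CondaPack(), StackPack()]
    # Only one trigger present: the same pack wins in either order.
    print(pick_buildpack(packs, {"environment.yml"}).name)
    print(pick_buildpack(packs[::-1], {"environment.yml"}).name)
    # Both triggers present: the winner now depends on list order.
    print(pick_buildpack(packs, {"environment.yml", "stack.cfg"}).name)
    print(pick_buildpack(packs[::-1], {"environment.yml", "stack.cfg"}).name)
```

As long as each extra buildpack keys off a file nobody else claims, the insertion order only matters for repos that deliberately mix triggers, which is the case an "entrypoints contract" would need to spell out.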

TLDR is composability.

Nods. I was interested to see that https://buildpacks.io/docs/using-pack/building-app/ (scroll down to the picture) makes me think that pack can now compose build packs. If we want to chat about this we should probably fork a new thread from this comment on discourse instead of yakking on in this issue.

yuvipanda commented 5 years ago

I'm working with the PANGEO folks to help implement some of this. See https://github.com/pangeo-data/pangeo-stacks/pull/27 for the PR.

You can see a demo here:

https://gist.github.com/yuvipanda/2f139c912f1a4c584bf7e719961a3d02 loaded in mybinder.org as:

https://mybinder.org/v2/gist/yuvipanda/2f139c912f1a4c584bf7e719961a3d02/master

works great! much faster too.

iwilltry42 commented 4 years ago

Hey there, I just dug up this issue while searching for a solution for our very specific problem:

we're providing a JupyterHub + BinderHub setup to data scientists in our company
in their singleuser environments (either created via JupyterHub or built via BinderHub), they need to connect to our internal Hadoop Cluster (HDFS)
for this connection a lot of configuration has to be set and the correct versions of Hadoop tooling have to be present
this is mostly stuff that cannot be defined in requirements files that repo2docker supports and would rather require a bunch of scripts
even if we'd manage to install it with a buildpack, it would take ages as it has to download all the specific packages and configuration files

I think there were some really good options in this thread, which could solve this issue. But now there hasn't been an update in a year, this issue is still open and I'm not sure if there was ever a clearly defined way to go.. :thinking:

Can you help me out there? :)

betatim commented 4 years ago

I think this is a contentious issue with many similar but slightly different use-cases. People want to change the base image but for different reasons/to achieve different goals. I think we need to divide and conquer to make any progress.

For your particular problem I'd suggest the following (and I think I'd be happy to merge a PR implementing it but others might have other opinions).

A side comment which might solve your problem or not: there is the option to add an "appendix" to every repo being built: https://repo2docker.readthedocs.io/en/latest/usage.html#cmdoption-jupyter-repo2docker-appendix BinderHub can also specify one. Maybe this is enough already? As the name suggests it is an appendix, not a prependix so you can only do stuff that fits with being done at the end.

Adding a CLI flag that sets the name of the base image might solve your particular problem. You could make your own base image (somehow), publish it and then build all repo2docker images on top of that. I'm thinking of something that gets literally pasted into https://github.com/jupyter/repo2docker/blob/023e577eee68d5567ddf783a56ac32d44fd5b64c/repo2docker/buildpacks/base.py#L17. This would give the "owner" of the repo2docker process full control and responsibility to provide a base image that will work. The fact that not every base image will work with repo2docker is (for me) the main blocker to making this functionality widespread.
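A minimal sketch of the "literally pasted into the FROM line" idea. The flag name --base-image, the default image, and the one-line template are all assumptions for illustration; this is not repo2docker's actual CLI or template.

```python
import argparse

# Assumed default; the real template's base image may differ.
DEFAULT_BASE_IMAGE = "buildpack-deps:bionic"


def render_dockerfile(base_image=DEFAULT_BASE_IMAGE):
    """Paste the base image name verbatim into the FROM line of a stub template."""
    return "FROM {}\n# ... remaining build steps unchanged ...\n".format(base_image)


def parse_args(argv=None):
    """Parse a hypothetical --base-image flag."""
    parser = argparse.ArgumentParser(description="base image sketch")
    parser.add_argument(
        "--base-image",
        default=DEFAULT_BASE_IMAGE,
        help="image to build on; whoever sets this is responsible for it working",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    print(render_dockerfile(parse_args().base_image))
```

Nothing here validates the image, which reflects the trade-off described above: the person invoking the tool gets full control and also full responsibility.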

What do you think? (if we want to discuss in more detail maybe we should make a new thread for this specific idea)

jgwerner commented 4 years ago

@betatim @yuvipanda this is really interesting since it could allow users to build images in layers (which of course has other benefits), where they could build one image from another image. Our use-case is to replicate the build mechanism within the jupyter/docker-stacks repo. If a user wants an image with more stuff, then repo2docker works great but the user may have a giant image once they are done.

With this in mind, we could run:

jupyter-repo2docker https://github.com/norvig/pytudes \
    --image-name foo/bar

jupyter-repo2docker https://github.com/norvig/pytudes \
    --build-args BASE_IMAGE=foo/bar \
    --image-name bizz/bazz

For example, with #909 updating the base image to ubuntu 20.04 may work great for most if not all packages but it could very well be the case that the base image needs to be another version of ubuntu or even another base image altogether for package xyz to work correctly.

Another benefit is that if one were to select a more specific image from buildpack-deps, for example, then the user could remove some/all packages with the apt.txt file since they would be included in the base image already.

If there is something we could do to help with this effort let us know (poc, draft wip, etc)!

trybik commented 4 years ago

@betatim I'll throw in our use case to support this, mostly for consultation. We want users (effectively the jovian user) to be able to modify the runtime environment; in particular, to run sudo apt-get update/install ... over Terminal in the Notebook Server session but also maybe install other scientific libraries which don't easily support user-level installation. Being able to throw in our own image would allow us to add jovian to /etc/sudoers, ignoring that: a) the user is created only later, and b) ${NB_USER} is configurable.

Please do correct me if I'm wrong, but, although being able to toy around w/ system packages in a running environment seems like a fairly standard request, it seems no other approach covers this case, except for using explicit Dockerfile, which I can't use here for other reasons.

manics commented 4 years ago

Would a new config file like post-build-admin (or even pre-build-admin?) that runs a script as root solve some problems without the complexity of a new base image?

trybik commented 4 years ago

Would a new config file like post-build-admin (or even pre-build-admin?) that runs a script as root solve some problems without the complexity of a new base image?

Yes, it would; post-build-admin would be better because user would be setup already and I could probably somehow dynamically grab its name (how? some env var?).

trybik commented 4 years ago

Would a new config file like post-build-admin (or even pre-build-admin?) that runs a script as root solve some problems without the complexity of a new base image?

Yes, it would; post-build-admin would be better because user would be setup already and I could probably somehow dynamically grab its name (how? some env var?).

Note: I'm aware this would imply whole bunch of bad usage practices like e.g. discussed in https://github.com/jupyterhub/repo2docker/issues/192

manics commented 4 years ago

Note: I'm aware this would imply whole bunch of bad usage practices like e.g. discussed in #192

That's true, but at least all the bad practices are confined to one file so it's easy to know where to find them when dealing with a broken image, and we could say post-build-admin has the same support level as Dockerfile, i.e. not much.
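A rough sketch of what such a step could emit into the generated Dockerfile, assuming the build already defines NB_USER and that the script sits in the working directory. The file name post-build-admin and this function are hypothetical; no such feature exists in repo2docker as far as this thread shows.

```python
def render_admin_step(script="post-build-admin"):
    """Return Dockerfile lines that run `script` as root, then drop privileges again."""
    return "\n".join(
        [
            "USER root",
            # NB_USER is assumed to be defined earlier in the Dockerfile,
            # so the script can read it from the environment.
            "RUN chmod +x ./{0} && NB_USER=${{NB_USER}} ./{0}".format(script),
            "USER ${NB_USER}",
        ]
    )


if __name__ == "__main__":
    print(render_admin_step())
```

Keeping the root-level work in one clearly named file is what makes it easy to point at when an image breaks.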

trybik commented 3 years ago

Just bookkeeping another case for postBuildAdmin: https://gitter.im/jupyterhub/binder?at=5fce35b6fb7f155587ad481a (this one could be also covered by preBuildAdmin).

holzman commented 2 years ago

One important use case: with dockerhub pull limits, it'd be very useful to redirect the build image to a registry cache. We can support this by just setting the build image via ARG. If that's too controversial, would there be any issues with supporting setting a prefix for the build image?
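The prefix variant could be as small as the sketch below. The function name and the example registry host are made up for illustration, and how a given cache maps image names (for example whether official images need a library/ namespace) is deployment-specific.

```python
def with_registry_prefix(image, prefix=""):
    """Prepend a registry prefix to an image reference; empty prefix is a no-op."""
    if not prefix:
        return image
    return prefix.rstrip("/") + "/" + image


if __name__ == "__main__":
    # cache.example.org is a placeholder host, not a real registry.
    print(with_registry_prefix("buildpack-deps:bionic", "cache.example.org/dockerhub"))
```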

westurner commented 1 year ago

From https://github.com/jupyterhub/repo2docker/commit/20b081525785fdc6ac524e06150028796b8787ec :

👍

Is there a list of base images?

Is there a reference RPM-based base image?

yuvipanda commented 1 year ago

RPM based images are not supported. I would probably say that only Ubuntu versions are supported. And in general if it breaks, you kinda get to keep the pieces.

jupyterhub / repo2docker

Make it possible to configure the base image #487