choldgraf closed this issue 1 year ago
To support a given base image we'd need a set of adjustments/adaptations. For example, instead of apt-get you might need to use rpm, or the names of the packages to install might change slightly.
How about having different hierarchies of buildpack classes? You'd have a NeuroDockerPythonBuildPack which works with the NeuroDocker base image, etc.
I think having a second, third, n-th well-defined stack of buildpacks for a particular base image has a higher likelihood of working than allowing people to use arbitrary base images by just switching the image named in the FROM statement. If you really want to use an arbitrary base you can already do this by using a Dockerfile. I am not sure how many users of this functionality are out there.
Interesting idea. One thought: do we currently have a way for people to define a buildpack without merging it directly into repo2docker? I wonder if we could let people define new buildpacks and point to them as part of a repo2docker build. Neurodocker, for example, is basically doing the same thing as repo2docker (it's a command-line tool where you say "I want these packages installed" and it builds a Dockerfile with the relevant lines in there).
We could explore how to use entrypoints to allow for plugins/extensions. I think that would be interesting both for build packs and content providers!
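As a sketch of the entrypoints idea: a plugin package could register its buildpacks under a shared entry-point group, and repo2docker could discover them at startup. The group name "repo2docker.buildpacks" below is purely an assumption for illustration; no such plugin group exists in repo2docker today.

```python
# Hypothetical sketch: discover third-party buildpacks registered by
# installed plugin packages via entry points. The group name is an
# assumption, not an existing repo2docker API.
from importlib.metadata import entry_points

def discover_buildpacks(group="repo2docker.buildpacks"):
    """Return buildpack classes contributed by installed plugins."""
    try:
        eps = entry_points(group=group)      # Python 3.10+ selection API
    except TypeError:
        eps = entry_points().get(group, [])  # older importlib.metadata
    return [ep.load() for ep in eps]

# With no plugins installed this simply returns an empty list.
print(discover_buildpacks())
```

A plugin would then only need a `[project.entry-points."repo2docker.buildpacks"]` section in its packaging metadata to be picked up, without touching the repo2docker repo itself.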
I came across a couple of discussions about supporting different distros in https://github.com/jupyter/repo2docker/issues/166 and pinning the distro/base image in https://github.com/jupyter/repo2docker/issues/170.
I think I need to clarify what we're talking about when we say "base image", because there are at least two cases.
One possible addition/expansion is supporting other versions (e.g., a use case that must have trusty).
It'd have to run the same Ubuntu version, and would need to have jupyterhub / server stuff ready to go.
Does repo2docker require the JupyterHub/server packages, or is this a requirement for BinderHub specifically? Must this be a hard requirement for every Dockerfile produced by repo2docker? Or rather, must every Dockerfile produced by repo2docker be potentially runnable on BinderHub (I think "yes")?
If you really want to use arbitrary bases you can already do this by using a Dockerfile
Yes, but repo2docker provides a powerful and simple way for users to create Dockerfiles without needing to know how. If I need to support a community with a hard CentOS requirement, the alternative is to fork repo2docker or write my own.
I think having a second, third, n-th well defined stack of build packs for a particular base image has a higher likelihood of working than allowing people to use arbitrary base images.
I agree -- and the communities that need them can help define the requirements for the stack. The constraints that are in place today in repo2docker have still provided plenty of flexibility.
The buildpacks idea is nice to me - I suppose that's the primary extension mechanism of r2d already anyway.
re: other distributions of linux, I think it'd be tricky from a testing perspective but can definitely see the potential benefit to other groups that can't choose their OS.
re: jupyterhub, I believe that repo2docker just ensures that there's a default Jupyter server command at the end: https://github.com/jupyter/repo2docker/blob/694e728ffd33ef589417e82bd1988e1f8a099fa8/repo2docker/buildpacks/base.py#L145. This doesn't mean JupyterHub per se (though it installs jupyterhub by default so that it could work with JupyterHub if needed). (Somebody correct me if I'm wrong here.)
I like https://github.com/binder-examples/rocker as a pattern we can emulate. @choldgraf does, say, neurodebian already have a set of docker images it maintains? If so we can maybe work to add a binder base image there
@yuvipanda I think that in the short term this is a good solution - treat it as a "sort of advanced" use-case but provide docs to show how it's done. Then if it's done often enough and in a repeatable way, consider how to build it into a non-Dockerfile-based pattern. WDYT?
This is a great example to illustrate one of the Whole Tale project's primary use cases. For context, in Whole Tale we'd like to use repo2docker, but we aren't running JupyterHub or BinderHub. Today, we support running both Jupyter and RStudio images directly.
Ideally, we'd be able to have a base RStudio image such as rocker/geospatial:3.5.0 (or rocker/binder) and allow users to add OS and R packages via the standard buildpacks (e.g., apt.txt, install.R, etc.). A reason for us to adopt repo2docker is specifically to avoid users needing to create Dockerfiles. The rocker example is great, and in essence we want to support every RStudio user this way with standard buildpack support.
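To illustrate the workflow described above (a hypothetical repo; the package names are just examples), a user on a rocker-style base would customize their environment with the existing repo2docker config files rather than a Dockerfile:

```
apt.txt
libgdal-dev
libudunits2-dev
```

```
install.R
# Extra R packages layered on top of the base image
install.packages(c("sf", "leaflet"))
```

The point being that only the base image changes; the user-facing configuration surface stays exactly what repo2docker already supports.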
As discussed in https://github.com/whole-tale/whole-tale/issues/52, I've started integration of repo2docker into the Whole Tale system as a primary image build mechanism. In doing so, I now have a clearer idea of how the Rocker images fit into this discussion and provide a good example of the potential for this capability.
I've written up some notes in a Google doc for comment based largely on this thread: https://docs.google.com/document/d/14VaD5Z-M_sRdIZvuWsuOWpavR_lRv2BxWkZNS7gFEgY/edit#
I've hacked together a template-based proof-of-concept for discussion, if interested:
https://github.com/craig-willis/repo2docker/pull/1
/cc @cboettig @karthik
Chiming in here from the Pangeo perspective. We've recently found ourselves working around a few repo2docker challenges where configuring the base image would be really helpful. A few examples of what we want to do:
We have recently been trying out two approaches that touch on these points:
I'll throw out a concept for how repo2docker could handle these use cases better.
cc @rabernat, @betatim, and @fmaussion who joined in on the gitter chat this morning.
One fundamental question is who should get to choose which base image to use: the repository (via a config file) or the entity invoking repo2docker (via a command-line argument). It seems like the latter would be less useful than the former. Maybe it is time to give up the resistance to inventing a new config file called repo2docker.yaml :-/
I like the idea of restricting which base images you can use. The motivation is to make base images configurable while keeping the convenience of repo2docker-based image building, instead of requiring users to write (short) Dockerfiles to do so.
If we allow arbitrary base images we'd not gain much IMHO as users who choose (say) an alpine linux base image would be back to square one in terms of complexity to understand why that base image doesn't work. I like the idea of restricting the set of possible base images to images built by repo2docker. It would alleviate a lot of worries I have about how allowing this would not make users lives easier because of hard to debug incompatibilities.
How would we determine if an image was built by repo2docker? Not sure, but maybe we can check the LABELs applied to see if there is a repo2docker label. I like the idea of checking for labels as this means how the image is actually created doesn't matter. You could create it without ever invoking repo2docker and apply the label to "certify" that this image is a valid repo2docker base image.
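A minimal sketch of that label check might look like the following. The label prefix "repo2docker." is an assumption for illustration; the exact labels repo2docker applies to its images may differ.

```python
# Sketch of the "certify via LABEL" idea: inspect the labels dict from
# `docker inspect` and look for a repo2docker-style label. The
# "repo2docker." prefix is an assumption, not a verified label name.
def is_repo2docker_image(labels):
    """Guess whether an image was built (or "certified") by repo2docker,
    given its Labels mapping. Handles a missing/None mapping."""
    return any(key.startswith("repo2docker.") for key in (labels or {}))

print(is_repo2docker_image({"repo2docker.version": "2023.06.0"}))  # certified
print(is_repo2docker_image({"maintainer": "rocker"}))              # not certified
```

Note this check matches the suggestion above: a third party (like the Rocker maintainers) could add the label manually without ever running repo2docker.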
Overall I like the idea.
I think that this proposal doesn't let WholeTale do what they want to do which is start from arbitrary base images that were explicitly not constructed by repo2docker (e.g. the rocker images). At the minimum we'd have to add labels to the rocker image to certify it as repo2docker-base-image-compatible.
Great discussion!
Based on the earlier discussions, the approach I'm taking with a Whole Tale proof-of-concept is to add a RockerBuildPack that changes some of the base template to work with the rocker community images (FROM and some minor Debianisms). I think the notion of "arbitrary base images" is something we can let go of in favor of supporting "community curated images" (Rocker, NeuroDebian, Pangeo?) -- but these won't necessarily have been built by repo2docker. One thing I liked about @betatim's earlier idea of requiring a buildpack or buildpack hierarchy is that it forces some level of commitment to implement. It sounds like this proposal adds more flexibility -- a user can select any compatible base image.
At the minimum we'd have to add labels to the rocker image to certify it as repo2docker-base-image-compatible.
I can't imagine this would be a problem, but it would require the maintainers to buy in. Although very minor, there are differences in the base template required by Debian that I'd also need to address somehow.
One fundamental question is the one of who should get to choose which base image to use
For our proof-of-concept, the user selects the default "environment" (WT terms), which equates to selecting the default buildpack for their repo. I did initially implement it as a flag on repo2docker because we have a way of storing the additional configuration outside of the repo (or "workspace" in WT land), but a repo2docker.yml would work just as well.
I'm actively working on the WT side of things now and will return to repo2docker soon, if there's interest in collaborating.
Would people like to see a prototype conda-buildpack that implements some of these ideas? I think we can knock that out in the next few weeks and be ready to discuss by the next monthly jupyter team meeting.
@jhamman just chiming in a bit late, but I'd love to see people playing around with these ideas :-)
I like the idea of restricting the set of possible base images to images built by repo2docker.
I agree this would make the whole thing a lot simpler.
I personally love what @craig-willis is doing with RockerBuildPack, and think extensions to repo2docker are the way forward. My blocking concern with making it configurable from inside repo2docker is that this will cause extremely hard-to-debug issues that will make maintaining repo2docker very hard.
If we implement this using the extension mechanism instead, the workflow admins would follow is:
pip install repo2docker
pip install repo2docker-pangeo
Since this adds a buildpack, repo2docker can detect that it should use the Pangeo buildpack and do whatever it needs to do, even if all it does is change the base image. But if you want a common conda install, you probably aren't going to just change the base image, since that means we'll re-install everything! You would probably set up something a lot more custom...
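For a sense of what such a plugin buildpack might contain, here is a heavily simplified sketch. The BuildPack base class below is a stand-in, not repo2docker's real class; the `pangeo-stack` config file name comes from the prototype discussed in this thread, and the `pangeo/base-notebook` repository name is an assumption for illustration.

```python
import os

class BuildPack:
    """Stand-in for repo2docker's real BuildPack base class (simplified)."""
    base_image = "buildpack-deps:bionic"

    def __init__(self, repo_dir):
        self.repo_dir = repo_dir

    def detect(self):
        return False

class PangeoBuildPack(BuildPack):
    """Hypothetical plugin buildpack: activates when the repo contains a
    pangeo-stack file, and constrains the choice of base image to the
    Pangeo organization's images (only the tag is user-controlled)."""

    def detect(self):
        return os.path.exists(os.path.join(self.repo_dir, "pangeo-stack"))

    @property
    def base_image(self):
        with open(os.path.join(self.repo_dir, "pangeo-stack")) as f:
            tag = f.read().strip()
        # The repository name is fixed, so the buildpack can still make
        # assumptions about what is inside the base image.
        return f"pangeo/base-notebook:{tag}"
```

This restriction (choose a tag, not an arbitrary image) is exactly why the buildpack can make safe assumptions about the base.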
I am going to spend a couple hours today trying to prototype this with the PANGEO stacks images, and report back.
Alright, I've a fully working prototype based on the pangeo stack! There's a functional README in https://github.com/yuvipanda/repo2docker-pangeo. Try it out and let me know what you think. It currently requires a repo2docker_config.file that's 2 lines long, but we can probably build a discovery mechanism that removes the need for that. The entire code for implementing this is 77 lines long as well.
This is one approach to having specialized plugins - for PANGEO, Rocker, etc. I like this because it gives the power to maintain the plugin directly to the people who are maintaining the specific base images. It also gives them the responsibility, thus reducing burden on core repo2docker itself - both from a maintainer and code complexity perspective.
This keeps the power of which base images can be used (without a Dockerfile) with the people who are running repo2docker. I'm experimenting with a different approach that gives that power to the people who are making the repositories, using ONBUILD. I'll play with it a bit more and put up a prototype.
A comment from a discussion on discourse: Peter pointed us to https://buildpacks.io/
I think repo2docker already has a lot of the ideas that are in pack, and there are worse options than copying something else that is popular :) (In my maintainer mind I am already wondering how we can retire repo2docker or make it a thin layer on top of pack... because having to maintain less code is always better.)
One thing I like about the pack tool is that a buildpack decides which base image it uses. This goes along the lines of @craig-willis' example of an extra buildpack that chooses a different base image.
We lose composability, or at least it needs careful thinking when creating a new buildpack whether it can still be composed with others or not. This gives rise to the idea of "stacks of buildpacks".
TL;DR Right now I am in favour of "build packs choose their base image", "build packs decide which stack they are in" and "use entrypoints to allow external packages to contribute buildpacks".
Question (after a quick browse of your code @yuvipanda): My impression is that you implement what I wrote in my TL;DR except for using entrypoints. Instead you insert yourself at the top of the build pack search path via some config magic.
This keeps the power of which base images can be used (without a Dockerfile) with the people who are running repo2docker. I'm experimenting with a different approach that gives that power to the people who are making the repositories, using ONBUILD. I'll play with it a bit more and put up a prototype.
Doesn't your prototype already let creators of repos choose the base image via what they write in pangeo-stack?
@betatim:
Question (after a quick browse of your code @yuvipanda): My impression is that you implement what I wrote in my TL;DR except for using entrypoints. Instead you insert yourself at the top of the build pack search path via some config magic.
Oh absolutely, this isn't a new idea at all. I think a bunch of us also talked about it in a team meeting a few months ago when @craig-willis was there. Just new code. Entrypoints is the next step, but this already works with released repo2docker so makes for a nice demo.
<3 to everyone in this thread for hashing out and moving towards a good set of solutions to a very complex problem!
@betatim
Doesn't your prototype already let creators of repos choose the base image via what they write in pangeo-stack?
Nope it does not. It constrains them to only choosing from PANGEO images. This lets the buildpack make assumptions, such as:
TL;DR Right now I am in favour of "build packs choose their base image", "build packs decide which stack they are in" and "use entrypoints to allow external packages to contribute buildpacks".
+1. We need to figure out a way to deal with ordering when inserted via entrypoints, but that's doable.
@betatim I <3 buildpacks.io. A lot of it is straight from s2i, which was what the very first versions of repo2docker were based off of. I wrote https://github.com/yuvipanda/words/blob/master/content/post/why-not-s2i.md at that time when we switched away. TLDR is composability.
https://github.com/yuvipanda/pangeo-stack-onbuild is the other prototype, where stack authors make -onbuild variants of their images. This lets users directly specify which (supported) image they wanna use, and empowers stack authors to support whatever files they wanna support.
This works today on mybinder.org, once I wait for my push of this onbuild image to complete...
https://mybinder.org/v2/gh/yuvipanda/pangeo-stack-onbuild/master works!
It is based off the base-notebook image from PANGEO stack, but lets users customize it simply with an environment.yml file in the repo directory. It also works with all binders right now, without any customization needed on the part of the operators.
Oh absolutely, this isn't a new idea at all.
My comment wasn't in the spirit of "how lame, this ain't a new idea"; I wanted to double-check that I hadn't missed anything and that my impression was correct.
Nope it does not. It constrains them to only choosing from PANGEO images.
Ah yes. I don't think this is a drawback, more a feature because you said: buildpacks make assumptions about the base image.
We need to figure out a way to deal with ordering when inserted via entrypoints, but that's doable.
Hopefully we can find a simple way to do this and construct extra buildpacks so that they play nice with each other (keep the triggers separate so that order doesn't matter so much) and definitely play nice with the base buildpacks. Seems like something we should write down as part of the "entrypoints contract". Something like "you should follow these guidelines, and if you don't we can't offer any support to you or users of your buildpack". A bit like we do with Dockerfiles right now: if you use one you are on your own.
TLDR is composability.
Nods. I was interested to see https://buildpacks.io/docs/using-pack/building-app/ (scroll down to the picture); it makes me think that pack can now compose buildpacks. If we want to chat about this we should probably fork a new thread from this comment on discourse instead of yakking on in this issue.
I'm working with the PANGEO folks to help implement some of this. See https://github.com/pangeo-data/pangeo-stacks/pull/27 for the PR.
You can see a demo here:
https://gist.github.com/yuvipanda/2f139c912f1a4c584bf7e719961a3d02 loaded in mybinder.org as:
https://mybinder.org/v2/gist/yuvipanda/2f139c912f1a4c584bf7e719961a3d02/master
works great! much faster too.
Hey there, I just dug up this issue while searching for a solution to our very specific problem:
I think there were some really good options in this thread which could solve this issue. But there hasn't been an update in a year, this issue is still open, and I'm not sure if there was ever a clearly defined way to go.. :thinking:
Can you help me out there? :)
I think this is a contentious issue with many similar but slightly different use-cases. People want to change the base image but for different reasons/to achieve different goals. I think we need to divide and conquer to make any progress.
For your particular problem I'd suggest the following (and I think I'd be happy to merge a PR implementing it but others might have other opinions).
A side comment which might solve your problem or not: there is the option to add an "appendix" to every repo being built: https://repo2docker.readthedocs.io/en/latest/usage.html#cmdoption-jupyter-repo2docker-appendix BinderHub can also specify one. Maybe this is enough already? As the name suggests it is an appendix, not a prependix so you can only do stuff that fits with being done at the end.
Adding a CLI flag that sets the name of the base image might solve your particular problem. You could make your own base image (somehow), publish it, and then build all repo2docker images on top of that. I'm thinking of something that gets literally pasted into https://github.com/jupyter/repo2docker/blob/023e577eee68d5567ddf783a56ac32d44fd5b64c/repo2docker/buildpacks/base.py#L17. This would give the "owner" of the repo2docker process full control and responsibility to provide a base image that will work. The fact that not every base image will work with repo2docker is (for me) the main blocker to making this functionality widespread.
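The "paste it into the template" idea could be sketched roughly as follows. The real template in repo2docker/buildpacks/base.py is much larger, and a `--base-image` flag is a hypothetical addition, not an existing option.

```python
# Minimal sketch of making the Dockerfile template's FROM line
# configurable. The template here is a tiny stand-in for the real one.
TEMPLATE = """\
FROM {base_image}

ENV DEBIAN_FRONTEND=noninteractive
# ... the rest of the (much longer) repo2docker template ...
"""

DEFAULT_BASE = "buildpack-deps:bionic"

def render_dockerfile(base_image=None):
    """Render the template, pasting in an alternate base if given
    (e.g. from a hypothetical --base-image CLI flag)."""
    return TEMPLATE.format(base_image=base_image or DEFAULT_BASE)

print(render_dockerfile("myregistry.example.com/custom-base:1.0").splitlines()[0])
```

As noted above, the catch is entirely in the contract: the rendered template only works if the supplied base image satisfies everything the rest of the template assumes.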
What do you think? (if we want to discuss in more detail maybe we should make a new thread for this specific idea)
@betatim @yuvipanda this is really interesting since it could allow users to build images in layers (which of course has other benefits), where they could build one image from another image. Our use-case is to replicate the build mechanism within the jupyter/docker-stacks repo. If a user wants an image with more stuff, then repo2docker works great, but the user may have a giant image once they are done.
With this in mind, we could run:
jupyter-repo2docker https://github.com/norvig/pytudes \
--image-name foo/bar
jupyter-repo2docker https://github.com/norvig/pytudes \
--build-args BASE_IMAGE=foo/bar \
--image-name bizz/bazz
For example, with #909 updating the base image to Ubuntu 20.04 may work great for most if not all packages, but it could very well be the case that the base image needs to be another version of Ubuntu, or even another base image altogether, for package xyz to work correctly.
Another benefit is that if one were to select a more specific image from buildpack-deps, for example, then the user could remove some/all packages from the apt.txt file since they would be included in the base image already.
If there is something we could do to help with this effort let us know (poc, draft wip, etc)!
@betatim
I'll throw in our use case to support this, mostly for consultation. We want users (effectively the jovyan user) to be able to modify the runtime environment; in particular, to run sudo apt-get update/install ... over the Terminal in a Notebook Server session, but also maybe to install other scientific libraries which don't easily support user installation. Being able to throw in our own image would allow us to add jovyan to /etc/sudoers, ignoring that: a) the user is created only later, and b) ${NB_USER} is configurable.
Please do correct me if I'm wrong, but, although being able to toy around with system packages in a running environment seems like a fairly standard request, it seems no other approach covers this case, except for using an explicit Dockerfile, which I can't use here for other reasons.
Would a new config file like post-build-admin (or even pre-build-admin?) that runs a script as root solve some problems without the complexity of a new base image?
Would a new config file like post-build-admin (or even pre-build-admin?) that runs a script as root solve some problems without the complexity of a new base image?

Yes, it would; post-build-admin would be better because the user would be set up already and I could probably somehow dynamically grab its name (how? some env var?).
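As a rough sketch of the post-build-admin idea (everything here is hypothetical: the file name, the rendering function, and where it would slot into the template), the buildpack could emit a root-privileged stage after the notebook user is created, then drop back down:

```python
def post_build_admin_lines(script="post-build-admin", nb_user="${NB_USER}"):
    """Hypothetical rendering of a root-stage hook into Dockerfile lines:
    switch to root, run the repo's script (which may e.g. edit
    /etc/sudoers), then drop privileges back to the notebook user."""
    return [
        "USER root",
        f"RUN chmod +x {script} && ./{script}",
        f"USER {nb_user}",
    ]

for line in post_build_admin_lines():
    print(line)
```

Because the hook runs after user creation, the script can read the configured user name from the build environment rather than hard-coding jovyan.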
Note: I'm aware this would imply a whole bunch of bad usage practices like those discussed in https://github.com/jupyterhub/repo2docker/issues/192
That's true, but at least all the bad practices are confined to one file, so it's easy to know where to find them when dealing with a broken image, and we could say post-build-admin has the same support level as Dockerfile, i.e. not much.
Just bookkeeping another case for postBuildAdmin: https://gitter.im/jupyterhub/binder?at=5fce35b6fb7f155587ad481a (this one could also be covered by preBuildAdmin).
One important use case: with Docker Hub pull limits, it'd be very useful to redirect the base image pull to a registry cache. We can support this by just setting the build image via ARG. If that's too controversial, would there be any issues with supporting setting a prefix for the build image?
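The "prefix" variant of this could be as small as the following sketch (a hypothetical helper, not existing repo2docker code; the mirror address is made up):

```python
def with_registry_prefix(image, prefix=None):
    """Hypothetical helper: redirect a base-image pull through a caching
    registry by prepending a prefix, e.g. to avoid Docker Hub pull limits.
    With no prefix configured, the image reference is unchanged."""
    if not prefix:
        return image
    return f"{prefix.rstrip('/')}/{image}"

print(with_registry_prefix("buildpack-deps:bionic", "mirror.example.com/cache"))
```

A prefix is less flexible than a full ARG override, but it also can't swap in an incompatible base image, which may make it easier to support.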
From https://github.com/jupyterhub/repo2docker/commit/20b081525785fdc6ac524e06150028796b8787ec :
👍
- Is there a list of base images?
- Is there a reference RPM-based base image?
RPM based images are not supported. I would probably say that only Ubuntu versions are supported. And in general if it breaks, you kinda get to keep the pieces.
I feel like we've discussed this a few times but I can't find a specific issue, so:
What if we exposed the ability for CLI users of repo2docker to specify a base image to use instead of the default Ubuntu base image? I could see this being useful for:
@craig-willis if you have thoughts on the above that'd be helpful!
I think the trick here is that we'd need to lay down some clear rules for what would need to be in the image. It'd have to run the same Ubuntu version, and would need to have jupyterhub / server stuff ready to go. Perhaps it could be treated as an "advanced use case, you should know what you're doing" kinda thing.