We could look at our launch stats and then integrate a "trigger build on binder" into their CI setup when they merge something to master.
Some more ideas on where to potentially shave off some megabytes: https://simonwillison.net/2018/Nov/19/smaller-python-docker-images/
At some point we should check if we have gotten to the point where we are now chasing single-digit MB improvements and stop.
One thing we could try is to copy the requirements.txt early, run pip install -r requirements.txt wrapped in a "if this fails just continue" block, then copy the full repo, rerun pip install -r requirements.txt which will either be a no-op (if the early run succeeded) or will clean up the breakage from the first run.
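A rough sketch of what that could look like in a generated Dockerfile (REPO_DIR here is just an illustrative placeholder for wherever the repo contents end up, not necessarily the exact variable repo2docker uses):

```dockerfile
# copy only the requirements file first so this layer stays cached across
# rebuilds where only other repo files changed
COPY requirements.txt ${REPO_DIR}/requirements.txt
# first pass: best effort, so a requirements.txt that needs the full repo
# (e.g. "-e .") does not fail the build here
RUN pip install -r ${REPO_DIR}/requirements.txt || true
COPY . ${REPO_DIR}
# second pass: a no-op if the first pass succeeded, otherwise it installs
# whatever needed the full repository contents
RUN pip install -r ${REPO_DIR}/requirements.txt
```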
This makes sense to me.
Want to give that a go in an experimental PR to give us something to try out? :)
If your image is on Docker Hub Microbadger gives a nice visualisation of the size of each layer, e.g. https://microbadger.com/images/jupyter/tensorflow-notebook
Thanks for the microbadger link! I've been using https://github.com/wagoodman/dive to look at local images and poke around the filesystem to find files that could be removed.
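For quick local poking, something along these lines also works (the image name is just an example):

```bash
# per-layer sizes straight from the docker CLI
docker history jupyter/tensorflow-notebook
# interactive, per-layer filesystem browser
dive jupyter/tensorflow-notebook
```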
This seems like voodoo magic to me - https://github.com/docker-slim/docker-slim - but I pass it on anyway, just in case...
Just in case anyone stumbles across this thread looking for other tools, https://www.fromlatest.io/#/ will have a go at profiling a Dockerfile and point out ways of optimising it if it thinks it can...
What's the policy on which packages are included in the base environment.yml? For example, nbconvert (via conda, not pip IIRC) pulls in pandoc and pandoc-citeproc, which are ~85 MB extracted (compared to libpython which is 16 MB).
On nbconvert specifically, is pandoc required? Do people using pip to manage their requirements complain that it's not present by default? https://github.com/conda-forge/nbconvert-feedstock/issues/24 is the nbconvert-feedstock issue. Perhaps we could split that into an nbconvert-core feedstock, which doesn't have pandoc, and have nbconvert depend on it.
Investigated if we can use the trick from #716 (pre-assemble) with environment.yml and the answer is no. An environment.yml like:
```yaml
name: stats
dependencies:
  - numpy
  - pandas
  - pip:
    - -r requirements.txt
```
is legal. This means there could be users who rely on the contents of the repo being available.
A potential way out is to try to install the environment.yml before copying over the repo (speculating that most users don't rely on this feature), rolling back the transaction on failure and retrying after the copy. Something like:
```dockerfile
COPY environment.yml environment.yml
RUN conda env create -f environment.yml || true
COPY . /home/jovyan
RUN conda env create -f environment.yml
```
This assumes the second conda run is "free", i.e. a no-op, if the first succeeded (which might not be true).
We have nbconvert because it is a dependency of the notebook. I think we have discussed (in a previous issue) if we could make the notebook not show some of the "Save as" options that become unavailable if you install nbconvert-light (aka nbconvert-only-to-html). I think I'd be -1 on having the UI elements available but the functionality being broken because we install a partial nbconvert.
The nbconvert files probably exist in the layer of the docker image which is shared amongst all images built by the same version of repo2docker. So at least in the context of mybinder.org they are "free" (or at least somewhat less expensive). Users requesting different versions of Python makes this image layer sharing less clear cut :-/
Have you tried using the CNCF Buildpack project? It seems aligned in that it lets you build different parts of your app separately and then combine the layers without having to rebuild everything:
> Have you tried using the CNCF Buildpack project?
We've discussed it but I can't find the issue again. "buildpack" turns out to be a terrible search word in r2d land :-/
@betatim from what I can tell, Cloud Native Buildpacks were touched on here: https://github.com/jupyter/repo2docker/issues/487#issuecomment-479858333 (Edit: more accurately, the pack CLI was touched on, but note that it represents only part of the buildpacks.io effort)
From an outsider's squinting perspective, I feel like CNBs are trying to solve the same class of problems as r2d: efficient, automated, composable builds.
I am chatting with buildpack folks on https://slack.buildpacks.io/ if anyone is interested (cc @yuvipanda)
repo2docker used to be based off s2i, which is (IIRC) very similar in architecture to Cloud Native BuildPacks. We moved away from it for reasons listed in http://words.yuvi.in/post/why-not-s2i/.
Maybe things are different now? I'd say that should be in a different issue than here, though. This issue should be focused on performance improvements we can make with what we have right now.
By way of warning, what follows is hilariously biased: I've worked on two generations of buildpack technology over the past 5 years. Pride makes me defensive.
As it was related to me by a Red Hatter I asked, s2i was created largely because the previous generations of buildpack lifecycles from Heroku (v2a) and Cloud Foundry (v2b) were optimised to a rootfs+tarball target (Heroku's term is "slug", Cloud Foundry's is "droplet"). That was considered unsuitable for OpenShift v3, which was an image-centric architecture.
Whereas Heroku and Cloud Foundry would meet you at code and hid the underlying container infrastructure, OpenShift would meet you at the image, so the latter (this is a personal opinion) had a business need for something like buildpacks to reduce the convenience gap.
But s2i never really found a home outside of OpenShift, while buildpacks have flourished in two massive, independent but genetically-related ecosystems.
Critically, the emergence of the v2 registry API enables features (particularly layer rebasing) that were previously impossible. In addition, Google's Container Tools team developed and maintain the go-containerregistry library, which allows us to perform construction and rebasing operations with or without the docker daemon. The design of CNBs takes full advantage of both of these advances.
By way of speed improvements: We have observed some Java rebuilds drop from minutes to milliseconds. We expect large-cluster rollouts to drop from dozens of hours to potentially minutes.
Edit: I should add, your reasons for moving off s2i would apply to v2a and v2b buildpack lifecycles as well. One of the motivating problems faced by both Pivotal and Heroku has been exactly this sort of combinatorial explosion; CNBs are designed to make it possible to more easily compose buildpacks developed independently of one another.
Thank you for chiming in, @jchesterpivotal! I very much appreciate the context, and the important new information about v3 of buildpacks :)
> CNBs are designed to make it possible to more easily compose buildpacks developed independently of one another.
This gives me hope :) I've opened #720 to continue discussion about buildpacks v3, including a test case that'll help us see if we can base repo2docker off it.
Another idea via @gaborbernat at EuroPython: when we do decide that a requirements.txt doesn't need the repo contents to run, we could reorder the packages in it before running it to increase (a bit) the chances that two repositories with the same dependencies share a cache layer. Same goes for environment.yml. I am not sure if the standard for either file specifies if there is meaning to the order in which packages are listed or not. That is something to check.
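To make the idea concrete, a minimal sketch of such a normalisation (this is not something repo2docker does today, and it assumes package order really carries no meaning, which is exactly the open question):

```python
# Sketch: rewrite requirements.txt so that two repos listing the same
# packages in a different order produce byte-identical files, and therefore
# have a chance of sharing a Docker layer.
def normalise_requirements(text: str) -> str:
    lines = (line.strip() for line in text.splitlines())
    # drop blanks and comments before sorting and de-duplicating
    kept = {line for line in lines if line and not line.startswith("#")}
    return "\n".join(sorted(kept)) + "\n"


with open("requirements.txt") as f:
    print(normalise_requirements(f.read()), end="")
```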
@betatim Things like https://github.com/pypa/pip/issues/5761 suggest that order is arbitrary, although you can use multiple requirements.txt files, executed in order.
To maximise chances of hitting caches, I guess something like this might work?

- Build requirements.txt sets from cached requirements components (scipy_reqs.txt, geo_reqs.txt, etc);
- build a set from the repository's own requirements (user_reqs);
- use scipy_reqs.issubset(user_reqs) or user_reqs.issuperset(scipy_reqs) to find whether you have a cached set of packages available and install those if so;
- install whatever remains, i.e. user_reqs.difference(scipy_reqs, geo_reqs).

Of course, if folk pin dependencies, this could make finding hits / matches much harder...
I came across these lines in repo2docker/repo2docker/buildpacks/base.py:
```dockerfile
# Copy and chown stuff. This doubles the size of the repo, because
# you can't actually copy as USER, only as root! Thanks, Docker!
USER root
COPY src/ ${REPO_DIR}
RUN chown -R ${NB_USER}:${NB_USER} ${REPO_DIR}
```
I do not understand exactly why this would double the size of the repo, but you can now copy as the user. The following would do the copy as the user:
```dockerfile
COPY --chown=${NB_UID}:${NB_GID} src/ ${REPO_DIR}
```
I have used the copy-as-user approach and it seems to work as advertised. I can open a PR if this seems promising...
@davidrpugh https://github.com/jupyter/repo2docker/issues/164 is the issue about using (or not) --chown. The recent posts have gone a bit off topic but it is still the best place to go to find context.
In #743 and #718 we added support for installing packages before we copy over the contents of the repository. This means that if you are re-building a repo where only the README.md has changed, the package install step should be cached (and fast)! Currently we give install.R, environment.yml and requirements.txt this special treatment. The best way to see this in action is to test it locally. (On mybinder.org we need to do some work to increase the chances that a rebuild is assigned to the same node as the original build.)
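For example (the repository path is just a placeholder), building the same repo twice and only touching the README in between should show Docker reporting "Using cache" for the package install layers on the second build:

```bash
# first build: all layers are built
jupyter-repo2docker --no-run ./my-repo
# edit only README.md in ./my-repo, then rebuild: the package install
# layers should come from the Docker cache
jupyter-repo2docker --no-run ./my-repo
```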
I think except for the COPY --chown most ideas in this thread have either been tried and merged or rejected, so we can close this thread up. Thanks for all the ideas!
This issue is about collecting ideas that could make the images produced by repo2docker smaller, faster to push/pull or faster to build.
I envision this thread to be a meta thread with lots of ideas that then either turn into PRs or get marked as "already done" or "not possible". This way we can use it as an entry point to finding other related issues or PRs.
Why make images smaller and faster to build? From our own experience and the first results from the binder user survey, it is clear that faster builds and faster launches are something people really care about.
Smaller images
A few ideas via https://jcrist.github.io/conda-docker-tips.html
- conda clean -afy: already implemented
- --freeze-installed: not currently used, unsure if it would help, worth trying

Reordering build steps for faster builds
Right now we have to rebuild the whole image (from the point onwards where we copy in the contents of the repo) even if the user only changes a typo in the README. The reasoning behind this is that a requirements.txt could contain something like -e . which leads to the setup.py in the repo being executed. This in turn means the setup process could be executing "anything and depends on everything in the repo". There is no way of knowing that the one character change in the README won't change the build result.

However I think this is a fringe case and the common case is that people only install packages from PyPI and don't depend on the rest of the repository. How can we make it so this common case is faster and still get the rare case right?
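For concreteness, the "rare case" is a requirements.txt along these lines (made-up contents):

```
numpy
# editable install of the repository itself: installing this runs the
# repo's setup.py, so the build really does depend on the whole repo
-e .
```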
The obvious way to speed up builds and rebuilds is to copy only the requirements.txt into the image, run the install step and then copy over the rest of the repository. This way a change in the README won't break the docker layer cache, which means rebuilds are fast.

One thing we could try is to copy the requirements.txt early, run pip install -r requirements.txt wrapped in a "if this fails just continue" block, then copy the full repo, rerun pip install -r requirements.txt which will either be a no-op (if the early run succeeded) or will clean up the breakage from the first run.

We invented install.R so we could declare it a mistake to rely on anything in the repository. This means we can copy it over early. This would save a large amount of user pain because R builds are some of the slowest builds we have (see #716).

For environment.yml I am not sure if you can install things from a local directory or not. In either case we could treat it like the requirements.txt case (try, ignore errors, retry).

Overhead from docker layers
One thing I was wondering is if an image of the same size and content that has 100 layers (say one per file added) has more pull overhead than one that consists of only one layer. From watching docker pull it seems there are various steps that happen after a layer has been pulled (checksum, unpacking) that could be saved by reducing the number of layers.