We could look at our launch stats and then integrate a "trigger build on binder" into their CI setup when they merge something to master.
Some more ideas on where to potentially shave off some megabytes: https://simonwillison.net/2018/Nov/19/smaller-python-docker-images/
At some point we should check if we have gotten to the point where we are now chasing single-digit MB improvements and stop.
One thing we could try is to copy the requirements.txt early, run pip install -r requirements.txt wrapped in a "if this fails just continue" block, then copy the full repo, rerun pip install -r requirements.txt which will either be a no-op (if the early run succeeded) or will clean up the breakage from the first run.
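A rough sketch of what that could look like in a generated Dockerfile (REPO_DIR here is just an illustrative placeholder for wherever the repo contents end up, not necessarily the exact variable repo2docker uses):

```dockerfile
# copy only the requirements file first so this layer stays cached across
# rebuilds where only other repo files changed
COPY requirements.txt ${REPO_DIR}/requirements.txt
# first pass: best effort, so a requirements.txt that needs the full repo
# (e.g. "-e .") does not fail the build here
RUN pip install -r ${REPO_DIR}/requirements.txt || true
COPY . ${REPO_DIR}
# second pass: a no-op if the first pass succeeded, otherwise it installs
# whatever needed the full repository contents
RUN pip install -r ${REPO_DIR}/requirements.txt
```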
This makes sense to me.
Want to give that a go in an experimental PR to give us something to try out? :)
If your image is on Docker Hub Microbadger gives a nice visualisation of the size of each layer, e.g. https://microbadger.com/images/jupyter/tensorflow-notebook
Thanks for the microbadger link! I've been using https://github.com/wagoodman/dive to look at local images and poke around the filesystem to find files that could be removed.
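For quick local poking, something along these lines also works (the image name is just an example):

```bash
# per-layer sizes straight from the docker CLI
docker history jupyter/tensorflow-notebook
# interactive, per-layer filesystem browser
dive jupyter/tensorflow-notebook
```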
This seems like voodoo magic to me - https://github.com/docker-slim/docker-slim - but I pass it on anyway, just in case...
Just in case anyone stumbles across this thread looking for other tools, https://www.fromlatest.io/#/ will have a go at profiling a Dockerfile and point out ways of optimising it if it thinks it can...
What's the policy on which packages are included in the base environment.yml? For example, nbconvert (via conda, not pip IIRC) pulls in pandoc and pandoc-citeproc, which are ~85 MB extracted (compared to libpython which is 16 MB).
On nbconvert specifically, is pandoc required? Do people using pip to manage their requirements complain that it's not present by default? https://github.com/conda-forge/nbconvert-feedstock/issues/24 is the nbconvert-feedstock issue. Perhaps we could split that into an nbconvert-core feedstock, which doesn't have pandoc, and have nbconvert depend on it.
Investigated if we can use the trick from #716 (pre-assemble) with environment.yml and the answer is no. An environment.yml like:
```yaml
name: stats
dependencies:
  - numpy
  - pandas
  - pip:
    - -r requirements.txt
```
is legal. This means there could be users who rely on the contents of the repo being available.
A potential way out is to try to install the environment.yml before copying over the repo (speculating that most users don't rely on this feature), rolling back the transaction on failure and retrying after the copy. Something like:
```dockerfile
COPY environment.yml environment.yml
RUN conda env create -f environment.yml || true
COPY . /home/jovyan
RUN conda env create -f environment.yml
```
This assumes the second conda run is "free", i.e. a no-op, if the first succeeded (which might not be true).
We have nbconvert because it is a dependency of the notebook. I think we have discussed (in a previous issue) if we could make the notebook not show some of the "Save as" options that become unavailable if you install nbconvert-light (aka nbconvert-only-to-html). I think I'd be -1 on having the UI elements available but the functionality being broken because we install a partial nbconvert.
The nbconvert files probably exist in the layer of the docker image which is shared amongst all images built by the same version of repo2docker. So at least in the context of mybinder.org they are "free" (or at least somewhat less expensive). Users requesting different versions of Python makes this image layer sharing less clear cut :-/
Have you tried using the CNCF Buildpack project? It seems aligned in that it lets you build different parts of your app separately and then combine the layers without having to rebuild everything:
> Have you tried using the CNCF Buildpack project?
We've discussed it but I can't find the issue again. "buildpack" turns out to be a terrible search word in r2d land :-/
@betatim from what I can tell, Cloud Native Buildpacks were touched on here: https://github.com/jupyter/repo2docker/issues/487#issuecomment-479858333 (Edit: more accurately, the pack CLI was touched on, but note that it represents only part of the buildpacks.io effort)
From an outsider's squinting perspective, I feel like CNBs are trying to solve the same class of problems as r2d: efficient, automated, composable builds.
I am chatting with buildpack folks on https://slack.buildpacks.io/ if anyone is interested (cc @yuvipanda)
repo2docker used to be based off s2i, which is (IIRC) very similar in architecture to Cloud Native BuildPacks. We moved away from it for reasons listed in http://words.yuvi.in/post/why-not-s2i/.
Maybe things are different now? I'd say that should be in a different issue than here, though. This issue should be focused on performance improvements we can make with what we have right now.
By way of warning, what follows is hilariously biased: I've worked on two generations of buildpack technology over the past 5 years. Pride makes me defensive.
As it was related to me by a Red Hatter I asked, s2i was created largely because the previous generations of buildpack lifecycles from Heroku (v2a) and Cloud Foundry (v2b) were optimised to a rootfs+tarball target (Heroku's term is "slug", Cloud Foundry's is "droplet"). That was considered unsuitable for OpenShift v3, which was an image-centric architecture.
Whereas Heroku and Cloud Foundry would meet you at code and hid the underlying container infrastructure, OpenShift would meet you at the image, so the latter (this is a personal opinion) had a business need for something like buildpacks to reduce the convenience gap.
But s2i never really found a home outside of OpenShift, while buildpacks have flourished in two massive, independent but genetically-related ecosystems.
Critically, the emergence of the v2 registry API enables features (particularly layer rebasing) that were previously impossible. In addition, Google's Container Tools team developed and maintain the go-containerregistry library, which allows us to perform construction and rebasing operations with or without the docker daemon. The design of CNBs takes full advantage of both of these advances.
By way of speed improvements: We have observed some Java rebuilds drop from minutes to milliseconds. We expect large-cluster rollouts to drop from dozens of hours to potentially minutes.
Edit: I should add, your reasons for moving off s2i would apply to v2a and v2b buildpack lifecycles as well. One of the motivating problems faced by both Pivotal and Heroku has been exactly this sort of combinatorial explosion; CNBs are designed to make it possible to more easily compose buildpacks developed independently of one another.
Thank you for chiming in, @jchesterpivotal! I very much appreciate the context, and the important new information about v3 of buildpacks :)
> CNBs are designed to make it possible to more easily compose buildpacks developed independently of one another.
This gives me hope :) I've opened #720 to continue discussion about buildpacks v3, including a test case that'll help us see if we can base repo2docker off it.
Another idea via @gaborbernat at EuroPython: when we do decide that a requirements.txt doesn't need the repo contents to run, we could reorder the packages in it before running it to increase (a bit) the chances that two repositories with the same dependencies share a cache layer. Same goes for environment.yml. I am not sure if the standard for either file specifies if there is meaning to the order in which packages are listed or not. That is something to check.
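To make the idea concrete, a minimal sketch of such a normalisation (this is not something repo2docker does today, and it assumes package order really carries no meaning, which is exactly the open question):

```python
# Sketch: rewrite requirements.txt so that two repos listing the same
# packages in a different order produce byte-identical files, and therefore
# have a chance of sharing a Docker layer.
def normalise_requirements(text: str) -> str:
    lines = (line.strip() for line in text.splitlines())
    # drop blanks and comments before sorting and de-duplicating
    kept = {line for line in lines if line and not line.startswith("#")}
    return "\n".join(sorted(kept)) + "\n"


with open("requirements.txt") as f:
    print(normalise_requirements(f.read()), end="")
```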
@betatim Things like https://github.com/pypa/pip/issues/5761 suggest that order is arbitrary, although you can use multiple requirements.txt files, executed in order.
To maximise chances of hitting caches, I guess something like this might work?

- Build requirements.txt sets from cached requirements components (scipy_reqs.txt, geo_reqs.txt, etc);
- build a set from the repository's own requirements (user_reqs);
- use scipy_reqs.issubset(user_reqs) or user_reqs.issuperset(scipy_reqs) to find whether you have a cached set of packages available and install those if so;
- install whatever remains, i.e. user_reqs.difference(scipy_reqs, geo_reqs).

Of course, if folk pin dependencies, this could make finding hits / matches much harder...
I came across these lines in repo2docker/repo2docker/buildpacks/base.py:
```dockerfile
# Copy and chown stuff. This doubles the size of the repo, because
# you can't actually copy as USER, only as root! Thanks, Docker!
USER root
COPY src/ ${REPO_DIR}
RUN chown -R ${NB_USER}:${NB_USER} ${REPO_DIR}
```
I do not understand exactly why this would double the size of the repo, but you can now copy as the user. The following would do the copy as the user:
```dockerfile
COPY --chown=${NB_UID}:${NB_GID} src/ ${REPO_DIR}
```
I have used the copy-as-user approach and it seems to work as advertised. I can open a PR if this seems promising...
@davidrpugh https://github.com/jupyter/repo2docker/issues/164 is the issue about using (or not) --chown. The recent posts have gone a bit off topic but it is still the best place to go to find context.
In #743 and #718 we added support for installing packages before we copy over the contents of the repository. This means that if you are re-building a repo where only the README.md has changed, the package install step should be cached (and fast)! Currently we give install.R, environment.yml and requirements.txt this special treatment. The best way to see this in action is to test it locally. (On mybinder.org we need to do some work to increase the chances that a rebuild is assigned to the same node as the original build.)
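For example (the repository path is just a placeholder), building the same repo twice and only touching the README in between should show Docker reporting "Using cache" for the package install layers on the second build:

```bash
# first build: all layers are built
jupyter-repo2docker --no-run ./my-repo
# edit only README.md in ./my-repo, then rebuild: the package install
# layers should come from the Docker cache
jupyter-repo2docker --no-run ./my-repo
```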
I think except for the COPY --chown most ideas in this thread have either been tried and merged or rejected, so we can close this thread up. Thanks for all the ideas!
This issue is about collecting ideas that could make the images produced by repo2docker smaller, faster to push/pull or faster to build.
I envision this thread to be a meta thread with lots of ideas that then either turn into PRs or get marked as "already done" or "not possible". This way we can use it as an entry point to finding other related issues or PRs.
Why make images smaller and faster to build? From our own experience and the first results from the binder user survey, it is clear that faster builds and faster launches are something people really care about.
Smaller images
A few ideas via https://jcrist.github.io/conda-docker-tips.html
- conda clean -afy: already implemented
- --freeze-installed: not currently used, unsure if it would help, worth trying

Reordering build steps for faster builds
Right now we have to rebuild the whole image (from the point onwards where we copy in the contents of the repo) even if the user only changes a typo in the README. The reasoning behind this is that a requirements.txt could contain something like -e . which leads to the setup.py in the repo being executed. This in turn means the setup process could be executing "anything and depends on everything in the repo". There is no way of knowing that the one character change in the README won't change the build result.

However I think this is a fringe case and the common case is that people only install packages from PyPI and don't depend on the rest of the repository. How can we make it so this common case is faster and still get the rare case right?
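For concreteness, the "rare case" is a requirements.txt along these lines (made-up contents):

```
numpy
# editable install of the repository itself: installing this runs the
# repo's setup.py, so the build really does depend on the whole repo
-e .
```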
The obvious way to speed up builds and rebuilds is to copy only the requirements.txt into the image, run the install step and then copy over the rest of the repository. This way a change in the README won't break the docker layer cache, which means rebuilds are fast.

One thing we could try is to copy the requirements.txt early, run pip install -r requirements.txt wrapped in a "if this fails just continue" block, then copy the full repo, rerun pip install -r requirements.txt which will either be a no-op (if the early run succeeded) or will clean up the breakage from the first run.

We invented install.R so we could declare it a mistake to rely on anything in the repository. This means we can copy it over early. This would save a large amount of user pain because R builds are some of the slowest builds we have (see #716).

For environment.yml I am not sure if you can install things from a local directory or not. In either case we could treat it like the requirements.txt case (try, ignore errors, retry).

Overhead from docker layers
One thing I was wondering is if an image of the same size and content that has 100 layers (say one per file added) has more pull overhead than one that consists of only one layer. From watching docker pull it seems there are various steps that happen after a layer has been pulled (checksum, unpacking) that could be saved by reducing the number of layers.