Proposal for versioning, immutable historical builds

parente commented 9 years ago

Current Setup

Docker Hub (DH) automatically builds git master and tags it with latest
DH automatically builds version branches like 4.0.x and tags them with 4.0
We only ever roll the latest master branch and latest version branch forward with fixes.

Problems

We don't tag images per git commit, so users cannot easily roll back to prior Docker image versions. This matters when major libraries change (e.g., Spark): some users want the latest, which should go into master, while others want to stay on the current version.
Changes on any branch in the git repo cause a build storm on Docker Hub (see issue #15). All tags jump to the latest build even when nothing changed in the build definition.

Proposed Improvement

Stop relying on Docker Hub automated builds. (We need more control.)
Stop relying on the 3.2.x and 4.0.x branches, and on branches in general. (Too coarse-grained.)
Adopt the mantra that when a PR wants to bump a library version in one of the stacks, we take it without question.
Set up a build VM that maintainers can access to easily pull from this repo and build images. (Manually for now. We can automate later.)
Beef up the Makefile so that a make latest (or some such) does the following:
- Builds the necessary stacks (change in minimal = build everything, change in scipy = just that stack, etc.) from master HEAD
- Tags the latest built images (even if they were not rebuilt just now) with latest
- Tags those same images with the current git commit SHA
- Pushes all of those tags and new builds to Docker Hub (Note: The client will be smart about only sending deltas / tag metadata for images that did not change.)
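
As a rough sketch of the steps above, the Python below plans the docker commands for one run instead of executing them. The stack names and FROM relationships in PARENTS, the one-directory-per-stack layout, and the helpers stacks_to_build and plan_latest are illustrative assumptions, not the actual Makefile.

```python
# Hypothetical stack hierarchy: each stack maps to the stack it builds FROM.
PARENTS = {
    "minimal": None,
    "scipy": "minimal",
    "r": "minimal",
}


def stacks_to_build(changed):
    """Return the changed stacks plus every stack derived from them, parents first."""
    def derives_from_changed(stack):
        # Walk up the FROM chain looking for a changed stack (including itself).
        while stack is not None:
            if stack in changed:
                return True
            stack = PARENTS[stack]
        return False

    def depth(stack):
        # Number of FROM hops between this stack and the root stack.
        hops = 0
        while PARENTS[stack] is not None:
            stack = PARENTS[stack]
            hops += 1
        return hops

    affected = [s for s in PARENTS if derives_from_changed(s)]
    return sorted(affected, key=lambda s: (depth(s), s))


def plan_latest(changed, git_sha, owner="jupyter"):
    """List the docker commands for one run: build, tag with the git SHA, push."""
    commands = []
    for stack in stacks_to_build(changed):
        # Assumes each stack lives in a directory named after it.
        commands.append(f"docker build -t {owner}/{stack}:latest ./{stack}")
    # Every stack gets the SHA tag, even if it was not rebuilt in this run.
    # Assumes the previous :latest image is still present on the build machine.
    for stack in sorted(PARENTS):
        image = f"{owner}/{stack}"
        commands.append(f"docker tag {image}:latest {image}:{git_sha}")
        commands.append(f"docker push {image}:latest")
        commands.append(f"docker push {image}:{git_sha}")
    return commands


for command in plan_latest({"scipy"}, "9bd33dcc8688"):
    print(command)
```

With a change only in scipy, the plan contains a single docker build line followed by tag and push lines for all three stacks.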

Net Result for End Users

For people that want to walk-up-and-use the latest and greatest:

docker run jupyter/some-stack-name

For people that want to depend on a specific container image configuration tied to some point in time in the docker-stacks git history:

docker run jupyter/some-stack-name:<some-git-sha>

where GitHub / git makes the contents of that particular tagged image visible to the user (i.e., find the SHA in git and look at the Dockerfiles).
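
To make that lookup concrete, here is a small sketch that turns a SHA-tagged image reference into a GitHub URL showing the repo at that commit. It assumes the source lives at github.com/jupyter/docker-stacks and that each stack has a directory named after its image; source_url is a hypothetical helper, not part of any tool.

```python
def source_url(image_ref, repo="jupyter/docker-stacks"):
    """Map 'owner/stack:<git-sha>' to the GitHub tree URL for that stack at that commit."""
    name, sep, tag = image_ref.partition(":")
    if not sep or tag == "latest":
        # A floating tag does not identify a single commit.
        raise ValueError("need an image reference pinned to a git SHA tag")
    stack = name.split("/")[-1]
    return f"https://github.com/{repo}/tree/{tag}/{stack}"


print(source_url("jupyter/some-stack-name:9bd33dcc8688"))
# prints https://github.com/jupyter/docker-stacks/tree/9bd33dcc8688/some-stack-name
```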

parente commented 9 years ago

/cc @rgbkrk @minrk @kbroughton

parente commented 9 years ago

I should add that in order to effect this change, we'll need to delete the existing automated build repos and recreate them as manual repos from pushed images. There will be a window during this process in which the images are not available. There's just no way to manually push to an automated build repo or convert it to a manual repo.

Maybe we could ask Docker support, but I doubt it'll be possible.

rgbkrk commented 9 years ago

We definitely went through this before, when I moved jupyter/demo over to manual builds. Momentary downtime for this is going to happen. It's ok.

rgbkrk commented 9 years ago

I'm definitely a fan of requiring people to pin to the (now available) SHA hashes instead of us maintaining the branches. That's a lot of work otherwise.

parente commented 9 years ago

The problem with a Docker SHA is that you don't really know the content without pulling the image. I'm suggesting a bit of make automation that tags the Docker image with the git SHA before push, so that your view into the image is simply the contents of the git repo at that git SHA.

rgbkrk commented 9 years ago

Ohhhhhh

minrk commented 9 years ago

I think that makes sense. I guess there's no room for traditional version tags, since the images contain so many different packages, is there? Can you mock up what the buildbot would look like?

parente commented 9 years ago

I think we could still do additional version tags manually at key points, but what those points are and how to capture the version have eluded me. When the primary process version changes? When a major library changes? With tags like notebook_4.0.1_spark_1.5.0? The good part of the new scheme is that if we ever figure that out or want to tag specific images, we're free to do so at will. The Docker Hub automated build precluded it.
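
Purely to illustrate the tag shape floated above, the sketch below joins component versions into one tag and checks it against Docker's tag character rules. The version_tag helper and the choice of components are assumptions; nothing here settles when such a tag should be applied.

```python
import re

# Docker tag rules: letters, digits, '_', '.', '-'; at most 128 characters;
# the first character may not be '.' or '-'.
TAG_RE = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}")


def version_tag(versions):
    """Join (component, version) pairs into a single tag string."""
    tag = "_".join(f"{name}_{version}" for name, version in versions)
    if not TAG_RE.fullmatch(tag):
        raise ValueError(f"not a valid Docker tag: {tag!r}")
    return tag


print(version_tag([("notebook", "4.0.1"), ("spark", "1.5.0")]))
# prints notebook_4.0.1_spark_1.5.0
```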

> Can you mock up what the buildbot would look like?

The attached PR has the simple make steps that any CI system should be able to run. As a next step after trying it manually for a bit to make sure there are no surprises, we can try to do it via Travis, or Circle, or our own Jenkins, or ...

parente commented 9 years ago

The new Makefile is in. I'm running the first build using it in a tmux session on the VM documented in the README. Since there were Debian fixes, it's a pretty big rebuild. The box probably needs a bit of disk performance tuning too. Will keep an eye on it.

In the meantime, all the original images are still available on Docker Hub as they were before. So no "outage" while we get the latest and greatest built and pushed.

parente commented 9 years ago

Built master at SHA 9bd33dcc8688 and pushed all tags to Docker Hub. The disk buffering for the image layers was really slow on the VM for some reason compared to other VMs in the same data center. I'll dig into it over time.

parente commented 8 years ago

Update on slowness: https://github.com/docker/docker/pull/15493

It appears that Docker 1.9.0 includes this PR, which addresses the problem; others have seen it as well. In the meantime, we deal with it.

dnk8n commented 5 years ago

I think a maintained document with the versions would be all that is required (it could even be a JSON or bash variable manifest that the Dockerfile references too). That way, whenever there is a less complicated version bump, only one file changes. It would also then be a lot easier to find the tag (although scouring the source control should not be a requirement).

Since docker images and git commits can be referenced by multiple tags, a git tag like datascience-1.0.1 and docker tag like 1.0.1 could be applied every time the version manifest changes.

A dev then could look in one place at the version manifest and the tags would be clearly visible. Creating some documentation somewhere central would be trivial.
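
A minimal sketch of the idea, assuming a JSON manifest with hypothetical field names (stack, version, packages) and a hypothetical helper tags_from_manifest. It only derives the two tag names; creating the git and docker tags would be a separate step.

```python
import json

# Hypothetical manifest contents; in practice this would be a file in the repo.
MANIFEST_JSON = """
{
  "stack": "datascience",
  "version": "1.0.1",
  "packages": {"python": "3.6.6"}
}
"""


def tags_from_manifest(text):
    """Derive the git tag and docker tag that go with one manifest version."""
    manifest = json.loads(text)
    return {
        "git_tag": f"{manifest['stack']}-{manifest['version']}",
        "docker_tag": manifest["version"],
    }


print(tags_from_manifest(MANIFEST_JSON))
# prints {'git_tag': 'datascience-1.0.1', 'docker_tag': '1.0.1'}
```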

I am happy to help, but I guess this might not be high on your priority list, and something like this would require buy-in from a lot of people.

The nice thing is that the current system can remain in place. Extra tags wouldn't break anything for anyone downstream.

At least adding special tags for programming language changes would be ideal. For example, move the tag python3.6.6 to whichever the latest docker image is with that version of python.
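
One way to picture the moving tag, as a sketch only: given builds in chronological order, keep the newest build for each Python version. The floating_tags helper and the placeholder SHAs are made up for illustration.

```python
def floating_tags(builds):
    """Map 'python<version>' to the SHA tag of the newest build with that version.

    builds is a list of (git_sha, python_version) pairs, oldest first.
    """
    tags = {}
    for git_sha, python_version in builds:
        # Later builds overwrite earlier ones, so the tag moves forward.
        tags[f"python{python_version}"] = git_sha
    return tags


# Placeholder SHAs, not real commits.
builds = [("aaa111", "3.6.5"), ("bbb222", "3.6.6"), ("ccc333", "3.6.6")]
print(floating_tags(builds))
# prints {'python3.6.5': 'aaa111', 'python3.6.6': 'ccc333'}
```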

Just some ideas from a downstream user perspective. The system is usable now that I know how it works though. With a bit of digging I was able to find the appropriate image.

jupyter / docker-stacks

Proposal for versioning, immutable historical builds #12
