pangeo-data / pangeo-docker-images

Docker Images For Pangeo Jupyter Environment
https://pangeo-docker-images.readthedocs.io
MIT License

Initial design discussion #2

Closed scottyhq closed 4 years ago

scottyhq commented 4 years ago

Hi @TomAugspurger, @jhamman, @yuvipanda, @tjcrone, @ocefpaf, @rabernat I've gone ahead and created this repository to try out the ideas discussed in: https://github.com/pangeo-data/pangeo-stacks/issues/125

TomAugspurger commented 4 years ago

Can you summarize the major differences from repo2docker?

I just pushed up my prototype to https://github.com/TomAugspurger/pangeo2docker. The README highlights the high-level design.

Unfortunately I wasn't able to work on it towards the end of last week, so it's not as far along as I hoped.

scottyhq commented 4 years ago

@TomAugspurger just made a few more updates, borrowing a few things from your config. Here are the current major differences I can think of:

For base-image we're already at almost half the current image size (1.19 GB compared to 2.05 GB decompressed; 327.12 MB compressed on DockerHub). https://hub.docker.com/repository/registry-1.docker.io/pangeodev/base-image

I'm still not sure how best to allow for custom environments with BinderHub configurations. The approach I've started with would require people to put a more verbose Dockerfile in their binder folder: https://github.com/pangeo-data/pangeo-stacks-dev/blob/staging/pangeo-image. This creates new layers on top of the base-image, adding considerably to the size: https://hub.docker.com/repository/registry-1.docker.io/pangeodev/pangeo-image

scottyhq commented 4 years ago

@TomAugspurger - If I understand correctly, your approach in pangeo2docker avoids a base conda environment altogether and generates the Dockerfile from a template + python script. We could definitely also go that route, but I'm not sure how lacking the base environment might influence things (for example, using these images on hubs where people create custom conda environments in their NFS home directories).

The other huge benefit of your approach is that conda env update is never needed; instead a single conda env create -f is used for every image. It looks like that would require some functions to combine environment.yml files (or we create a conda metapackage and require all environment files to contain -pangeo-core=x.x.x?). How might this work for maintaining onbuild binder configurations like this: https://github.com/reproducible-notebooks/hurricane-ike-water-levels/tree/master/binder ?

TomAugspurger commented 4 years ago

approach in pangeo2docker avoids a base conda environment altogether and generates the Dockerfile from a template+python script

Yes, though I think the "avoiding a base environment" is mostly orthogonal to "template + python script". We could just as easily use the standalone conda in https://github.com/pangeo-data/pangeo-stacks-dev/blob/659802b1e099d260e0843bbfa2056e78da52ac69/base-image/Dockerfile#L58. But I'd recommend avoiding that for now. It's a minor optimization (~70 MB IIRC?) and causes headaches elsewhere.

conda env update is never needed

Yes, I'd recommend stealing this idea if possible, since it can have such a big impact on build / solve times. The downside is that (AFAICT) that necessitates having a template Dockerfile, which bumps up the complexity level.

TomAugspurger commented 4 years ago

How might this work for maintaining onbuild binder configurations like this: https://github.com/reproducible-notebooks/hurricane-ike-water-levels/tree/master/binder ?

I think things are fine as long as we only support one level of customization (the base environment.yml + the repo's environment.yml). Nothing in https://github.com/reproducible-notebooks/hurricane-ike-water-levels/blob/master/binder/environment.yml would need to change. That environment.yaml would be merged with the base environment (https://github.com/TomAugspurger/pangeo2docker/blob/master/pangeo2docker/base.yaml) and the union of the packages is passed to conda create.
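As a rough illustration of that merge step (file names and package contents here are hypothetical; the real implementation would parse the YAML rather than plain text lists), the union amounts to something like:

```shell
# Hypothetical sketch: take the union of a base package list and a repo's
# package list before a single solve, so conda env update is never needed.
printf 'dask\njupyterlab\nxarray\n' > base-packages.txt
printf 'geoviews\nxarray\n' > repo-packages.txt
sort -u base-packages.txt repo-packages.txt > merged-packages.txt
cat merged-packages.txt
```

Duplicates (xarray here) collapse to a single entry, and the merged list is what would be handed to a single conda create.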

scottyhq commented 4 years ago

Yes, I'd recommend stealing this idea if possible, since it can have such a big impact on build / solve times. The downside is that (AFAICT) that necessitates having a template Dockerfile, which bumps up the complexity level.

I did a bit of refactoring to use a single Dockerfile instead of a template. This requires separate apt.txt, environment.yml, and postBuild files in each image directory, but avoids the need for a template and a script to merge environment.yml files.

I think things are fine as long as we only support one level of customization (the base environment.yml + the repo's environment.yml).

If possible I'd like to keep all auxiliary files available for binder. Since the current onbuild setup requires a Dockerfile, I think it's okay to require the full Dockerfile in a binder-compatible repository, and point people towards a template repository (https://github.com/pangeo-data/pangeo-binder-template)?

Trying things out here, but our binders hang with: Error from server (BadRequest): container "builder" in pod "build-scottyhq-2dpangeodev-2dbinder-a75d9b-0a2d67" is not available. And I'm seeing errors on mybinder.org: ‘apt.txt’: No such file or directory. I'm guessing this is because of some working-directory setting: https://github.com/scottyhq/pangeodev-binder

scottyhq commented 4 years ago

@TomAugspurger, @jhamman, @yuvipanda, @tjcrone, @ocefpaf, @rabernat, @rsignell-usgs. I've put in a little more time today, and I think this is now a solid working prototype! I'm going to summarize where things are at below, and we can discuss on next week's call.

The pangeo-notebook and pangeo-ml images don't contain all the packages in the current pangeo-stacks, but a solid subset, and because they all have the same number of layers, uncompressed sizes are currently at least 2x smaller across the board:

ocefpaf commented 4 years ago

all package versions unpinned during build, but captured with conda env export

Take a look at https://github.com/mariusvniekerk/conda-lock/blob/master/README.md for this part. We are working on improving it, packaging and adding artifact hash checks.

tjcrone commented 4 years ago

@scottyhq, thank you for doing this! This is fantastic, and very similar to what I had in mind for a Pangeo image simplification. I have a couple of comments/questions:

  1. Would it be possible to use conditional run statements in your Dockerfile so that all of the build files are not required?
  2. Would it be feasible to build the base-image and start with that in all follow-on images? I know that this would require a second conda solve, but I'm wondering if this might be worth it considering how much simpler follow-on Dockerfiles will be.
scottyhq commented 4 years ago

@tjcrone - thanks! Your feedback is much appreciated. Also, the latest changes incorporate some feedback from @jhamman to use separate conda metapackages for dask worker images and jupyter notebook images (several hundred MB are saved by not installing jupyter packages and lab extensions on dask workers).

Would it be possible to use conditional run statements in your Dockerfile so that all of the build files are not required?

Good idea! I believe this is now implemented correctly.
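The pattern behind those conditional steps can be sketched in plain shell (file names from the thread; the actual package install is only echoed here, since a real build would run apt-get inside a Dockerfile RUN):

```shell
# Each build file is optional: the step becomes a no-op when the file is
# absent, so downstream images only provide the files they actually need.
workdir=$(mktemp -d) && cd "$workdir"
if [ -f apt.txt ]; then
  echo "installing: $(tr '\n' ' ' < apt.txt)"
else
  echo "no apt.txt; skipping apt packages"
fi
printf 'git\nvim\n' > apt.txt   # now provide the optional file
if [ -f apt.txt ]; then
  echo "installing: $(tr '\n' ' ' < apt.txt)"
fi
```

The same if-then-else guard works for environment.yml and postBuild, which is what lets a single Dockerfile serve image directories that only contain a subset of the build files.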

Would it be feasible to build the base-image and start with that in all follow-on images?

Yes! Initially I wasn't doing this correctly, but by incorporating @yuvipanda's ONBUILD instructions directly into base-image/Dockerfile, we now have the best of both worlds in my opinion: 1) a single conda solve and 2) inherited configuration. Note the new binder configuration with this setup is virtually unchanged from what people are used to: https://github.com/scottyhq/pangeodev-binder/tree/master/binder

I'm pretty happy with the system now, but it could use some more tests and eyes. The huge wins are that the CI build times are ~10min and the base-notebook image is now an order of magnitude smaller (current latest sizes after pulling from DockerHub):

| old base-notebook | new base-notebook |
| --- | --- |
| 2.05 GB | 894 MB |
tjcrone commented 4 years ago

@scottyhq, this is really great! I think it is really nice to have things organized like this, in part because of the transparency and simplicity, but also because it is really easy to build images locally for testing. Your README.md is especially helpful. Thank you so much for putting this together.

I'm currently on the fence regarding a different image for the worker. Early in the Pangeo project we had separate images, but I think we decided that caring for separate images wasn't worth the benefit of having a smaller worker image. And here, because we bifurcate at base-notebook/worker, a full Pangeo deployment is a stack of five images. Without different notebook/worker images, we could collapse this all the way back down to two. I think this is worth a discussion because we may be able to speed up worker creation either by storing uncompressed docker layers, or by cloning worker nodes such that the docker images are already present on new machines. I'm not sure if the infrastructure exists for either of these solutions, but we should look into it.

This is awesome. Thank you!

scottyhq commented 4 years ago

I'm currently on the fence regarding a different image for the worker.

Agreed @tjcrone . I can imagine two directions for this repository:

1) call this repo "docker-base" and publish only base images (so base-image, base-worker, and base-notebook), leaving any environment and customization choices to others. For example, https://github.com/pangeo-data/pangeo-stacks no longer builds a 'base image' and just has 'notebook' images for pangeo-notebook, pangeo-esip, pangeo-ml etc... but these all use the same base from here (FROM pangeo/base-image:2020.02.27)

2) Use this repo as a direct replacement for the existing pangeo-stacks, intended for building images used across HUBs. There is value in providing pre-built images to users and using the exact same image like pangeo-ml in both Google and AWS for consistency and debugging errors. The difficulty is agreeing on what libraries should be in 'pangeo-notebook' and maintaining 'pangeo-ml' with many heavy dependencies is a burden.

jhamman commented 4 years ago

From my perspective, this is headed toward a drop in replacement for pangeo-stacks. Nice work @scottyhq!

I am personally a pretty big fan of the notebook/worker image distinction. For the user, this distinction should be both transparent and optional. What do I mean by this? We can set an environment variable in each notebook image that points to the corresponding dask worker image and let Dask-{kubernetes,gateway} pick up that image during initialization. Of course, the user should be able to overwrite this with other options (within reason).
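A minimal sketch of that handoff (the variable name DASK_IMAGE_SPEC appears later in this thread; the image tag here is hypothetical):

```shell
# The notebook image bakes in a default worker image; dask-kubernetes or
# dask-gateway would read this variable at cluster startup, and users can
# export their own value beforehand to override the default.
DASK_IMAGE_SPEC="${DASK_IMAGE_SPEC:-pangeo/pangeo-worker:2020.03.11}"
echo "$DASK_IMAGE_SPEC"
```

The ${VAR:-default} expansion is what makes the distinction both transparent (a default always exists) and optional (any user-set value wins).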

What's next here?

1) Pangeo-stacks has a documentation page built by sphinx. Is there interest in porting that over and providing more thorough user and developer documentation?

2) Estimate/outline the downstream breakage potential if we start publishing these new images under the old namespace. We should communicate the path forward for existing users of pangeo-stacks.

3) what else? blog post? vacation?

jhamman commented 4 years ago

I also meant to comment on this:

call this repo "docker-base" and publish only base images...

I'm not a big fan of this option. We already know there is a use case for intermediate and/or complete images. We should make it clear how users can extend the base images for custom applications while providing a few complete images for common workflows.

scottyhq commented 4 years ago

Thanks for the input @jhamman .

Pangeo-stacks has a documentation page built by sphinx. Is there interest in porting that over and providing more thorough user and developer documentation?

For now I think we can just add a bit more detail to the README.md? At the very least it would be nice to put the full list of conda packages in a more obvious place. One alternative to posting them on a separate website is just adding them to the repo: https://github.com/pangeo-data/pangeo-stacks-dev/issues/15

Estimate/outline the downstream breakage potential if we start publishing these new images under the old namespace. We should communicate the path forward for existing users of pangeo-stacks.

  • There are no more -onbuild images, so all downstream configurations must start using FROM pangeo/base-image:DATE
  • The default conda environment is now named 'pangeo' instead of 'notebook'. It could be anything, but the current base-image has it hardcoded, so environment.yml files have to use that as the env name
  • We are now pinning jupyterlab-related and dask-related package versions with conda metapackages (https://github.com/pangeo-data/conda-metapackages), so if people try to install new major versions conda will complain. Docs should point out the default conda config in /srv/conda/.condarc and pinned versions in /srv/conda/envs/pangeo/conda-meta/pinned.
  • maybe some other things...

    We can set an environment variable in each notebook image that points to the corresponding dask worker image and let Dask-{kubernetes,gateway} pick up that image during initialization.

I like this idea, but it does add complexity. We currently set the dask image to be ${JUPYTER_IMAGE_SPEC} in our dask_config.yml, which is copied to /srv/conda/envs/pangeo/etc/dask/dask.yml: https://github.com/pangeo-data/pangeo-stacks-dev/blob/0725cf8dfac8a3711782ec5a1f1090a3f60af0ab/base-image/dask_config.yml#L35. We can simply change that to DASK_IMAGE_SPEC, but for BinderHubs JUPYTER_IMAGE_SPEC isn't known until jupyter launches, so the default start script would look like:

#!/bin/bash -l

# ==== ONLY EDIT WITHIN THIS BLOCK =====

export MY_ENV_VAR="myValue"
export DASK_IMAGE_SPEC=${JUPYTER_IMAGE_SPEC}

# ==== ONLY EDIT WITHIN THIS BLOCK =====

exec "$@"

To end up with the default behavior: DASK_IMAGE_SPEC=pangeoaccess/binder-stagingscottyhq-2dpangeodev-2dbinder-a75d9b:7fb9ab8292797a97425685d168f13ac25358b074

@tjcrone are you still on the fence? Alternatively we just leave this out and let advanced users customize their own dual dask/notebook images. Right now I think we're only saving a couple hundred MB in the worker images, so it might not be worth the extra complexity...

what else? blog post? vacation?

vacation.

tjcrone commented 4 years ago

I am no longer on the fence. I think having a separate worker and notebook image, and bifurcating the images at base-notebook/worker so that a pangeo deployment is a five (!) image stack, is not worth whatever is gained. I would argue for a structure that has one pangeo-base image including the smallest most basic set of packages to make Pangeo work, and then a set of domain-specific images (e.g. pangeo-ocean) building from it with extra packages defined in the environment file. Eventually perhaps users can select the image they want when they log in, or they continue to select the image through the subdomain. Also users have always been able to change the worker image either through the environment variable or through ~/.config/dask/dask.yaml. Not sure if that works for BinderHubs. But I doubt most users will want or need workers with different images. As the use of dask-cloudprovider becomes more common, even more control will go back to the users which is a good thing.

In the meantime, I think efforts to make workers start more quickly could get a lot of mileage out of figuring out how to get docker to push uncompressed layers (which I'm pretty sure is not possible on dockerhub but might be something that could be added when using a private container registry), and/or figuring out how to get Kubernetes to copy docker images directly from other nodes in the cluster when a new node starts. Does anyone know if this is possible?

TomAugspurger commented 4 years ago

Can't people who don't want split worker / notebook images just use the notebook image on their workers? Won't the notebook image always be a superset of the worker image?

Or is there always some complexity that leaks through to users who just want the one image?

tjcrone commented 4 years ago

@TomAugspurger, yes, users can currently control what image is run on workers. But I think most users will use the system as it is structured by default. The complexity comes with having to build and support five separate images for each Pangeo deployment (instead of two, one of which should require changes less often), keeping the package sets/versions between all worker and notebook images aligned, and having to explain/learn this complexity when others want to make changes or improvements. All of this to save ~200 MB in the worker images. I'm not seeing how this trade off is worth it, but maybe I'm missing something.

In my testing, the OOI image layers are downloaded to worker nodes in a few seconds (from within our co-located Azure registry), but layer decompression can take many minutes on our standard workers. This problem is compounded because both staging and prod download/decompress the image at the same time to the same machine. I feel like there should be a solution to this issue but I have not found it yet. Finding the solution may mean that we can stop worrying about worker image size.

tjcrone commented 4 years ago

When a Dask cluster needs to spawn new nodes, the worker images have usually already been downloaded to one or more nodes in the cluster. And these nodes in many cases are likely located on the same physical machines, or ones that are nearby. Why can't Kubernetes create these new nodes without requiring a pull/decompress of these images that are already present? Copying images from one machine to another by hand is possible. But perhaps doing this would break the Kubernetes model in an unacceptable way. Or maybe it is possible and I don't know how to do it.

Another potentially fruitful way forward for making Pangeo/Dask worker creation much faster would be to coax Docker into pushing uncompressed layers. With a local private registry, this could be a game-changer. I feel like if this is not possible now, it is a feature that might be added to docker without a huge amount of effort.

I don't really know where to start on either of these issues. Anyone have any ideas?

scottyhq commented 4 years ago

Thanks @tjcrone - I'd suggest opening up a new issue for image pull/decompress speed in pangeo-cloud-federation repo or dask_kubernetes (maybe you already have). We're likely to get more experts chiming in on those more visible repositories.

scottyhq commented 4 years ago

Based on feedback on the last pangeo cloud ops call we are back to 3 key images: base-notebook, pangeo-notebook, and ml-notebook and intend to directly replace pangeo-stacks with this repo.

I've made one more major change for the 2020.03.11 images in the base-image Dockerfile. You can now use packages.txt instead of environment.yml (environment.yml still works, but you have to specify the pangeo-notebook metapackage):

name: pangeo
channels:
  - conda-forge
dependencies:
# NOTE!!!!! must specify pangeo-notebook metapackage if using environment.yml
  - pangeo-notebook
  - geoviews

Alternatively using packages.txt and requirements.txt installs pangeo-notebook metapackage automatically, so you only have to include additional analysis packages. For example see: https://github.com/scottyhq/pangeo-binder-packages https://github.com/scottyhq/pangeo-binder-explicit

rsignell-usgs commented 4 years ago

using packages.txt and requirements.txt installs pangeo-notebook metapackage automatically

@scottyhq , how does this automatic installation of the pangeo-notebook metapackage happen?
(I looked through the repo and couldn't figure it out). 😞

scottyhq commented 4 years ago

@rsignell-usgs the metapackage is specified as a default package, so any conda create command will install it: https://github.com/pangeo-data/pangeo-stacks-dev/blob/5c6c836acfca028efc1a8710b97b7e78325930de/base-image/condarc#L8-L9

that conda configuration file is installed in the base image here: https://github.com/pangeo-data/pangeo-stacks-dev/blob/5c6c836acfca028efc1a8710b97b7e78325930de/base-image/Dockerfile#L47

The pangeo-notebook metapackage includes: https://github.com/conda-forge/pangeo-notebook-feedstock/blob/80eddcee8a865f7ce5dd4d9ed054587183fadce1/recipe/meta.yaml#L11-L19

requirements:
  run:
    - pangeo-dask =0.0.2
    - dask-gateway =0.6*
    - dask-labextension =1.1*
    - jupyter-server-proxy =1.2*
    - jupyterhub =1.1*
    - jupyterlab =1.2*
    - nbgitpuller =0.8*

and the pangeo-dask metapackage includes: https://github.com/conda-forge/pangeo-dask-feedstock/blob/c083e9e2f9b8e7252b979a936fe0860c72c90cb9/recipe/meta.yaml#L11-L16

requirements:
  run:
    - dask-kubernetes =0.10*
    - dask =2.11*
    - distributed =2.11*
    - tornado =6*
rsignell-usgs commented 4 years ago

@scottyhq, we can still use the good ol' environment.yml if we want to, right? I like the clarity that provides about what is actually in the environment.

Unless there is some performance or other technical benefit to using .condarc, packages.txt and requirements.txt to set the environment?

scottyhq commented 4 years ago

@scottyhq, we can still use the good ol' environment.yml if we want to, right? I like the clarity that provides about what is actually in the environment.

Yes, I agree. We can instead require all images to explicitly include it.

Unless there is some performance or other technical benefit to using .condarc, packages.txt and requirements.txt to set the environment?

1) You only have to specify pangeo-notebook=0.0.2 in one place and all images in this repository get it (this is mainly a carry-over from pangeo-stacks; it is convenient but not necessary)

2) conda create with packages.txt is more flexible, in particular explicit package lists (which would be helpful for binders) https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#building-identical-conda-environments

ocefpaf commented 4 years ago
  • You only have to specify pangeo-notebook=0.0.2 in one place and all images in this repository get it (this is mainly a carry-over from pangeo-stacks, it is convenient but not necessary)

While that is convenient, it obfuscates the dependencies a little bit. I don't have strong opinions on how pangeo wants to set this up, but I'd go for a fully explicit list of dependencies instead of an "inheritance" model.

Indeed. My only concern with that is the use of pangeo outside of the binder world, like a single user trying to reproduce pangeo's env locally. Maybe we could have a nice and readable environment.yaml, that would serve as an entry point for many users, and add some sort of lock file solution from that env to be used with binder. That way we could even track the evolution of the env file by just updating the lock file every now and then. Also, human updates would always be in the env file, the lock file would be updated by automation only.

scottyhq commented 4 years ago

I've taken @ocefpaf and @rsignell-usgs 's feedback onboard and changed some things. I'm pretty confident in starting to use this in place of the existing pangeo-stacks. Summary of recent changes:

  1. Each notebook folder has all the files needed for its own build. In other words, no default packages get installed behind the scenes and pangeo-notebook does not build off of base-notebook. The only 'onbuild' image is pangeo/base-image, everything builds off of that, so binder repos previously using -onbuild images now have to use pangeo/base-image:DATE (see https://github.com/scottyhq/pangeodev-binder). I've tried to clarify this in the repo readme.

  2. The other big change is using conda-lock to fully specify each notebook environment with spec-file.txt. The downside of this is needing to run the pre-solve step locally (edit environment.yml and run conda-lock -f $1/environment.yml -p linux-64). This is a bit inconvenient, but the huge advantages of this are reproducible builds (see binder docs, and conda docs on this) and the ability to see exactly what packages change between builds:

https://github.com/pangeo-data/pangeo-stacks-dev/compare/2020.03.27..2020.03.28

Note that this environment locking process isn't required, but I think it's really important. The build will use environment.yml if there is no spec-file.txt. Also, the CI could probably be spruced up to run conda-lock via PR slash commands (https://github.com/peter-evans/slash-command-dispatch). The key is ensuring the locked packages are included in the commit to the master branch. So ideally someone edits environment.yml and creates a PR, then a commit is automatically added to that PR with spec-file.txt.
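The fallback described above could look roughly like this (file names from the thread; the conda commands are echoed rather than executed, since a real build would need a conda install):

```shell
# Prefer the fully locked spec-file.txt when it exists; otherwise fall back
# to solving environment.yml from scratch.
builddir=$(mktemp -d) && cd "$builddir"
printf 'name: pangeo\n' > environment.yml
if [ -f spec-file.txt ]; then
  echo "locked build: conda create -n pangeo --file spec-file.txt"
else
  echo "unlocked build: conda env create -f environment.yml"
fi
```

With this guard, repos that never run conda-lock keep working unchanged, while repos that commit a spec-file.txt get byte-for-byte reproducible solves.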

ocefpaf commented 4 years ago

So ideally someone edits environment.yml and creates a PR, then a commit is automatically added to that PR with spec-file.txt

I wonder if we could put a pre-commit hook or GitHub action for that.

scottyhq commented 4 years ago

Closing since this repository has been humming along successfully for 6 months