scottyhq opened this issue 4 years ago
After a bit of discussion with @tjcrone plus some topics that have come up on pangeo community meetings, I wanted to post some ideas for simplifying this repository and pangeo images in general:

1) @jcrist suggested creating a conda-metapackage for the 'minimal pangeo environment' to run on hubs with dask-kubernetes. Users could use this to create their own compatible environments with custom libraries. If anyone wants to take this on, it seems like a great idea! https://docs.conda.io/projects/conda-build/en/latest/resources/commands/conda-metapackage.html Once created, we could just specify pangeo-core==1.0 in the base-environment configuration (https://github.com/pangeo-data/pangeo-stacks/blob/master/base-notebook/binder/environment.yml).

2) The onbuild system combined with repo2docker layered conda environments is very confusing. I think maintaining compatibility with repo2docker is a good idea for compatibility with different binderhub deployments. But can we just publish onbuild images and drop the -onbuild name? This would halve the number of options out there and reduce confusion over which to use (https://pangeo-data.github.io/pangeo-stacks/images.html#pangeo-pangeo-notebook).

3) If we could figure out a way to get PR images into a public image repo, or stored as build artifacts, a less error-prone and faster approach to pushing master images to dockerhub would be just to relabel them (for example docker tag pangeo/base-notebook:PR129 pangeo/base-notebook:latest): https://dille.name/blog/2018/09/20/how-to-tag-docker-images-without-pulling-them/

4) Other thoughts?

Would welcome feedback from @yuvipanda, @jhamman, @rabernat, @rsignell-usgs, @ocefpaf
I think these ideas are great @scottyhq. I especially like the idea of a conda metapackage. Another potential way to simplify is to move the image configurations from pangeo-cloud-federation into pangeo-stacks, so that pangeo-stacks handles the image builds, and pangeo-cloud-federation handles authentication and the helm deployments on the various cloud providers.
:heavy_plus_sign: :100: for 2.
+1 for 2!
I like 1 very much. I particularly like the idea of coupling this with #82, allowing users to run the containers on their own cloud computing accounts without any hub in between. This would solve the common use case of people who want to just run their own personal Pangeo.
Yeah, would be great to make the script in #82 into a real python package (or hosted website)
I'm not sure that only publishing an onbuild and dropping the onbuild name will do a whole lot to simplify things. We still would have an onbuild image, inside of which r2d_overlay.py is still executed, so that the base-notebook can be used as a starting point for the Pangeo deployment images. I feel like there has to be a simpler way out there.
Thanks @scottyhq for starting this discussion and everyone else for your comments.
What is this repository for? In its most basic sense, pangeo-stacks is simply a place to curate docker images that work on pangeo's cloud deployments (e.g. binder, jupyterhub, dask-kubernetes, dask-gateway). After we started using hubploy to manage Pangeo's deployments, I started pangeo-stacks to allow us to share images between binder and jupyterhub deployments.
Why repo2docker? In the beginning, repo2docker (r2d) was the fastest path to a minimally viable set of images that were jupyterhub-ready. r2d allowed us to forgo the complicated Dockerfiles we had previously been using and adopt an established specification for how the contents of images are defined. The trade-off for this simplicity is that we inherited r2d's quirks (aka features).
What do we really need pangeo-stacks to do? I think we can list a few objectives for this project that will hopefully guide our future development decisions:
I think we are well on our way to making some of these things happen, but there is lots more work to do. You may argue that r2d gives us far more than we need (i.e. multi-language support) or that r2d_overlay adds too much complexity. I'm open to hearing ideas for how we achieve the objectives above without these two tools.
I'm personally keen to maintain the repo2docker specification (num. 2 from my list above) even if we drop repo2docker itself. I think this structure is well defined enough to be useful, but we could write a more opinionated version of r2d that focuses on decreasing image size and extensibility. (This is basically what r2d_overlay does.)
I would also be a fan of trying out the conda-metapackage idea. I think this would go a long way toward addressing compatibility (num. 4 from my list above). The challenge will be in defining what belongs in the core pangeo package.
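For concreteness, a rough sketch of what building such a metapackage could look like with conda-build (the package name, version, pins, and summary below are placeholders, not decisions):

```bash
# Requires conda-build; `conda metapackage` builds a dependency-only package.
conda install -y conda-build

# Hypothetical pangeo-core metapackage pinning a minimal compatible stack.
conda metapackage pangeo-core 1.0 \
    --dependencies "dask >=2.9" "distributed >=2.9" "xarray >=0.14" \
    --summary "Minimal pinned environment for Pangeo cloud deployments"
```

Downstream environment.yml files could then depend on just pangeo-core==1.0 and add whatever custom libraries they need.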
Finally, onbuild. Again, I think r2d_overlay is a great start here but we haven't done enough to document its behavior, develop a test suite, or provide end user error handling. We should probably break r2d_overlay out of this repo and distribute it as its own package.
I tried to write a more opinionated version of the Conda buildpack in repo2docker - https://github.com/yuvipanda/repo2conda2docker. After a few hours, I realized I was practically just re-implementing all its parts.
As an example, I was basing it off the miniconda3 image, and re-using the conda install there instead of installing our own. Then I realized that the miniconda3 image has conda installed as root user - and that means our users can't install packages at runtime! That requires we change permissions of the entire conda install in the base, which immediately doubles the image size from about 450M to about 1G. We can get around this by installing miniconda ourselves with the appropriate permissions, but that is exactly what repo2docker does!
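For illustration, "installing miniconda ourselves with the appropriate permissions" amounts to something like this (the base image, user name, and paths below are illustrative, not what repo2docker actually emits):

```bash
cat > Dockerfile <<'EOF'
FROM buildpack-deps:bionic
# Create the unprivileged user first, then install conda *as that user*,
# so packages can be installed at runtime without a recursive chown.
RUN useradd --create-home --shell /bin/bash jovyan
USER jovyan
RUN wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh \
 && bash /tmp/miniconda.sh -b -p /home/jovyan/conda \
 && rm /tmp/miniconda.sh
ENV PATH=/home/jovyan/conda/bin:$PATH
EOF
docker build -t conda-nonroot-sketch .
```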
This exercise has convinced me that if you're using repo2docker just with pure python stuff, it already pretty much does the right things for our use cases. A lot of the warts are really bug fixes, and I felt I was re-inventing and re-fixing them with repo2conda2docker. I suspect any PANGEO-specific builder will have to deal with similar issues.
To that end, I think if we want the pangeo-stacks repo to be more optimized, we should write hand-optimized Dockerfiles and maintain them. This is, IMO, a lot more work than just using repo2docker. It might be worth it for the conceptual simplicity for those who understand Dockerfiles, but that's a call for y'all to make.
r2d_overlay.py definitely needs to be its own project, and needs more tests!
Pangeo only gets as simple, transparent, and easy to modify as we make it. I just spent nearly a week trying to install a package (gmt=6.0.0) that would not play nice with some of the packages in the pangeo-notebook. Not only would conda not solve the environment, but there was also a package in the pangeo-notebook that was causing GMT to segmentation fault, which of course kills the kernel. Ouch.
The solution was to completely refactor the OOI image so that instead of starting with pangeo-notebook, it starts with base-notebook. I also needed to learn exactly how the Pangeo images are built, so that I could rebuild the OOI image again locally (tens of times unfortunately) to find the offending package. Here is what I learned:
The Pangeo base-notebook does not have a Dockerfile, so repo2docker generates one for us. We don't really know what is/was in the generated Dockerfile because it is not retained, and the base-notebook can change if repo2docker changes how it generates this starting Dockerfile. What's in the Pangeo base-notebook? Well, a lot of stuff that we didn't explicitly specify.
In order to continue using a repo2docker-like build starting with the base-notebook image, we install an undocumented script called r2d_overlay.py inside of an alternate build of our base-notebook which basically just reimplements repo2docker, but unlike repo2docker, it allows us to start with a base image. This issue has been discussed at length, but it doesn't appear that any changes were made upstream, so Yuvi rolled us this script. We are the only group using r2d_overlay.py.
This r2d_overlay.py scheme is implemented again in another onbuild image, the pangeo-notebook image, which is used as the starting point for most of the Pangeo deployment images.
I guess it seems simple now that I write it all out. But I had to dig deep into a lot of nested configuration, find undocumented scripts, read the repo2docker source code, and teach myself a lot of stuff that I'm sure at least one or two of you already knew. :-) I'm not convinced that this is simpler or easier to understand/maintain/modify than a set of straight Dockerfiles, and so I'm going to +1 @yuvipanda's suggestion regarding the Dockerfiles.
One reason to switch to Dockerfiles is that it might make it easier to have Pangeo notebook layers that can be mixed and matched, which would obviate the need for someone like me to dig in so deep to build an image that diverges from the pangeo-notebook, something I was hoping to avoid. But maybe it won't; I am not sure. Even if a set of Dockerfiles won't make this easier, mixable layers are still something I think we should strive for. What are some other solutions?
Oh, and if you were wondering, it was the pyinterp package that was making GMT segmentation fault. It is a package with lots of C/C++, so that makes sense, but I'm not exactly sure what was going on. We also needed to remove tensorflow, keras, pytorch-cpu, and basemap, all of which are incondapatible with gmt=6.0.0.
Another interesting thing I learned, which I'm sure everyone on this thread already knows, is that repo2docker will show something similar to the base-notebook Dockerfile (but apparently not the actual Dockerfile) when passed the --no-build and --debug options, which is a good start for a Pangeo Dockerfile. Fix it up some, add what is in the repo2docker buildpack that we use, add a couple of lines to install the environment and apt packages, do the postBuild, and as far as I can tell, we could have a single Dockerfile for all the deployments. Builds using an ultra-opinionated Pangeo image builder could be structured to cobble together all the environment.yml files in a directory, such as "pangeo_required_environment.yml", "machine_learning_environment.yml", etc., and install them using the one Dockerfile. Ultra simple. A la carte. Easy to add and subtract from.
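In case it's useful to anyone else, this is all it takes to see that generated starting point (the path below is illustrative; point it at whatever directory holds the binder/ configuration):

```bash
pip install jupyter-repo2docker
# Print an approximation of the Dockerfile repo2docker would generate,
# without actually building the image.
jupyter-repo2docker --no-build --debug ./base-notebook
```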
> I guess it seems simple now that I write it all out. But I had to dig deep into a lot of nested configuration, find undocumented scripts, read the repo2docker source code, and teach myself a lot of stuff that I'm sure at least one or two of you already knew. :-)
Ah, this is the core of the problem, @tjcrone :)
@yuvipanda, yes! Thank you! I do think that is the core of the problem. It took me a really long time to figure out how this is all put together, and I spent nearly a week adding one package.
Maybe the core of the core of the problem is that we are using repo2docker, which doesn't really do what we need it to do, and then another script to sort of fix what repo2docker doesn't do, buried inside of an onbuild image, instead of just using a Dockerfile, which does everything that all of these things do, more transparently and, I would argue, actually more simply.
One day we are going to get to an a la carte Pangeo packaging system. I know we can do it!
Some of the following comments are based on the conference call this morning.
> Which is a good start for a Pangeo Dockerfile. Fix it up some, add what is in the repo2docker buildpack that we use, add a couple of lines to install the environment and apt packages, do the postBuild, and as far as I can tell, we could have a single Dockerfile for all the deployments.
I like this idea, and think we should try it out in a separate repo to experiment and get a feel for the pros and cons. @yuvipanda has provided a great starting point here: https://github.com/yuvipanda/repo2conda2docker/blob/master/repo2conda2docker/__init__.py! We could finally try the 'strict' conda-forge channel setting per @ocefpaf's suggestion (https://conda-forge.org/docs/user/tipsandtricks.html) by modifying https://github.com/yuvipanda/repo2conda2docker/blob/master/repo2conda2docker/environment.yml. Hopefully this would significantly reduce image size and environment solving times.
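For reference, the 'strict' setting from that page boils down to two commands (shown as global conda config here; the same keys could live in a .condarc baked into the image):

```bash
conda config --add channels conda-forge
conda config --set channel_priority strict
```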
One advantage I also anticipate for sticking to just Dockerfiles is faster build times through the use of docker build --cache-from. I've noticed that onbuild images using r2d_overlay.py currently rebuild even if the configuration is unchanged. See https://github.com/pangeo-data/pangeo-stacks/actions/runs/35778209, where just the pangeo-notebook environment.yml is changed. pangeo-base builds in 1m 38s because the configuration is unchanged from pangeo-base:latest and the cache is used, but pangeo-esip takes the usual 14m 9s.
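Something like the following in CI is what I have in mind (the image name and tag variable are illustrative):

```bash
# Seed the local build cache from the last published image, then build.
docker pull pangeo/base-notebook:latest || true
docker build --cache-from pangeo/base-notebook:latest \
    -t pangeo/base-notebook:"${GITHUB_SHA}" base-notebook/
```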
If we move away from repo2docker, we definitely want to keep using simple configuration via the same sidecar files: environment.yml, apt.txt, postBuild, start. And we should test that built images are compatible with given binderhub and jupyterhub versions.
@tjcrone
> Maybe the core of the core of the problem is that we are using repo2docker, which doesn't really do what we need it to do, and then another script to sort of fix what repo2docker doesn't do,
r2d_overlay.py does something completely different from repo2docker. The name is only a reference to the fact that it uses the same files as repo2docker. It's a way to let users of pangeo-stacks add more packages / postBuild steps without having to learn to write Dockerfiles, using Docker's ONBUILD functionality. This means you don't need repo2docker to build users' images - you get similar functionality to repo2docker without having to write a Dockerfile.
In general, everything repo2docker does is predicated on users not having to write Dockerfiles for common operations. This is a level of abstraction, but it is required if you don't want users to have to learn Dockerfile syntax.
You can replace repo2docker with Dockerfiles, but that's a different issue than replacing r2d_overlay.py.
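For anyone unfamiliar with ONBUILD, here is a stripped-down sketch of the pattern (the image names, paths, and the r2d_overlay.py invocation are illustrative, not the actual layout):

```bash
cat > Dockerfile <<'EOF'
FROM pangeo/base-notebook:latest
# ONBUILD instructions do nothing now; they run later, inside any
# downstream build that starts with `FROM onbuild-sketch`.
ONBUILD COPY binder/ /tmp/binder/
# Hypothetical invocation applying environment.yml, apt.txt, postBuild, etc.
ONBUILD RUN r2d_overlay.py /tmp/binder/
EOF
docker build -t onbuild-sketch .
```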
@scottyhq
> One advantage I also anticipate for sticking to just Dockerfiles is faster build times through the use of docker build --cache-from. I've noticed that onbuild images using r2d_overlay.py currently rebuild even if the configuration is unchanged. See pangeo-data/pangeo-stacks/actions/runs/35778209, where just the pangeo-notebook environment.yml is changed. pangeo-base builds in 1m 38s because the configuration is unchanged from pangeo-base:latest and the cache is used, but pangeo-esip takes the usual 14m 9s.
This is a limitation of how we use ONBUILD - we're using a single ONBUILD instruction to process all the files (environment.yml, etc.), so a change to any one of them invalidates the cached layer for everything. And the ONBUILD image already comes from a Dockerfile, so switching to Dockerfiles alone doesn't fix this.
If you wanna re-use cache here, you need to generate your environment.yml files some other way - perhaps with a python file that merges them manually to generate new environment.yml files. This might be something to try with repo2conda2docker!
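One possible shape for that, sketched with the third-party conda-merge tool (the tool choice and file paths are assumptions; a small hand-rolled script that merges the YAML would work just as well):

```bash
pip install conda-merge
# Flatten the base environment plus a deployment-specific one into a single
# file, so the resulting install step can be cached as an ordinary layer.
conda-merge base-notebook/binder/environment.yml pangeo-esip/binder/environment.yml \
    > pangeo-esip/merged-environment.yml
```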
> If we move away from repo2docker, we definitely want to keep using simple configuration via the same sidecar files: environment.yml, apt.txt, postBuild, start. And we should test that built images are compatible with given binderhub and jupyterhub versions.
<3. I think having multiple independent implementations of these is extremely important, and I'm excited to see where this goes!
However, note that this is unrelated to r2d_overlay.py, since that deals only with ONBUILD images for downstream users. While something like repo2conda2docker might help with making this particular repo's images build faster, you still need something like r2d_overlay.py to let downstream users inherit from pangeo stacks, add a few packages / run postBuild without having to learn to write Dockerfiles.
I've given merge rights to repo2conda2docker to @scottyhq, @tjcrone and @jhamman. Happy to move it to a different org if needed too.