pangeo-data / pangeo-docker-images

Docker Images For Pangeo Jupyter Environment
https://pangeo-docker-images.readthedocs.io
MIT License

ML image update #188

Closed dhruvbalwada closed 1 year ago

dhruvbalwada commented 3 years ago

I was talking to @scottyhq about using the ML image over here and having pytorch preloaded. I know @rabernat has asked about this before (#179).

We were wondering who is using the ML image, and what requirements they have? @nbren12 @jhamman Usage of the ML image seems low based on the pulls here: https://github.com/pangeo-data/pangeo-docker-images.

Since pytorch and tensorflow are two of the big candidates (and are probably used independently most of the time), @scottyhq suggested having separate pangeo-pytorch and pangeo-tensorflow images.

Any other thoughts that people have?

rabernat commented 3 years ago

Correct, we are not using them much right now. However, there are several projects spinning up now that will require ML Pangeo images, so it's a good time to think about this.

IMO, before creating more images, we need to make a plan to address how to maintain these images sustainably going forward. Within a month or so we should have a dedicated, full-time Pangeo engineer at 2i2c, and that person should be able to help out with this.

nbren12 commented 3 years ago

I don’t use these images.

My $0.02: the many-images problem is a symptom of Docker not being a package manager. Dockerfiles are a linear sequence of commands, while packages form a dependency graph. It will always be hard to map docker images onto the packages people want.

Maintaining multiple images is painful. Honestly, for scientific workflows with GB/TB-scale datasets, “light” containers don’t seem worth the trouble. If you can get away with it, I suggest one mega docker image (you need to pin all package versions or it will constantly break), or leveraging a tool like repo2docker if you need multiple images. You can also, e.g., have packages installed when a user starts a container, like the dask image does.
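As a rough illustration of the "install packages when the container starts" idea, here is a minimal sketch of an entrypoint helper. The variable name EXTRA_PIP_PACKAGES and the PIP override are assumptions made for this example, not something taken from this repo, and the package names are purely illustrative.

```shell
# Hypothetical entrypoint helper: install whatever is listed in
# EXTRA_PIP_PACKAGES (an assumed variable name) before starting the real command.
# PIP can be overridden, e.g. PIP=echo, to print the command instead of running it.
install_extras() {
  if [ -n "${EXTRA_PIP_PACKAGES:-}" ]; then
    # Left unquoted on purpose: the variable holds a space-separated package list.
    ${PIP:-pip} install --no-cache-dir $EXTRA_PIP_PACKAGES
  fi
}

# Dry run with illustrative package names; nothing is installed.
PIP=echo EXTRA_PIP_PACKAGES="zarr==2.8.1 s3fs" install_extras
# prints: install --no-cache-dir zarr==2.8.1 s3fs

# A real entrypoint script would call install_extras and then finish with: exec "$@"
```

The trade-off is slower container startup and less reproducibility, since anything left unpinned in the list is resolved at launch time.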

nbren12 commented 3 years ago

It looks like this repo already uses repo2docker, so maybe the tooling is good enough to support many images 🤷. Maybe pin the "FROM image" statements as well to keep things more reproducible.
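One way to pin "FROM image" statements can be sketched as a small filter that rewrites the tag on each FROM line. The function name pin_from, the sample Dockerfile line, and the tag value are assumptions for illustration. The sketch also assumes image names without a registry port, and it leaves lines that already use an @sha256 digest untouched.

```shell
# Rewrite the tag on every FROM line of a Dockerfile read from stdin.
# Usage: pin_from TAG < Dockerfile
# Lines containing "@" (already pinned by digest) are passed through unchanged.
pin_from() {
  tag="$1"
  sed -E "/@/!s|^(FROM[[:space:]]+[^:@[:space:]]+)(:[^[:space:]]+)?|\1:${tag}|"
}

# Example with an assumed Dockerfile line and an illustrative date tag.
printf 'FROM pangeo/base-image:master\n' | pin_from 2020.09.30
# prints: FROM pangeo/base-image:2020.09.30
```

Pinning to a digest is stricter than pinning to a tag, because a tag can later be moved to a different image.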

scottyhq commented 3 years ago

@nbren12 thanks for the comments. This repo is a bit confusing to understand. Despite the tags, images are in theory reproducible thanks to using conda-lock to pre-solve the environment added to the docker image. So, for example, to recreate an image from the past:

git clone https://github.com/pangeo-data/pangeo-docker-images.git
cd pangeo-docker-images
git checkout 2020.09.30
docker build -t pangeo/base-image:master base-image
docker build -t pangeo/ml-notebook:2020.09.30 ml-notebook

In our attempts so far, though, GPU-enabled ML packages have been hard to cram into the same conda environment, which is perhaps why it's best to pick either tensorflow or pytorch. Preferably, someone actively using the image would be responsible for curating the packages. Not sure who that would be these days?

> It will always be hard to map docker images onto the packages people want.

Couldn't agree more, although we've gotten a lot of mileage out of people using a common environment on pangeo hubs. For long-term sustainability, though, someone will need to tackle allowing users to customize their environment: https://github.com/pangeo-data/pangeo-docker-images/issues/148

nbren12 commented 3 years ago

Ah yes. I see the lock files now.

> GPU-enabled ML packages are hard to cram into the same conda environment though in our attempts so far

Interesting. What's the main barrier? Package versions not resolving?

scottyhq commented 3 years ago

> Interesting. What's the main barrier? Package versions not resolving?

Yeah. For example, see the attempt to add pytorch-gpu and jax in #179: https://github.com/pangeo-data/pangeo-docker-images/runs/1712185623?check_suite_focus=true

It seems like the general guidance is not to mix conda channels (ideally everything comes from conda-forge with the 'strict' channel priority setting). But to get the GPU-enabled packages we've had to relax that setting (https://github.com/pangeo-data/pangeo-docker-images/blob/master/ml-notebook/condarc.yml) and point to packages on specific channels: https://github.com/pangeo-data/pangeo-docker-images/blob/b6e6b19cf6890ce56010e3ef7a7584f49bda3198/ml-notebook/environment.yml#L11-L14
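To see at a glance which dependencies bypass the default channel order, one could list the entries in an environment.yml that use conda's "channel::package" syntax. The sketch below assumes that syntax is how the packages are pointed at specific channels. The function name list_channel_pins and the sample file contents are hypothetical and are not copied from the ml-notebook environment.

```shell
# List dependencies in an environment.yml that are pinned to an explicit
# channel with the "channel::package" syntax.
list_channel_pins() {
  grep -E '^[[:space:]]*-[[:space:]]*[^[:space:]#]+::' "$1" \
    | sed -E 's/^[[:space:]]*-[[:space:]]*//'
}

# Hypothetical environment file, for illustration only.
workdir="$(mktemp -d)"
cat > "$workdir/environment.yml" <<'EOF'
channels:
  - conda-forge
dependencies:
  - xarray
  - pytorch::pytorch
  - nvidia::cudatoolkit
EOF

list_channel_pins "$workdir/environment.yml"
# prints:
# pytorch::pytorch
# nvidia::cudatoolkit
```

A short list here means most of the environment still resolves from a single channel, which keeps the relaxed priority setting contained to a few packages.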

nbren12 commented 3 years ago

Good to know. This topic provokes so much in me; I've spent a lot of time maintaining developer environments. I've been interested in a package manager called Nix, which is basically a more composable docker. I hope it picks up steam in the next few years.

rabernat commented 3 years ago

For some context, I will share the amazing blog post Noah recently published on this topic! https://www.noahbrenowitz.com/post/2021-version-pinning/

It's a hard problem, but one we should keep plugging away at. We don't have a perfect solution yet, but we have made good progress!

scottyhq commented 3 years ago

Love the post, @nbren12! This one is also worth checking out for tips on reducing image size: https://uwekorn.com/2021/03/01/deploying-conda-environments-in-docker-how-to-do-it-right.html

weiji14 commented 1 year ago

Closing this as we've added a pytorch-notebook image in #315. See also discussion at #457 on optimizing the ml-notebook (tensorflow) and pytorch-notebook images further for GPU-accelerated workflows.