pangeo-data / pangeo-docker-images

Docker Images For Pangeo Jupyter Environment
https://pangeo-docker-images.readthedocs.io
MIT License

Use cudatoolkit=11 in both tensorflow and pytorch images #320

Closed scottyhq closed 1 year ago

scottyhq commented 2 years ago

Currently tensorflow (ml-notebook) uses cudatoolkit=10, and the new pytorch image uses cudatoolkit=11. It would be good to keep those versions in sync. Full discussion here: https://github.com/pangeo-data/pangeo-docker-images/pull/315#pullrequestreview-955549154

ngam commented 2 years ago

@scottyhq although I have access to pangeo hub, I haven't really been able to get going with actual usage (I have significant allocated compute resources elsewhere, so I don't want to squeeze shared resources). However, if I remember correctly, the GPUs on pangeo hub were quite old, like M6 or Q 4000 or something like that --- is that still the case?

As you rightly pointed out in the other discussion, the cuda-related software is pretty huge and so unless the hardware is equipped to handle cudatoolkit=11+, I wouldn't necessarily update.


Another related issue (perhaps for another discussion):

It may be useful to think about taking the containers in different directions in the future. Maintaining the GPU-optimized builds has been really challenging on conda-forge (for example, tensorflow 2.9 is about to be released and we haven't been able to build 2.8 yet...). In my current research work, I tend to do only the preliminary testing/exploratory work in a conda/pip/etc. environment, but once I get going, I very quickly move to working with NGC containers. In my experience, this usually yields a significant performance improvement.

The point is: it may be useful to build the ML-GPU images on top of NGC containers instead of the other containers here. The NGC containers come equipped with optimizations for GPUs, and we could then add the rest of the software stack on top of that. In the case of tensorflow, this would likely be done via pip; I believe the NGC container for pytorch uses conda by default --- I haven't used it in a few months, so I'm not sure if that has changed. Either way, I would strictly follow whatever installation technique they use (it is pip-esque in the tensorflow containers).
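To make that concrete, here is a minimal sketch of what such a build could look like (the NGC tag and the package list are illustrative, not a tested recipe):

```dockerfile
# Hypothetical sketch: start from an NGC TensorFlow release (tag illustrative)
FROM nvcr.io/nvidia/tensorflow:22.04-tf2-py3

# Follow NGC's installation style (pip in the tensorflow containers) and
# layer the pure-Python parts of the Pangeo stack on top.
RUN pip install --no-cache-dir \
    dask distributed xarray zarr fsspec jupyterlab
```

The catch, of course, is everything with binary dependencies, which comes up below.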

Anyway, just a thought. While the per-instance performance improvement might be small, if we are talking about large-scale deployments with a lot of researchers and academics using these containers, it can add up to something quite significant (e.g. saving power, resources, time, etc.).

rabernat commented 2 years ago

if I remember correctly, the GPUs on pangeo hub were quite old, like M6 or Q 4000 or something like that

There are many different Pangeo hubs, operated by different communities. The LEAP Pangeo Hub run by 2i2c has K80s.

it may be useful to build the ML-GPU images on top of NGC containers instead of the other containers here.

This sounds like a great suggestion! But also a considerable amount of work. The challenge is that the rest of the pangeo stack (particularly packages with hdf5 and gdal binary dependencies) does not install well via pip. So if we want an image with both optimized ML libraries AND all of those funky geo packages, it's going to be tricky.
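As an illustration of the problem (not a recommendation): GDAL's Python bindings have to be compiled against a matching copy of the C library, so the pip route only works once the system library is already in place and the versions line up exactly:

```sh
# Illustrative: install the C library first, then pin the Python bindings
# to the exact version reported by gdal-config.
apt-get update && apt-get install -y libgdal-dev
pip install "gdal==$(gdal-config --version)"
```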

ngam commented 2 years ago

But also a considerable amount of work.

I actually think it may be slightly less work, because the NGC people update their images every month with exact versions, etc. --- I will start looking into this to see how far I can get by installing the core deps from the containers here into the NGC containers.

The challenge is that the rest of the pangeo stack (particularly packages with hdf5 and gdal binary dependencies) does not install well via pip. So if we want an image with both optimized ML libraries AND all of those funky geo packages, it's going to be tricky.

YES, this is the really hard part. However, we might be able to pull it off! I will start some preliminary work and report back.

There are many different Pangeo hubs, operated by different communities. The LEAP Pangeo Hub run by 2i2c has K80s.

I think we should be careful how far we go optimizing the software without paying attention to the hardware... The NGC containers support Ampere, Turing, Volta, and Pascal only (at least according to their website). The base images are pretty substantial fwiw, around 15GB.

rabernat commented 2 years ago

The base images are pretty substantial fwiw, around 15GB.

🤯 that's huge!

Depending on the NGC containers would be a major refactor for pangeo docker images. But I'm excited by your enthusiasm and would welcome this improvement. 🚀 Let's definitely give it a try and see how far we can get.

ngam commented 2 years ago

There are many different Pangeo hubs

Ooops, sorry, I meant pangeo cloud: https://pangeo.io/cloud.html

ngam commented 2 years ago

Let's definitely give it a try and see how far we can get

I started a new repo to investigate building on top of NGC containers: ngam/ngc-ext-pangeo (second PR there: https://github.com/ngam/ngc-ext-pangeo/pull/2). I think I've isolated a few of the problematic packages (e.g. cartopy) and will likely need to figure out a way around that (i.e. build proj, geos, etc. libs from scratch). Baby steps at first, but we'll see how far we get. I will add tests and other customizations later (or, even better, the maintainers and contributors here can do a much better job than I can!)
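A rough sketch of that workaround (the NGC tag and library choices are placeholders; recent cartopy releases need a newer PROJ than many distros ship, which is where the from-scratch builds would come in):

```dockerfile
# Hypothetical sketch: provide the C libraries cartopy links against,
# then build cartopy itself from source inside the NGC image.
FROM nvcr.io/nvidia/tensorflow:22.04-tf2-py3
RUN apt-get update && apt-get install -y --no-install-recommends \
        libgeos-dev libproj-dev proj-bin proj-data \
    && rm -rf /var/lib/apt/lists/*
RUN pip install --no-binary cartopy cartopy
```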

The base images are pretty substantial fwiw, around 15GB.

🤯 that's huge!

I know... I never really work with the Docker images themselves. I convert them into Singularity SIF images because of HPC policies, and for some reason the tensorflow image ends up closer to just 6 GB, as opposed to the 14--15 GB Docker size. I've just submitted a Singularity image build; adding only netcdf4 on top of the NGC container results in this https://cloud.sylabs.io/library/ngam/nvtf/2204, which is just shy of 6 GB.
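For anyone curious, the conversion itself is a one-liner (tag illustrative); the flattened, squashfs-compressed SIF typically comes out much smaller than the sum of the Docker layers:

```sh
# Build a SIF directly from a Docker registry image.
singularity build nvtf_2204.sif docker://nvcr.io/nvidia/tensorflow:22.04-tf2-py3
```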

ngam commented 2 years ago

A smaller update and a correction:

rabernat commented 2 years ago

Fantastic! Thanks for the update @ngam! Let us know how we can help or whenever you might be ready for someone else to test drive your images.

weiji14 commented 2 years ago

The NGC containers come equipped with optimizations for GPUs and we could then add the rest of the software stack on top of that.

I'd be happy to give this a whirl once you have something ready @ngam! It would really be good to see some benchmarks on NGC vs the current Pangeo docker stack. Optimizing for different NVIDIA GPU generations (e.g. K80s, P100s, V100s, A100s, T4s, H100s, etc.) is a tricky business, and I'm wondering what those NGC containers are currently designed for (my guess is the newer hardware, i.e. A100 and up). There's even a whole field called Hardware-Aware Neural Architecture Search (https://arxiv.org/abs/2101.09336) that's all about optimizing neural network architectures for low latency on different devices (e.g. mobile, CPU, GPU, etc.), but that's handling things higher up in the ML stack, whereas your work would help with optimizing the baseline (i.e. squeezing out the GPU device's true performance).

My only other concern is that the NGC containers are a little more proprietary/less transparent than the current stack. They do publish the docker layers at https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/layers, but it requires a login to view. That said, if those optimizations result in a significant reduction in energy usage/time savings, I'm sure the climate modelling folks would be happy :smile:

ngam commented 2 years ago

My only other concern is that the NGC containers are a little more proprietary/less transparent than the current stack

Someone with more knowledge / experience would need to address this, not me 😅 But as far as I could tell, it's the same license. When one installs cudatoolkit from conda-forge, they print out that license message, right? Also, you may think all of conda-forge is open-source --- nope! A lot of software from nvidia and intel is simply repackaged, i.e. binary repackaging, not even rebuilt from source. This is a frustrating issue for people who actually care about these things, but if I understand correctly, it makes no difference at all for academic/research usage, which is my area, so I never pay close attention. At any rate, even with my skills at sleuthing around the internet, I couldn't find exactly how they build these containers, but I think I have a decent idea based on playing around. The way they build the more basic cuda containers is a little more open (available on GitLab).

Optimizing for different NVIDIA GPU generations (e.g. K80s, P100s, V100s, A100s, T4s, H100s, etc) is a tricky business, and I'm wondering what those NGC containers are currently designed for (my guess is the newer hardware, i.e. A100 and up).

According to their website: "Ampere, Turing, Volta, and Pascal". But I agree with your guess: most of the more modern optimizations won't make a difference for anything below sm 7.0, so these NGC containers likely won't help much for anything besides Volta and Ampere, maybe the T4 as well...
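If anyone wants to check their hardware, recent drivers let nvidia-smi report the compute capability directly (assuming a reasonably new driver; older ones need `nvidia-smi -q` instead):

```sh
# sm 7.0 = V100 (Volta), 7.5 = T4 (Turing), 8.0 = A100 (Ampere)
nvidia-smi --query-gpu=name,compute_cap --format=csv
```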

ngam commented 2 years ago

I will try to upload some images later this week. We can at least document the process for interested community members if they have access to V100 or A100 GPUs and want some more performance!

ngam commented 2 years ago

xref: https://github.com/ngam/ngc-ext-pangeo/issues/24. Run `docker pull ngam00/ngc-pt-pangeo` to pull an image; there are still missing packages: https://github.com/ngam/ngc-ext-pangeo/issues/21

ngam commented 2 years ago

Another note: last night I started an effort in conda-forge to build packages using singularity (as an alternative to docker). This way, people like me who have access to vast computational resources but no docker can help build complex packages that are not possible under the 6-hour limit for open-source repos on public CI. For now, we rely on the goodwill and generosity of maintainers to build locally, meaning spending literally days to build packages like tensorflow and pytorch, upload them separately, review the logs manually, and then move them to the conda-forge channel. On an HPC, we can cut that by 10x or so.
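The rough idea, as an untested sketch (the image and bind path mirror what conda-forge's Docker-based build uses; treat the exact paths as assumptions):

```sh
# Run a feedstock's Linux build inside Singularity instead of Docker,
# reusing conda-forge's own build image (paths/env illustrative).
singularity pull cf-anvil.sif docker://quay.io/condaforge/linux-anvil-cos7-x86_64
singularity exec --bind "$PWD":/home/conda/feedstock_root cf-anvil.sif \
    bash /home/conda/feedstock_root/.scripts/build_steps.sh
```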

I am also trying to add the optimizations from these NGC containers to conda-forge's build recipes, so the gap between the two should narrow significantly going forward. And with the singularity build process, I am hoping people can help lessen the strain on the main maintainers by offering to build on their HPCs. (Fwiw, the cpu tensorflow builds from conda-forge somehow outperform other builds, so we may end up really making up ground and maybe even improving on the NGC containers' performance if all goes well!)

ngam commented 2 years ago

Some results were provided by @weiji14 in https://github.com/ngam/ngc-ext-pangeo/issues/24

https://www.diffchecker.com/ZTpD1Par

I've squeezed XLA support into conda-forge's tensorflow 2.8.1, and I am slowly trying to copy all the optimizations from the NGC containers to conda-forge (but with these big projects, it will take some time).

ngam commented 2 years ago

In #345 I pushed tf to cuda112 and added pytorch to the ML notebook (not proposing to remove the pytorch notebook for now, so as not to disrupt people's workflows).

Quite soon, we should be able to add jaxlib==*=*cuda* to the mix. The issue with that is the icu and absl migrations --- these lead to wide and confusing conflicts, especially with the very complex geo packages. However, I am closing in on getting them all in order.

Feel free to drop the pytorch part in #345 to keep it simple, but it is absolutely doable to combine all of them into one notebook. Note the size won't change much; the added pytorch will be just a few hundred MB.
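For context, combining them amounts to solving one environment against a single CUDA 11.2 toolchain, roughly along these lines (the build-string pins are illustrative, not the image's actual spec):

```sh
# One cudatoolkit shared by all three frameworks, all from conda-forge.
mamba install -c conda-forge \
    "cudatoolkit=11.2" "tensorflow=*=*cuda112*" "pytorch=*=*cuda112*" "jaxlib=*=*cuda*"
```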