jupyterhub / repo2docker

Turn repositories into Jupyter-enabled Docker images
https://repo2docker.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Using Nvidia CUDA with repo2docker #471

Open jzf2101 opened 5 years ago

jzf2101 commented 5 years ago

Related to https://github.com/jupyterhub/team-compass/issues/52 and https://github.com/jupyterhub/team-compass/issues/85 and the fork from @spMohanty that includes CUDA in the base image

@choldgraf was curious about how we could vary the base image in r2d. @minrk thought we could create an additional configuration file to specify the base image, e.g. runtime.txt. We should probably require Ubuntu as the OS?

This was mentioned in this month's meeting

betatim commented 5 years ago

Should we re-title this issue to "Use-cases for changing the base image"? Otherwise this will become a mixture of "how to do GPU support" and "base images for X and Y and Z".

spMohanty commented 5 years ago

@jzf2101 : Good to see some momentum around this. repo2docker with GPU support is something we have been actively using on crowdAI. We managed to collect about 800 repositories (and numerous tags in each of the repositories), amounting to ~1TB of code, that are repo2docker (the crowdai fork) compatible, for two of the NIPS competitions this year that I was co-organizing. And we have a few more challenges on crowdAI coming up which will aggressively use the same setup.

The way we do it is very hacky and non-standard. We simply change the base image to nvidia/cuda:9.0-cudnn7-runtime-ubuntu16.04, which ties all submissions to cuda-9.0 and cudnn7, while in many cases users would want to choose which cuda version and cudnn version they want to build against.

Unfortunately, given the licensing situation with cuda and cudnn, the only way we can have cuda/cudnn in built images is if we build on top of the official base images released by nvidia: https://hub.docker.com/r/nvidia/cuda/tags/ All of them are built on top of Ubuntu 14.04/16.04/18.04.

We have seen some weird behaviour in terms of reproducibility when building Docker images with these images as the base image: for some reason they abruptly update old tags, sometimes breaking production code or some weird dependencies. I wish they versioned the images better. But this would be an important consideration from repo2docker's point of view.

So ideally, in the context of GPU support, the configuration we would need in runtime.txt would be the cuda version and the cudnn version; when present, the buildpack should try to build a GPU compatible image (with an appropriate message of course).
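To make that concrete, a purely hypothetical sketch of what such a runtime.txt could look like (none of these keys exist in repo2docker today; the cuda/cudnn lines are invented here only to illustrate the proposal):

# runtime.txt (hypothetical syntax, not implemented)
python-3.6
cuda-9.0
cudnn-7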

And we would really love to see this released as a part of repo2docker. Our fork is already 324 commits behind, and as a small team focused on quite a few things, it might not be possible to keep catching up with repo2docker all the time :D

But as I mentioned, it's great to see some momentum around GPU support here, and I would love to see this shipped with repo2docker soon :D

betatim commented 5 years ago

The Dockerfile shown in https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/994#issue-373992464 suggests that we do not need to switch base images to be able to support running on an Nvidia GPU. Is there something missing from the linked Dockerfile (it has only been tested on GKE)?

If that Dockerfile also works on a local machine (maybe with nvidia-docker) then this isn't a use case that motivates adding the ability to switch base images.

spMohanty commented 5 years ago

@betatim : I quickly skimmed through it, but I think I understand what's happening with it.

GPU support for the k8s service on GKE, which I think is still in beta, mounts the cuda drivers from the host nodes. GKE ensures that the cuda drivers are all accessible at a particular location (usually /usr/local/nvidia) on the host machine, so that the k8s pods can mount them on the go.

But that is something which would be hard for us (and many other repo2docker users) to use in the long run, because there you basically pin a whole cluster to a particular cuda version. And this might have many unintended side effects; for instance, the tensorflow-gpu pip package was not supported with cuda-9.1, and users might want to explicitly choose the cuda and cudnn versions they want to run their code against, which might have huge impacts on the performance of their code.

Hence, in an ideal case, we would want each repository to specify its own cuda/cudnn versions and build off a base image which has the necessary drivers in the image itself.

And that Dockerfile will not work on a local machine, unless you ensure that cuda is present at the correct location and is mounted onto the container during runtime at the appropriate location(s).

jzf2101 commented 5 years ago

@betatim suggests we may need to call:

repo2docker --no-run https://myrepo.com will build an image from the repo that you can then run with nvidia-docker run <name-of-image-here>
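Roughly (a sketch; the repository URL and image name are placeholders):

# Build the image from the repository without starting it
repo2docker --no-run --image-name my-gpu-repo https://github.com/some-user/some-repo

# Run it through the Nvidia runtime so the container can see the host GPUs
nvidia-docker run -it --rm -p 8888:8888 my-gpu-repo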

spMohanty commented 5 years ago

@jzf2101 : Ahh! Well, nvidia-docker will install the basic nvidia drivers depending on the gpu on the host machine, but not the full cuda toolkit. The nvidia drivers enable the container to "see" the GPUs on the host machine, but they still expect a CUDA capable container to be able to use the GPUs in any sensible way.

https://devtalk.nvidia.com/default/topic/1033038/cuda-setup-and-installation/does-nvidia-docker-install-cuda-and-nvidia-driver-in-docker-/

And the licensing of CUDA (and cudnn, etc.) makes it very non-trivial to efficiently package them into prebuilt images, hence most people just start off from the official nvidia/cuda base images. Installing cuda through scripts etc. would require you to expressly agree to the terms of usage by nvidia during the installation, which cannot be abstracted away in automated build processes (at least in a legal way).

jzf2101 commented 5 years ago

So you're not calling the nvidia runtime when you run crowdai-repo2docker?

spMohanty commented 5 years ago

@jzf2101 : No, when we use crowdai-repo2docker we do not even need nvidia-docker; the goal there is to build a CUDA capable docker image.

When we run the docker image, we need nvidia-docker if running locally, or the nvidia-gpu-device-plugin daemonset on a k8s cluster (while requesting nvidia.com/gpu: 1).
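For reference, a minimal sketch of what requesting a GPU from the device plugin looks like in a pod spec (the image name is a placeholder; the resource key assumes the standard Nvidia device plugin):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  containers:
    - name: notebook
      image: my-gpu-repo        # placeholder: a CUDA capable image built by repo2docker
      resources:
        limits:
          nvidia.com/gpu: 1     # satisfied by the nvidia device plugin daemonset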

choldgraf commented 5 years ago

what if we did the following:

  • Added a new parameter to repo2docker, something like --base-image that for now would accept something like org/image:tag.
  • This image, if provided, would replace what's in the buildpack-deps image: https://github.com/jupyter/repo2docker/blob/master/repo2docker/buildpacks/base.py#L13
  • We provide a warning if this is provided that says something like
    Warning: Alternate base image provided. Reproducibility or compatibility with repo2docker build-packs is not guaranteed. repo2docker will only work with base images running <LIST-OF-REQUIRED-THINGS>.
  • Add a documentation page that more explicitly lays out the requirements in the base image.

What do folks think?

jzf2101 commented 5 years ago

FWIW I think we should also provide documentation on how to avoid swapping the image in the case that you DO want CUDA drivers.

betatim commented 5 years ago

To use your GPU there are two things that need doing: install various things inside your container and install various things on the host.

repo2docker can't help you with installing things on the host. In the comment I linked to above they are running on GKE and use some GKE specific way to install things on all the nodes in the cluster. I'd expect users to take care of all that "somehow".

The next question is how do we get stuff into the docker image. This is where a lot of people use the Nvidia image as a base image.

Taking a look at the Dockerfile for nvidia/cuda, it appears to install some apt packages and set some environment variables. What I'd like to confirm is whether the Dockerfile in this comment does the same as the Nvidia Dockerfile. My guess is the answer is yes, but I don't have access to a GPU to check.

# For the latest tag, see: https://hub.docker.com/r/jupyter/datascience-notebook/tags/
FROM jupyter/datascience-notebook:f2889d7ae7d6

# GPU powered ML
# ----------------------------------------
RUN conda install -c conda-forge --yes --quiet \
    tensorflow-gpu \
    cudatoolkit=9.0 && \
    conda clean -tipsy && \
    fix-permissions $CONDA_DIR && \
    fix-permissions /home/$NB_USER

# Allow drivers installed by the nvidia-driver-installer to be located
ENV LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/nvidia/lib64
# Also, utilities like `nvidia-smi` are installed here
ENV PATH=${PATH}:/usr/local/nvidia/bin

Instead of installing apt packages it uses conda-forge packages. This is easier for us to do in repo2docker, as we currently do not support adding extra PPAs before installing user-specified apt packages.

This would give the user control over which version of CUDA to install inside the image, no? It would have to match the drivers installed on the host, but that is also the case when you use the Nvidia provided image.
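As a quick sanity check (a sketch; output depends on the installed driver), the host driver version can be read with nvidia-smi and compared against Nvidia's driver/CUDA compatibility table:

# On the host: report the installed driver version; Nvidia's compatibility table
# then tells you which CUDA toolkit versions this driver supports
nvidia-smi --query-gpu=driver_version --format=csv,noheader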

jzf2101 commented 5 years ago

@parente @consideratio can you confirm? @betatim, https://github.com/jupyter/docker-stacks/issues/745 and https://github.com/conda-forge/conda-forge.github.io/issues/63#issuecomment-403079964 also discuss a lot of the things you have observed.

I have a VM with a GPU and docker set up if you want me to figure out whether they're equivalent; I just don't know how to test it, @betatim.

betatim commented 5 years ago

Could you build the Dockerfile I gave above and then start it? Then check out step 7 of https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/994#issue-373992464 to see if your container can see the GPU and, if yes, whether the tensorflow example notebooks run?
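Something along these lines should do as a first check (a sketch; the image name is a placeholder and it assumes nvidia-docker is set up on the VM):

# Does the container see the GPU at all?
nvidia-docker run --rm my-gpu-image nvidia-smi

# Can TensorFlow use it? (TF 1.x API, matching the tensorflow-gpu / cudatoolkit=9.0 pins above)
nvidia-docker run --rm my-gpu-image python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"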

betatim commented 5 years ago

For the neurips.mybinder.org deployment (Binderhub with GPUs) we managed to deploy several repos without making any change to repo2docker. Instead we installed a few extra libraries via existing mechanisms.

This made me think: can we support the CUDA use-case via a build pack that installs additional packages and sets environment variables? We'd need a way to trigger the build pack.

jzf2101 commented 5 years ago

I think that's a great idea

choldgraf commented 5 years ago

I think we can use the neurips blog post to structure the stuff that we did in a way that'll make it easier to decide what changes could be made to repo2docker (e.g. a buildpack) to accommodate this. One of the reasons I'd like to get that post ready sooner rather than later is so that we don't forget all the stuff we did to make it work :-)

rbavery commented 5 years ago

Hey all, I'm interested in using repo2docker to build a cuda enabled image from an existing github repo. Any updates on this issue?

betatim commented 5 years ago

You can build GPU enabled images with repo2docker by installing the cudatoolkit conda package. Check out the demo at https://github.com/jzf2101/GAN_tutorial/tree/gpu-binder which we used at NeurIPS. No changes to repo2docker were required.
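The approach boils down to an environment.yml along these lines (a sketch; the exact pins in the demo repo may differ, and the cudatoolkit version has to match the host driver):

name: gpu-demo
channels:
  - conda-forge
dependencies:
  - python=3.6
  - tensorflow-gpu
  - cudatoolkit=9.0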

rbavery commented 5 years ago

Ah awesome, thanks @betatim

ctr26 commented 3 years ago

You can build GPU enabled images with repo2docker by installing the cudatoolkit conda package. Check out the demo at https://github.com/jzf2101/GAN_tutorial/tree/gpu-binder which we used at NeurIPS. No changes to repo2docker were required.

This solution doesn't work for me and I really have no clue why. When I request a gpu node on my cluster, nvidia-smi does become available, but with the cuda toolkit installed via conda alone, PyTorch does not see any Nvidia driver.

For my pods to work with pytorch, they need the cuda compatibility packages installed from Nvidia's private package source; I've tested this a lot today, cutting parts out of dockerfiles and rebuilding.

My solutions are:

  • Change the base image of repo2docker from the bionic buildpack base to Nvidia cuda
  • Or somehow add additional sources (as in the Nvidia cuda docker image, see the sketch below) https://gitlab.com/nvidia/container-images/cuda/blob/master/dist/11.2.1/ubuntu18.04-x86_64/base/Dockerfile
  • Or run a post-build admin script (doesn't exist yet?) in repo2docker
  • Or use the crowdai repo2docker image instead of repo2docker (currently doesn't work)
  • Or use a Dockerfile and circumvent most of repo2docker (less desirable)
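For the second option, this is roughly what the linked CUDA image does to add the Nvidia apt source (a paraphrased sketch; the key URL and package names reflect that era and may have changed since):

# Paraphrased sketch of how the linked nvidia/cuda ubuntu18.04 Dockerfile adds the NVIDIA apt source
RUN apt-get update && apt-get install -y --no-install-recommends gnupg2 curl ca-certificates && \
    curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub | apt-key add - && \
    echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda.list && \
    apt-get update && apt-get install -y --no-install-recommends \
        cuda-cudart-11-2 \
        cuda-compat-11-2 && \
    rm -rf /var/lib/apt/lists/*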

Do you guys have any more ideas?

ctr26 commented 3 years ago

To use your GPU there are two things that need doing: install various things inside your container and install various things on the host.

repo2docker can't help you with installing things on the host. In the comment I linked to above they are running on GKE and use some GKE specific way to install things on all the nodes in the cluster. I'd expect users to take care of all that "somehow".

The next question is how do we get stuff into the docker image. This is where a lot of people use the Nvidia image as a base image.

Taking a look at the Dockerfile for nvidia/cuda it appears to install some apt packages and set some environment variables. What I'd like to confirm is if the Dockerfile in this comment does the same as the Nvidia Dockerfile does. My guess is the answer is yes, but I don't have access to a GPU to check.

# For the latest tag, see: https://hub.docker.com/r/jupyter/datascience-notebook/tags/
FROM jupyter/datascience-notebook:f2889d7ae7d6

# GPU powered ML
# ----------------------------------------
RUN conda install -c conda-forge --yes --quiet \
    tensorflow-gpu \
    cudatoolkit=9.0 && \
    conda clean -tipsy && \
    fix-permissions $CONDA_DIR && \
    fix-permissions /home/$NB_USER

# Allow drivers installed by the nvidia-driver-installer to be located
ENV LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/nvidia/lib64
# Also, utilities like `nvidia-smi` are installed here
ENV PATH=${PATH}:/usr/local/nvidia/bin

Instead of installing apt packages it uses conda-forge packages. This is easier for us to do in repo2docker as we currently do not support adding extra PPAs before installing user specified apt packages.

This would give the user control over which version of CUDA to install inside the image no? It would have to match the drivers installed on the host but that is also the case when you use the Nvidia provided image.

@betatim I have a GPU-enabled binderhub I'm testing on; I could give you access if you would like to play? Running this Dockerfile now.

ctr26 commented 3 years ago

FROM tensorflow/tensorflow:latest-gpu-jupyter

# --- Jupyter

# install the notebook package
RUN pip install --no-cache --upgrade pip && \
    pip install --no-cache notebook

# create user with a home directory
ARG NB_USER
ARG NB_UID
ENV USER ${NB_USER}
ENV HOME /home/${NB_USER}

RUN adduser --disabled-password \
    --gecos "Default user" \
    --uid ${NB_UID} \
    ${NB_USER}
WORKDIR ${HOME}
USER ${USER}

# RUN conda install pip --yes

COPY . .

RUN pip install --no-cache-dir -r requirements.txt

Also, note that this does work.
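For anyone wanting to try it outside BinderHub, a sketch of building and testing it by hand (the image tag is a placeholder; NB_USER/NB_UID must be supplied because the ARGs have no defaults, and --gpus needs Docker 19.03+ with the Nvidia container toolkit on the host):

docker build -t gpu-notebook --build-arg NB_USER=jovyan --build-arg NB_UID=1000 .
# Quick GPU check (assumes a TF 2.x base tag)
docker run --rm --gpus all gpu-notebook python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"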

ctr26 commented 3 years ago

@betatim

Would you be interested in testing your solution on a gpu enabled binderhub?

adriendelsalle commented 3 years ago

I ran into similar questions these days; I'll try to contribute to this discussion.

Should we re-title this issue to "Use-cases for changing the base image"? Otherwise this will become a mixture of "how to do GPU support" and "base images for X and Y and Z".

It looks like there are multiple subjects:

  1. host configuration
  2. spawner configuration
  3. image definition

1. Host configuration

The host configuration is what is installed on bare metal. Depending on how you get the hardware, you may have to handle that configuration yourself or even be unable to change anything on the host:

What has to be installed on the host is basically:

2. Spawner configuration

To keep JHub terminology, let's call a spawner the engine/tool used to run a container from an image. It has to give access to the host GPU(s) resource(s).

repo2docker offers the capability to run images using the docker engine. By default, docker does not expose GPU devices to containers (docker run ubuntu:focal nvidia-smi won't work on a host that has a GPU device).
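For example (assuming a host with the Nvidia driver and, for the second command, Docker 19.03+ with the Nvidia container toolkit installed):

# No GPU devices are exposed by default, so this fails even on a GPU host:
docker run --rm ubuntu:focal nvidia-smi

# The devices have to be requested explicitly:
docker run --rm --gpus all ubuntu:focal nvidia-smi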

For k8s use cases, it probably has to be handled in downstream repos such as jupyterhub or binderhub with:

3. Image definition

Images need to be capable of running code (e.g. from a notebook) on GPU device(s).

Depending on the library used (pytorch, tensorflow, etc.), you may need extra dependencies to be installed to be able to execute your code. Those deps should be pinned by your project (or your dependencies) and installed at build time by the package manager associated with your manifest/spec file (selecting the appropriate BuildPack).

CUDA particular case

Available libraries

Depending on which API level is required (driver, runtime or CUDA libs), you may need various extra packages to be installed:

Base image

This final image will be pretty similar (but surely not equivalent) using:

Even when using a minimal base image, relying on cudatoolkit (which is pretty extensive) leads to large images.

In terms of resource consumption, it would be tempting to think that re-using the same layers through nvidia base images is more efficient than storing a lot of huge layers for every package manager installation.

what if we did the following:

  • Added a new parameter to repo2docker, something like --base-image that for now would accept something like org/image:tag.
  • This image, if provided, would replace what's in the buildpack-deps image: https://github.com/jupyter/repo2docker/blob/master/repo2docker/buildpacks/base.py#L13
  • We provide a warning if this is provided that says something like
    Warning: Alternate base image provided. Reproducibility or compatibility with repo2docker build-packs is not guaranteed. repo2docker will only work with base images running <LIST-OF-REQUIRED-THINGS>.
  • Add a documentation page that more explicitly lays out the requirements in the base image.

What do folks think?

cc @choldgraf

If you are using the conda ecosystem/BuildPack on top of an nvidia base image, it would result in having the same large CUDA libraries installed twice:

To actually save space by re-using the same layers from nvidia base images, it would probably require more complex (perhaps too complex or risky) strategies:

But then, AFAIK it would also mean doing crazy things like:

Maybe some of you found much more efficient strategies!

ctr26 commented 3 years ago

I have a working example of installing cuda and all the requisite libraries using just the conda buildpack at

GitHub.com/ctr26/ZeroCostDl4Mic

My GPU k8s cluster is currently down but it works locally atm.

ymoisan commented 2 years ago

FWIW I manually replace this line in the template with

FROM nvidia/cuda:11.2.0-cudnn8-runtime-ubuntu18.04

Or whatever cuda/cudnn combination is installed on the destination host. We have to add a couple of apt installs using that image instead of the default bionic, but once that's done all I have to do is toggle that line to build with or without GPU.

It would make sense for some sort of enable_gpu + cuda_version parameters to be used to implement that toggling directly in base.py, using the nvidia base images of the Ubuntu version repo2docker is using, for example. Would that be too simplistic?
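Until something like that exists, the manual toggle can at least be scripted; a rough sketch against a local repo2docker checkout (it assumes the template still starts from the default bionic buildpack-deps image, and the cuda tag is just an example):

# Patch the base image in a local repo2docker checkout before installing/using it
sed -i 's|FROM buildpack-deps:bionic|FROM nvidia/cuda:11.2.0-cudnn8-runtime-ubuntu18.04|' repo2docker/buildpacks/base.py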

ctr26 commented 2 years ago

https://github.com/ctr26/basic-gpu-binder

This works for me on my k8s binderhub deployment
