osrf / rocker

A tool to run docker containers with overlays and convenient options for things like GUIs etc.
Apache License 2.0

rocker terminates with "unsatisfied condition: cuda>=10.0, please update your driver to a newer version" #139

Closed 130s closed 3 years ago

130s commented 3 years ago

This issue may be rooted somewhere upstream of rocker, but I'm reporting it here for now.

Problem

rocker terminates with a CUDA error. The host is Ubuntu 20.04, while the Docker image is Ubuntu 16.04.

$ rocker --nvidia --x11 registry.gitlab.com/ppp/product/foo/baa:brancheee bash
:
docker run -it   --rm     --gpus all  -e DISPLAY -e TERM   -e QT_X11_NO_MITSHM=1   -e XAUTHORITY=/tmp/.docker.xauth -v /tmp/.docker.xauth:/tmp/.docker.xauth   -v /tmp/.X11-unix:/tmp/.X11-unix   -v /etc/localtime:/etc/localtime:ro    7375ab5ddb0c bash
/usr/bin/docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=10.0, please update your driver to a newer version, or use an earlier cuda container\\\\n\\\"\"": unknown.

Looks like the Docker image requires CUDA >= 10.0 (the 7.6 in the output below is the cuDNN version, not the CUDA version).

$ docker inspect registry.gitlab.com/ppp/product/foo/baa:brancheee | grep -i nvidia
:
                "NVIDIA_REQUIRE_CUDA=cuda>=10.0 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411",
                "com.nvidia.cudnn.version": "7.6.5.32",
More complete output

```
$ rocker --nvidia --x11 registry.gitlab.com/ppp/product/foo/baa:brancheee bash
Active extensions ['nvidia', 'x11']
Step 1/12 : FROM python:3-stretch as detector
---> b9d77e48a75c
Step 2/12 : RUN mkdir -p /tmp/distrovenv
---> Running in 27e386b5e745
---> a575d3c2c3c3
Step 3/12 : RUN python3 -m venv /tmp/distrovenv
---> Running in 1c4461986583
---> a9314a141e7d
Step 4/12 : RUN . /tmp/distrovenv/bin/activate && pip install distro pyinstaller==4.0 staticx
---> Running in fd905649549d
Collecting distro
Downloading https://files.pythonhosted.org/packages/25/b7/b3c4270a11414cb22c6352ebc7a83aaa3712043be29daa05018fd5a5c956/distro-1.5.0-py2.py3-none-any.whl
Collecting pyinstaller==4.0
Downloading https://files.pythonhosted.org/packages/82/96/21ba3619647bac2b34b4996b2dbbea8e74a703767ce24192899d9153c058/pyinstaller-4.0.tar.gz (3.5MB)
Installing build dependencies: started
Installing build dependencies: finished with status 'done'
Getting requirements to build wheel: started
Getting requirements to build wheel: finished with status 'done'
Preparing wheel metadata: started
Preparing wheel metadata: finished with status 'done'
Collecting staticx
Downloading https://files.pythonhosted.org/packages/2d/91/29baddb74148c1140bf00515ec6f55afc64267e4b925997fb9041913633f/staticx-0.12.1-py3-none-manylinux1_x86_64.whl (154kB)
Collecting pyinstaller-hooks-contrib>=2020.6 (from pyinstaller==4.0)
Downloading https://files.pythonhosted.org/packages/27/c7/58a634d861e4744ac62dca4a4992ace8def8b05dab91e6b25e5043e79acf/pyinstaller_hooks_contrib-2021.1-py2.py3-none-any.whl (181kB)
Collecting altgraph (from pyinstaller==4.0)
Downloading https://files.pythonhosted.org/packages/ee/3d/bfca21174b162f6ce674953f1b7a640c1498357fa6184776029557c25399/altgraph-0.17-py2.py3-none-any.whl
Requirement already satisfied: setuptools in /tmp/distrovenv/lib/python3.7/site-packages (from pyinstaller==4.0) (40.8.0)
Collecting pyelftools (from staticx)
Downloading https://files.pythonhosted.org/packages/6f/50/3d7729d64bb23393aa4c166af250a6e6f9def40c90bf0e9af3c5ad25b6f7/pyelftools-0.27-py2.py3-none-any.whl (151kB)
Building wheels for collected packages: pyinstaller
Building wheel for pyinstaller (PEP 517): started
Building wheel for pyinstaller (PEP 517): finished with status 'done'
Stored in directory: /root/.cache/pip/wheels/cb/91/a9/1e2b69cf9e01f0f6a89c2c6166324319ca6273d26604b200b6
Successfully built pyinstaller
Installing collected packages: distro, pyinstaller-hooks-contrib, altgraph, pyinstaller, pyelftools, staticx
Successfully installed altgraph-0.17 distro-1.5.0 pyelftools-0.27 pyinstaller-4.0 pyinstaller-hooks-contrib-2021.1 staticx-0.12.1
You are using pip version 19.0.3, however version 21.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
:
# if the path is alreaady present don't fail because of being unable to append
RUN ( echo '/usr/local/lib/x86_64-linux-gnu' >> /etc/ld.so.conf.d/glvnd.conf && ldconfig || grep -q /usr/local/lib/x86_64-linux-gnu /etc/ld.so.conf.d/glvnd.conf ) && \
    ( echo '/usr/local/lib/i386-linux-gnu' >> /etc/ld.so.conf.d/glvnd.conf && ldconfig || grep -q /usr/local/lib/i386-linux-gnu /etc/ld.so.conf.d/glvnd.conf )
ENV LD_LIBRARY_PATH /usr/local/lib/x86_64-linux-gnu:/usr/local/lib/i386-linux-gnu${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
COPY --from=glvnd /usr/local/share/glvnd/egl_vendor.d/10_nvidia.json /usr/local/share/glvnd/egl_vendor.d/10_nvidia.json
ENV NVIDIA_VISIBLE_DEVICES ${NVIDIA_VISIBLE_DEVICES:-all}
ENV NVIDIA_DRIVER_CAPABILITIES ${NVIDIA_DRIVER_CAPABILITIES:+$NVIDIA_DRIVER_CAPABILITIES,}graphics
# Snippet from extension [x11]
^^^^^^
Building docker file with arguments: {'path': '/tmp/tmpxy2ov0s4', 'rm': True, 'nocache': False, 'pull': False}
building > Step 1/12 : FROM nvidia/opengl:1.0-glvnd-devel-ubuntu16.04 as glvnd
building > ---> 6424ab2e587b
building > Step 2/12 : FROM registry.gitlab.com/ppp/product/foo/baa:brancheee
building > ---> 790ff34c9564
building > Step 3/12 : USER root
building > ---> Running in d00eb4c3dbe3
building > Removing intermediate container d00eb4c3dbe3
building > ---> 700b2e69d9fb
building > Step 4/12 : COPY --from=glvnd /usr/local/lib/x86_64-linux-gnu /usr/local/lib/x86_64-linux-gnu
building > ---> d77cb14752b0
building > Step 5/12 : COPY --from=glvnd /usr/local/lib/i386-linux-gnu /usr/local/lib/i386-linux-gnu
building > ---> 8a81719ec78b
building > Step 6/12 : COPY --from=glvnd /usr/lib/x86_64-linux-gnu /usr/lib/x86_64-linux-gnu
building > ---> 91c8564a52e6
building > Step 7/12 : COPY --from=glvnd /usr/lib/i386-linux-gnu /usr/lib/i386-linux-gnu
building > ---> 686534c50047
building > Step 8/12 : RUN ( echo '/usr/local/lib/x86_64-linux-gnu' >> /etc/ld.so.conf.d/glvnd.conf && ldconfig || grep -q /usr/local/lib/x86_64-linux-gnu /etc/ld.so.conf.d/glvnd.conf ) && ( echo '/usr/local/lib/i386-linux-gnu' >> /etc/ld.so.conf.d/glvnd.conf && ldconfig || grep -q /usr/local/lib/i386-linux-gnu /etc/ld.so.conf.d/glvnd.conf )
building > ---> Running in f13f5fa17b3b
building > Removing intermediate container f13f5fa17b3b
building > ---> aa5f6b44ef19
building > Step 9/12 : ENV LD_LIBRARY_PATH /usr/local/lib/x86_64-linux-gnu:/usr/local/lib/i386-linux-gnu${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
building > ---> Running in 0c078922802f
building > Removing intermediate container 0c078922802f
building > ---> bb8fef320fc4
building > Step 10/12 : COPY --from=glvnd /usr/local/share/glvnd/egl_vendor.d/10_nvidia.json /usr/local/share/glvnd/egl_vendor.d/10_nvidia.json
building > ---> 3f8aaf0dd992
building > Step 11/12 : ENV NVIDIA_VISIBLE_DEVICES ${NVIDIA_VISIBLE_DEVICES:-all}
building > ---> Running in c86a35b08002
building > Removing intermediate container c86a35b08002
building > ---> dab67a85647e
building > Step 12/12 : ENV NVIDIA_DRIVER_CAPABILITIES ${NVIDIA_DRIVER_CAPABILITIES:+$NVIDIA_DRIVER_CAPABILITIES,}graphics
building > ---> Running in 324e7fe3176b
building > Removing intermediate container 324e7fe3176b
building > ---> 7375ab5ddb0c
building > Successfully built 7375ab5ddb0c
Executing command:
docker run -it --rm --gpus all -e DISPLAY -e TERM -e QT_X11_NO_MITSHM=1 -e XAUTHORITY=/tmp/.docker.xauth -v /tmp/.docker.xauth:/tmp/.docker.xauth -v /tmp/.X11-unix:/tmp/.X11-unix -v /etc/localtime:/etc/localtime:ro 7375ab5ddb0c bash
/usr/bin/docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=10.0, please update your driver to a newer version, or use an earlier cuda container\\\\n\\\"\"": unknown.
```
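To make the failure easier to reason about, here is a rough Python sketch of how the image's `NVIDIA_REQUIRE_CUDA` string could be evaluated against a host. It assumes the usual reading of that variable (space-separated groups are alternatives, comma-separated constraints inside a group must all hold). The `HOST` values are assumptions, not taken from the logs: CUDA 9.1 is what a 390-series driver is assumed to support, and `quadro` is a guessed GPU brand. None of these helper names exist in rocker or in the NVIDIA tooling.

```python
import operator
import re

# Requirement string copied from the `docker inspect` output in this issue.
REQUIREMENT = (
    "cuda>=10.0 "
    "brand=tesla,driver>=384,driver<385 "
    "brand=tesla,driver>=410,driver<411"
)

# Hypothetical host description; cuda and brand are assumptions (see lead-in).
HOST = {"cuda": "9.1", "driver": "390.141", "brand": "quadro"}

_OPS = {
    ">=": operator.ge, "<=": operator.le, "!=": operator.ne,
    "=": operator.eq, ">": operator.gt, "<": operator.lt,
}
_CONSTRAINT = re.compile(r"^(\w+)(>=|<=|!=|=|>|<)(.+)$")


def version_tuple(text):
    """Turn '390.141' into (390, 141) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def constraint_holds(constraint, host):
    """Check one constraint such as 'driver>=384' against the host dict."""
    match = _CONSTRAINT.match(constraint)
    if match is None:
        raise ValueError(f"unrecognized constraint: {constraint!r}")
    name, op, wanted = match.groups()
    actual = host[name]
    if name == "brand":
        # Brands are plain strings, so compare them as text.
        return _OPS[op](actual, wanted)
    return _OPS[op](version_tuple(actual), version_tuple(wanted))


def requirement_met(requirement, host):
    """Any space-separated group may match; within a group all must hold."""
    return any(
        all(constraint_holds(c, host) for c in group.split(","))
        for group in requirement.split()
    )


print(requirement_met(REQUIREMENT, HOST))
```

With the assumed host above, none of the three groups is satisfied, which is consistent with the `unsatisfied condition: cuda>=10.0` error.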

Env

$ apt-cache policy python3-rocker 
python3-rocker:
  Installed: 0.2.3-100
  Candidate: 0.2.3-100

$ nvcc --version

Command 'nvcc' not found, but can be installed with:
sudo apt install nvidia-cuda-toolkit

$ docker inspect registry.gitlab.com/ppp/product/foo/baa:brancheee | grep -i nvidia
                "PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64",
                "NVIDIA_VISIBLE_DEVICES=all",
                "NVIDIA_DRIVER_CAPABILITIES=compute,utility",
                "NVIDIA_REQUIRE_CUDA=cuda>=10.0 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411",
                "com.nvidia.cudnn.version": "7.6.5.32",
                "maintainer": "NVIDIA CORPORATION <cudatools@nvidia.com>"
                "PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64",
                "NVIDIA_VISIBLE_DEVICES=all",
                "NVIDIA_DRIVER_CAPABILITIES=compute,utility",
                "NVIDIA_REQUIRE_CUDA=cuda>=10.0 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411",
                "com.nvidia.cudnn.version": "7.6.5.32",
                "maintainer": "NVIDIA CORPORATION <cudatools@nvidia.com>"
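The `grep` above prints every entry twice, likely because the image metadata carries the environment in more than one section. As an alternative, the JSON that `docker inspect` prints can be parsed directly. The sketch below assumes the output is a JSON array whose first element has a `Config` object with `Env` and `Labels`; the sample document is hand-written from the values shown in this issue, and `image_metadata` is a hypothetical helper name.

```python
import json


def image_metadata(inspect_json, section="Config"):
    """Return (env, labels) dicts from the first object in `docker inspect` JSON."""
    data = json.loads(inspect_json)
    config = data[0].get(section) or {}
    # Env entries look like "NAME=value"; split only on the first '='.
    env = dict(entry.split("=", 1) for entry in (config.get("Env") or []))
    labels = dict(config.get("Labels") or {})
    return env, labels


# Hand-written stand-in for real `docker inspect <image>` output (assumption).
SAMPLE = json.dumps([{
    "Config": {
        "Env": [
            "NVIDIA_VISIBLE_DEVICES=all",
            "NVIDIA_DRIVER_CAPABILITIES=compute,utility",
            "NVIDIA_REQUIRE_CUDA=cuda>=10.0 brand=tesla,driver>=384,driver<385 "
            "brand=tesla,driver>=410,driver<411",
        ],
        "Labels": {"com.nvidia.cudnn.version": "7.6.5.32"},
    }
}])

env, labels = image_metadata(SAMPLE)
print("CUDA requirement:", env["NVIDIA_REQUIRE_CUDA"])
print("cuDNN version:   ", labels["com.nvidia.cudnn.version"])
```

Reading the two values by name keeps the CUDA requirement and the cuDNN version from being confused with each other.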

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:        20.04
Codename:       focal

$ apt-cache policy nvidia-docker2
nvidia-docker2:
  Installed: 2.5.0-1
  Candidate: 2.5.0-1

$ apt list --installed nvidia*
Listing... Done
nvidia-384-dev/focal-updates,focal-security,now 390.141-0ubuntu0.20.04.1 amd64 [installed]
nvidia-compute-utils-390/focal-updates,focal-security,now 390.141-0ubuntu0.20.04.1 amd64 [installed,automatic]
nvidia-container-runtime/bionic,now 3.4.2-1 amd64 [installed,automatic]
nvidia-container-toolkit/bionic,now 1.4.2-1 amd64 [installed,automatic]
nvidia-dkms-390/focal-updates,focal-security,now 390.141-0ubuntu0.20.04.1 amd64 [installed,automatic]
nvidia-docker2/bionic,now 2.5.0-1 all [installed]
nvidia-driver-390/focal-updates,focal-security,now 390.141-0ubuntu0.20.04.1 amd64 [installed,automatic]
nvidia-kernel-common-390/focal-updates,focal-security,now 390.141-0ubuntu0.20.04.1 amd64 [installed,automatic]
nvidia-kernel-source-390/focal-updates,focal-security,now 390.141-0ubuntu0.20.04.1 amd64 [installed,automatic]
nvidia-prime/focal-updates,focal-updates,now 0.8.16~0.20.04.1 all [installed,automatic]
nvidia-settings/focal-updates,now 460.39-0ubuntu0.20.04.1 amd64 [installed,automatic]
nvidia-utils-390/focal-updates,focal-security,now 390.141-0ubuntu0.20.04.1 amd64 [installed,automatic]

$ uname -a
Linux 130s-p50 5.4.0-70-generic #78-Ubuntu SMP Fri Mar 19 13:29:52 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

tfoote commented 3 years ago

I don't know of any good solution for mapping through to an older NVIDIA driver. Part of how Docker works here is to map things through from the driver inside the container to the driver on the host, and if the two are too different I don't know of a way to make them work well together; they need to match. If someone can find a way to make it work, that would be great, but I don't plan to try to solve this.

130s commented 3 years ago

Fair enough. My suggestion was to at least print a message friendly enough for users who aren't container experts. But now that I've posted enough info on this ticket, hopefully those users will find it and see what they can do, which I think is to use the same driver version on both the host and the container.
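For illustration only, a friendlier message could be derived from the Docker error text itself. The following is a minimal sketch, assuming the wording of the `nvidia-container-cli` error stays as shown in this issue; `friendly_hint` is a hypothetical helper and not part of rocker.

```python
import re

# Matches the condition reported by nvidia-container-cli, e.g. "cuda>=10.0".
_UNSATISFIED = re.compile(
    r"nvidia-container-cli: requirement error: "
    r"unsatisfied condition: ([^,\s]+)"
)


def friendly_hint(stderr_text):
    """Return a plain-language hint, or None if the error is unrelated."""
    match = _UNSATISFIED.search(stderr_text)
    if match is None:
        return None
    condition = match.group(1)
    return (
        f"The image requires '{condition}', which the host's NVIDIA driver "
        "does not satisfy. Either update the host driver or use an image "
        "built for an earlier CUDA version."
    )


# Excerpt of the error from this issue, used as sample input.
SAMPLE_ERROR = (
    "stderr: nvidia-container-cli: requirement error: unsatisfied condition: "
    "cuda>=10.0, please update your driver to a newer version, "
    "or use an earlier cuda container"
)

print(friendly_hint(SAMPLE_ERROR))
```

A wrapper could run this on the captured stderr of `docker run` and print the hint alongside the raw error.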

Closing as I see this labeled as wontfix.