
nvidia-container-toolkit broken and cgroups v2 issues #289

Open RafalSkolasinski opened 2 years ago

RafalSkolasinski commented 2 years ago

How did you upgrade to 21.10? (Fresh install / Upgrade)

Upgrade from 21.04 (actually it was quite accidental, in the sense that I was not aware 21.10 was still in beta :))

Related Application and/or Package Version (run apt policy $PACKAGE NAME):

nvidia-container-toolkit:
  Installed: 1.5.1-1
  Candidate: 1.5.1-1
  Version table:
 *** 1.5.1-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
        100 /var/lib/dpkg/status
     1.5.0-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.4.2-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.4.1-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.4.0-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.3.0-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.2.1-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.2.0-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.1.2-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.1.1-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.1.0-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.0.5-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.0.4-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.0.3-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.0.2-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages

Issue/Bug Description:

The package nvidia-container-toolkit was missing. Previously it was provided by the System76 Pop PPA:

nvidia-container-toolkit:
  Installed: 1.5.1-1pop1~1627998766~21.04~9847cf2
  Candidate: 1.5.1-1pop1~1627998766~21.04~9847cf2
  Version table:
 *** 1.5.1-1pop1~1627998766~21.04~9847cf2 1001
       1001 http://ppa.launchpad.net/system76/pop/ubuntu hirsute/main amd64 Packages
        100 /var/lib/dpkg/status

I had to try to get it from NVIDIA's repository for an older Ubuntu release with

distribution=ubuntu20.04
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
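
For reference, something like the following should show which repository the packages now come from and which libnvidia-container version actually got installed (assuming the usual package split into nvidia-container-toolkit, libnvidia-container1 and libnvidia-container-tools):

# Confirm where the packages now come from and which versions got installed.
apt policy nvidia-container-toolkit libnvidia-container1 libnvidia-container-tools
# Print the version of the CLI that the Docker hook calls (the component emitting the cgroup error below).
nvidia-container-cli --version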

but then I was getting

$ docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.
ERRO[0000] error waiting for container: context canceled 

It seems that this is an issue with cgroups v2 (googling for the error leads to quite a few issues already reported out there - I will try to compile a list later), and the workaround (not a solution) seemed to be

sudo kernelstub -a "systemd.unified_cgroup_hierarchy=0"
sudo update-initramfs -c -k all
sudo reboot
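
To confirm which cgroup hierarchy is actually in effect before and after the reboot, something like this should do (not specific to Pop!_OS):

# Filesystem type mounted at /sys/fs/cgroup:
#   cgroup2fs -> unified cgroup v2 hierarchy (the failing case)
#   tmpfs     -> legacy cgroup v1 layout (after the workaround)
stat -fc %T /sys/fs/cgroup/
# Check that the flag added via kernelstub really made it onto the kernel command line.
grep -o 'systemd.unified_cgroup_hierarchy=0' /proc/cmdline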

Steps to reproduce (if you know):

  1. Get Pop!_OS 21.10
  2. Install nvidia-container-toolkit (and other nvidia stuff)
  3. Try to use a docker run --gpus all ... command (a quick isolation check follows this list)
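
As a quick isolation check, running the same image with and without --gpus should show that only the GPU-enabled run fails, i.e. the problem sits in the NVIDIA prestart hook rather than in Docker itself (sketch, using the image from above):

# Runs fine: no NVIDIA hook is involved without --gpus.
docker run --rm nvidia/cuda:10.0-base echo ok
# Fails at container create with the cgroup error, because the nvidia-container-cli hook runs here.
docker run --rm --gpus all nvidia/cuda:10.0-base echo ok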

Expected behavior:

It works fine, with output along the lines of

docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
Thu Nov 11 10:21:10 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   39C    P8     7W / 185W |   1486MiB /  7979MiB |     19%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Other Notes:

Happy to provide additional information. I planned to roll my machine back to 21.04 but decided to postpone that by a day or two in case you'd like to get some more information about the problem or have some advice.

RafalSkolasinski commented 2 years ago

May be related: https://github.com/NVIDIA/nvidia-docker/issues/1447

elezar commented 1 year ago

Note that only versions after v1.8.0 of the NVIDIA Container Toolkit (including libnvidia-container1) support cgroupv2. Please install a more recent version and see if this addresses your issue.
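
A rough sketch of that upgrade, assuming the NVIDIA apt repository configured above already carries the newer packages and that your kernelstub supports removing boot options with -d:

# Pull the newer packages from the configured NVIDIA repository.
sudo apt-get update
sudo apt-get install --only-upgrade -y nvidia-container-toolkit libnvidia-container1 libnvidia-container-tools
# The reported version should be recent enough per the note above.
nvidia-container-cli --version
# Once on a cgroup v2 capable version, the earlier workaround can be dropped again.
sudo kernelstub -d "systemd.unified_cgroup_hierarchy=0"
sudo reboot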