opencontainers / runc

CLI tool for spawning and running containers according to the OCI specification
https://www.opencontainers.org/
Apache License 2.0

Newer runc versions break support for NVIDIA GPUs #3708

Closed klueska closed 1 year ago

klueska commented 1 year ago

The following logic, originally introduced in libcontainer/cgroups/systemd/common.go and later moved / extended into libcontainer/cgroups/devices/systemd.go, will never match NVIDIA GPU devices.

                } else {
                        // "_ n:m _" rules are just a path in /dev/{block,char}/.
                        switch rule.Type {
                        case devices.BlockDevice:
                                entry.Path = fmt.Sprintf("/dev/block/%d:%d", rule.Major, rule.Minor)
                        case devices.CharDevice:
                                entry.Path = fmt.Sprintf("/dev/char/%d:%d", rule.Major, rule.Minor)
                        }
                        // systemd will issue a warning if the path we give here doesn't exist.
                        // Since all of this logic is best-effort anyway (we manually set these
                        // rules separately to systemd) we can safely skip entries that don't
                        // have a corresponding path.
                        if _, err := os.Stat(entry.Path); err != nil {
                                // Also check /sys/dev so that we don't depend on /dev/{block,char}
                                // being populated. (/dev/{block,char} is populated by udev, which
                                // isn't strictly required for systemd). Ironically, this happens most
                                // easily when starting containerd within a runc created container
                                // itself.

                                // We don't bother with securejoin here because we create entry.Path
                                // right above here, so we know it's safe.
                                if _, err := os.Stat("/sys" + entry.Path); err != nil {
                                        logrus.Warnf("skipping device %s for systemd: %s", entry.Path, err)
                                        continue
                                }
                        }
                }
                deviceAllowList = append(deviceAllowList, entry)

The end result is that containers have access to GPUs when they first come online, but then lose access to them as soon as systemd is triggered to re-evaluate the cgroups it manages (e.g. by something as simple as a systemctl daemon-reload).

The reason is that the NVIDIA driver does not register any of its devices with /sys/dev/char. The NVIDIA driver is closed-source / non-GPL, and is therefore unable to call the standard Linux helpers to register the devices it manages there. In fact, the device nodes that ultimately get created under /dev for all NVIDIA devices are created from user-space code.
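To make the gap concrete, here is a minimal sketch (not runc code; the file name and the /dev/nvidia0 default are just examples) that reads the major:minor of a device node and checks whether the kernel exposes a matching entry under /sys/dev/char. For NVIDIA device nodes the second lookup fails, which is exactly what makes the skip logic quoted above drop them from the systemd allow list.

// sysdev_check.go: a minimal sketch (not runc code) that reads the
// major:minor of a device node and checks whether the kernel has
// registered a matching entry under /sys/dev/char.
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	path := "/dev/nvidia0" // example node; pass another path as the first argument
	if len(os.Args) > 1 {
		path = os.Args[1]
	}

	var st unix.Stat_t
	if err := unix.Stat(path, &st); err != nil {
		fmt.Fprintf(os.Stderr, "stat %s: %v\n", path, err)
		os.Exit(1)
	}
	major := unix.Major(uint64(st.Rdev))
	minor := unix.Minor(uint64(st.Rdev))
	fmt.Printf("%s has device number %d:%d\n", path, major, minor)

	sysPath := fmt.Sprintf("/sys/dev/char/%d:%d", major, minor)
	if _, err := os.Stat(sysPath); err != nil {
		fmt.Printf("not registered with the kernel: %s is missing (%v)\n", sysPath, err)
	} else {
		fmt.Printf("registered with the kernel at %s\n", sysPath)
	}
}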

I have filed an internal issue at NVIDIA to investigate what can be done on the driver front, but until something is put in place, GPU support on newer runcs (with systemd cgroup integration) remains broken.

We are also investigating workarounds to manually create the necessary symlinks under /dev/char, but so far nothing we have come up with is foolproof.

Would it be possible to use the device node under /dev for the DeviceAllowList here rather than relying on the existence of a node under /dev/char or /sys/dev/char?

Here is a simple reproducer using docker on Ubuntu 22.04 (with a single K80 GPU):

$ docker run -d --rm --gpus all \
    --device=/dev/nvidia-uvm \
    --device=/dev/nvidia-uvm-tools \
    --device=/dev/nvidia-modeset \
    --device=/dev/nvidiactl \
    --device=/dev/nvidia0 \
    nvidia/cuda:11.0.3-base-ubuntu20.04 bash -c "while [ true ]; do nvidia-smi -L; sleep 5; done"
bc045274b44bdf6ec2e4cc10d2968d1d2a046c47cad0a1d2088dc0a430add24b

$ docker logs bc045274b44bdf6ec2e4cc10d2968d1d2a046c47cad0a1d2088dc0a430add24b
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)

ubuntu@ip-172-31-57-160:~$ sudo systemctl daemon-reload

ubuntu@ip-172-31-57-160:~$ docker logs bc045274b44bdf6ec2e4cc10d2968d1d2a046c47cad0a1d2088dc0a430add24b
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
Failed to initialize NVML: Unknown Error
Failed to initialize NVML: Unknown Error

Which is then fixed if I (manually) do the following:

$ sudo ln -s ../nvidia0 /dev/char/195:0
$ sudo ln -s ../nvidiactl /dev/char/195:255
$ sudo ln -s ../nvidia-modeset /dev/char/195:254
$ sudo ln -s ../nvidia-uvm /dev/char/510:0
$ sudo ln -s ../nvidia-uvm-tools /dev/char/510:1
$ sudo ln -s ../nvidia-caps/nvidia-cap1 /dev/char/235:1
$ sudo ln -s ../nvidia-caps/nvidia-cap2 /dev/char/235:2

Rerunning the above, the issue is now resolved:

$ docker run -d --rm --gpus all \
    --device=/dev/nvidia-uvm \
    --device=/dev/nvidia-uvm-tools \
    --device=/dev/nvidia-modeset \
    --device=/dev/nvidiactl \
    --device=/dev/nvidia0 \
    nvidia/cuda:11.0.3-base-ubuntu20.04 bash -c "while [ true ]; do nvidia-smi -L; sleep 5; done"
cb1a10d2467b4974006bd8cec66c7528cbf244c4b724b2325390bd0d9ae4e605

$ docker logs cb1a10d2467b
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)

$ sudo systemctl daemon-reload

$ docker logs cb1a10d2467b
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
klueska commented 1 year ago

/cc @kolyshkin

kolyshkin commented 1 year ago

Well, all I can say is that this is complicated; I was looking into it while reviewing PR #3568 (see its comments).

The problem is, the runtime spec (https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md, to be precise) defines a host device by its major:minor pair (the path is only used for the in-container device node). Similarly, both the cgroup v1 and cgroup v2 device controllers use the major:minor pair.

OTOH, systemd uses a path to refer to a device. Since we don't have a way to specify a host path (in the runtime spec, that is), we need a way to convert a major:minor pair into a device path (to supply to systemd). Fortunately, we have the /dev/{block,char}/MM:mm nodes, so normally we can translate easily.

Unfortunately, this does not work for NVIDIA devices. A workaround is proposed in https://github.com/opencontainers/runc/pull/3568, but in essence it is a hardcoded kludge, and unfortunately I don't see a clean way to handle this (other than introducing a host path in the runtime spec, that is).

The non-hardcoded kludge would be to read and parse the whole of /dev, trying to find a node with the needed major:minor. This is time-consuming (we need to stat(2) every file) and may have security implications.
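Purely as an illustration of why that kludge is unattractive (this is not proposed runc code), such a scan could look like the sketch below: it walks /dev, stats every entry, and returns the first character device whose major:minor matches.

// devscan.go: an illustrative sketch of the "parse the whole /dev" idea
// described above; not proposed runc code.
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// findCharDev walks root and returns the path of the first character
// device node whose major:minor matches, or "" if none is found.
func findCharDev(root string, major, minor uint32) (string, error) {
	var found string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return nil // skip entries we cannot visit
		}
		var st unix.Stat_t
		if err := unix.Lstat(path, &st); err != nil {
			return nil // ignore nodes we cannot stat
		}
		if st.Mode&unix.S_IFMT != unix.S_IFCHR {
			return nil
		}
		if unix.Major(uint64(st.Rdev)) == major && unix.Minor(uint64(st.Rdev)) == minor {
			found = path
			return fs.SkipAll // stop walking once we have a match (Go 1.20+)
		}
		return nil
	})
	return found, err
}

func main() {
	// 195:0 is the major:minor of /dev/nvidia0 in the example above.
	path, err := findCharDev("/dev", 195, 0)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("found:", path)
}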

Now, all this is happening because NVIDIA doesn't follow the standard practice of device registration. So, for me, the best solution seems to be creating those /dev/char symlinks you demonstrated above in your workaround from some post-module-load script.

If you have any ideas @klueska please let us know.

klueska commented 1 year ago

Yeah, the hardcoded kludge in https://github.com/opencontainers/runc/pull/3568 is definitely not the answer. It's not even guaranteed to work, as the device majors (for some of the devices) can change, and the nvidia-caps device major is not considered at all.

Two options I see that could help are:

  1. Re-evaluate the set of allowed-devices after the prestart/createRuntime hooks have run
  2. Introduce a new hook before the initial cgroup setup happens

This would allow us to insert a hook that creates the necessary /dev/char files "just-in-time" before the deviceAllowList is populated.

Unfortunately, this won't work for the rootless case, but it would at least provide a well-defined "out" for most of our users.

Regarding:

Now, all this is happening because NVIDIA doesn't follow the standard practice of device registration.

I agree the best solution is for NVIDIA to start registering its devices under /sys/dev. This problem would simply go away if that were done properly. Unfortunately, there is internal resistance to doing so because of GPL compatibility issues. It could be done in the new open-source version of the NVIDIA kernel module (https://github.com/NVIDIA/open-gpu-kernel-modules), but that is not yet production-ready on all systems, and it will likely take years to become so.

Regarding:

So, for me, the best solution seems to be creating those /dev/char symlinks you demonstrated above in your workaround from some post-module-load script.

Unfortunately, simply running a post-module-load script is not sufficient, for two reasons:

  1. Some device nodes only get created long after the module has loaded (in support of something called MIG, which allows a GPU to be dynamically divided into a set of smaller mini-GPUs).
  2. We often load the NVIDIA kernel module from within a container, meaning that such a script would end up creating symlinks under /dev/char relative to the container's root, not the host's root. All of the NVIDIA device nodes we create in this scenario are also relative to the container's /dev, so the script would somehow need to know the actual host paths of these devices in order to create the symlinks under /dev/char anyway.

Workarounds for both (1) and (2) are feasible (e.g. start a long-running script on the host that uses inotifywait to watch for the creation of NVIDIA device nodes either under /dev or at a well-defined location where our "driver container" has mapped its /dev directory back onto the host).
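
For reference, here is a rough sketch of such a host-side watcher (illustrative only; it assumes the github.com/fsnotify/fsnotify package, watches only /dev itself, and leaves out the driver-container and MIG/nvidia-caps handling a real implementation would need). It waits for new nvidia* nodes and creates the corresponding /dev/char/MAJOR:MINOR symlinks, mirroring the manual ln -s workaround shown earlier.

// devchar-watcher.go: a rough sketch of the long-running host-side
// watcher described above; illustrative only, not NVIDIA or runc tooling.
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"strings"

	"github.com/fsnotify/fsnotify"
	"golang.org/x/sys/unix"
)

// linkDevChar creates /dev/char/MAJOR:MINOR -> ../<name> for a character
// device node, mirroring the manual `ln -s` workaround shown earlier.
func linkDevChar(devPath string) error {
	var st unix.Stat_t
	if err := unix.Stat(devPath, &st); err != nil {
		return err
	}
	if st.Mode&unix.S_IFMT != unix.S_IFCHR {
		return nil // not a character device; nothing to do
	}
	link := fmt.Sprintf("/dev/char/%d:%d",
		unix.Major(uint64(st.Rdev)), unix.Minor(uint64(st.Rdev)))
	target, err := filepath.Rel("/dev/char", devPath)
	if err != nil {
		return err
	}
	if err := os.Symlink(target, link); err != nil && !os.IsExist(err) {
		return err
	}
	return nil
}

func main() {
	if err := os.MkdirAll("/dev/char", 0o755); err != nil {
		log.Fatal(err)
	}
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()
	if err := watcher.Add("/dev"); err != nil {
		log.Fatal(err)
	}
	for event := range watcher.Events {
		// Only react to newly created nvidia* device nodes.
		if event.Op&fsnotify.Create == 0 || !strings.HasPrefix(filepath.Base(event.Name), "nvidia") {
			continue
		}
		if err := linkDevChar(event.Name); err != nil {
			log.Printf("linking %s: %v", event.Name, err)
		}
	}
}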

However, we'd much prefer a "just-in-time" solution (if possible) until the NVIDIA driver team is able to come up with a proper solution at the driver level.

klueska commented 1 year ago

Thinking about it a bit more....

Is there any particular reason you can't just use the absolute path to the device node inside the container? From what I understood of your response, device.Path contains this information (modulo the container root, which I assume shouldn't be too hard to make available here). Presumably you create this device node inside the container with the correct major / minor numbers, so why not just point systemd at that instead of at its "original" reference on the host under /dev/char?

kolyshkin commented 1 year ago

I might finally have a fix for that: https://github.com/opencontainers/runc/pull/3842

kolyshkin commented 1 year ago

@klueska can you please check if the patch from #3842 / #3845 fixes this issue?

klueska commented 1 year ago

@elezar can you take a look while I am away.

elezar commented 1 year ago

Sure. Will have a look tomorrow.

kolyshkin commented 1 year ago

With the fix, I was able to give access to a device which does not have a corresponding /dev/{char,block}/MM:mm entry while using the systemd cgroup manager, so I believe this is now fixed, but I just want to confirm.

elezar commented 1 year ago

I can confirm that the failure mode is no longer triggered with the fixes in place.

I first started to reproduce the behaviour that @klueska described using:

$ runc --version
runc version 1.1.5
commit: v1.1.5-0-gf19387a
spec: 1.0.2-dev
go: go1.19.7
libseccomp: 2.5.3

that was installed on my test system.

I then cloned the runc repo and set up runc-dev as a runtime in my docker daemon config:

{
    "runtimes": {
        "runc-dev": {
            "args": [],
            "path": "/home/ubuntu/src/runc/runc"
        }
    }
}

I then built runc from source at the tips of the main and release-1.1 branches and reran the test using the runc-dev runtime, for the following versions:

$ ./runc --version
runc version 1.1.0+dev
commit: v1.1.0-515-g30f9f808
spec: 1.1.0-rc.1
go: go1.20.3
libseccomp: 2.5.3
$ ./runc --version
runc version 1.1.6+dev
commit: v1.1.6-4-gf72cd0a6
spec: 1.0.2-dev
go: go1.20.3
libseccomp: 2.5.3

This was from a clean boot, with no nvidia-specific symlinks present in /dev/char:

$ ls -al /dev/char | grep nvidia | wc -l
0

cyphar commented 1 year ago

Can we close this now that #3842 / #3845 have been merged (and 1.1.7 is out)? The fix only works with systemd 240+, but there's not much we can do about that.

elezar commented 1 year ago

I will leave it to @klueska to make that call. I do believe he's out for a couple of weeks, so it won't be immediate.

kolyshkin commented 1 year ago

No response from @klueska, so I consider this fixed. A comment can still be added to a closed bug, so feel free to share your findings.

vsxen commented 3 months ago

runc 1.1.12, RHEL 9, systemd 252 still has issues.

kolyshkin commented 3 months ago

runc 1.1.12, RHEL 9, systemd 252 still has issues.

Please file a separate bug with a detailed reproducer.

vsxen commented 3 months ago

# install nvidia driver & nvidia-container-toolkit

nvidia-smi
Tue Jul  9 16:30:21 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:2F:00.0 Off |                    0 |
| N/A   50C    P8             16W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                       Off |   00000000:86:00.0 Off |                    0 |
| N/A   47C    P8             16W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

wget https://github.com/containerd/containerd/releases/download/v1.7.16/cri-containerd-cni-1.7.16-linux-amd64.tar.gz

#  run a container

ctr -n k8s.io run --net-host --cgroup "k8s.slice:demo:nvidia" --runc-systemd-cgroup --rm -t --env NVIDIA_VISIBLE_DEVICES=all --runc-binary /usr/bin/nvidia-container-runtime nvidia/k8s-device-plugin  nvidia-smi bash

# daemon  reload
systemctl daemon-reload

nvidia-smi
Failed to initialize NVML: Unknown Error

nvidia-container-runtime -v
runc version 1.1.12
commit: v1.1.12-0-g51d5e946
spec: 1.0.2-dev
go: go1.20.13
libseccomp: 2.5.2

#systemctl --version
systemd 252 (252-32.el9_4)

Another thing I am curious about is that DeviceAllow never contains any NVIDIA entries:

# /run/systemd/transient/demo-nvidia.scope.d/50-DeviceAllow.conf
# This is a drop-in unit file extension, created via "systemctl set-property"
# or an equivalent operation. Do not edit.
[Scope]
DeviceAllow=
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:1 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/1:8 rwm
DeviceAllow=/dev/char/1:7 rwm
DeviceAllow=/dev/char/1:5 rwm
DeviceAllow=/dev/char/1:3 rwm
DeviceAllow=char-* m
DeviceAllow=block-* m

kolyshkin commented 3 months ago

@vsxen I don't see you explicitly granting access to any nvidia devices.

Also, as I said earlier, please file a separate bug report if you feel something is broken.