google / gvisor

Application Kernel for Containers
https://gvisor.dev
Apache License 2.0

Running runsc with containerd and `--nvproxy=true` removes NVIDIA drivers from container in Kubernetes #9368

Open · PedroRibeiro95 opened 10 months ago

PedroRibeiro95 commented 10 months ago

Description

Hello. I'm trying to get gVisor to work with the NVIDIA drivers in Kubernetes, using the regular AWS EKS Amazon Linux 2 AMI (not the GPU one). I can confirm that both work separately; however, I'm having a lot of trouble getting gVisor to work with the NVIDIA drivers. When I run the nvidia/cuda image using the gVisor runtime class, I can see that the environment variables are correctly set, but the nvidia-smi binary is gone. These are all the files I'm using:

config.toml

root = "/var/lib/containerd"
state = "/run/containerd"
version = 2

[grpc]
  address = "/run/containerd/containerd.sock"

[plugins]

  [plugins."io.containerd.grpc.v1.cri"]
    sandbox_image = "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5"

    [plugins."io.containerd.grpc.v1.cri".cni]
      bin_dir = "/opt/cni/bin"
      conf_dir = "/etc/cni/net.d"

    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "runc"
      discard_unpacked_layers = true

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
            SystemdCgroup = true

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = true

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
          runtime_type = "io.containerd.runsc.v1"

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc.options]
          TypeUrl = "io.containerd.runsc.v1.options"
          ConfigPath = "/etc/containerd/runsc.toml"

    [plugins."io.containerd.grpc.v1.cri".registry]
      config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"

runsc.toml

log_path = "/var/log/runsc/%ID%/shim.log"
log_level = "debug"
[runsc_config]
  nvproxy = "true"
  debug = "true"
  debug-log = "/var/log/runsc/%ID%/gvisor.%COMMAND%.log"

Test pod:

cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-version-check
spec:
  runtimeClassName: gvisor
  affinity:               
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:                      
        - matchExpressions:
          - key: node-role.test.io/gpu
            operator: Exists
  tolerations:              
    - key: node-role.test.io/gpu
      operator: Exists
      effect: NoSchedule
  restartPolicy: OnFailure
  containers:
  - name: nvidia-version-check
    image: "nvidia/cuda:11.0.3-base-ubuntu20.04"
    command: ["tail", "-f", "/dev/null"]
    resources:
      limits:
         nvidia.com/gpu: "1"
EOF

Exec'ing into the pod:

❯ k exec -it nvidia-version-check -- bash
root@nvidia-version-check:/# env | grep NVIDIA
NVIDIA_VISIBLE_DEVICES=GPU-873dadb3-e07f-436d-abc6-4bcea3b3a9e2
NVIDIA_REQUIRE_CUDA=cuda>=11.0 brand=tesla,driver>=418,driver<419
NVIDIA_DRIVER_CAPABILITIES=compute,utility
root@nvidia-version-check:/# dmesg
[    0.000000] Starting gVisor...
[    0.436519] Generating random numbers by fair dice roll...
[    0.723402] Checking naughty and nice process list...
[    0.822312] Creating cloned children...
[    0.918708] Committing treasure map to memory...
[    1.208497] Daemonizing children...
[    1.504345] Mounting deweydecimalfs...
[    1.944738] Creating bureaucratic processes...
[    1.948269] Constructing home...
[    2.122704] Synthesizing system calls...
[    2.155253] Searching for needles in stacks...
[    2.579675] Setting up VFS...
[    2.866641] Setting up FUSE...
[    3.103859] Ready!
root@nvidia-version-check:/# ls /usr/local/cuda
compat  lib64  targets
root@nvidia-version-check:/# which nvidia-smi

I have the NVIDIA Plugin DaemonSet running using the nvidia runtime class.

Steps to reproduce

# Install NVIDIA drivers and container toolkit
sudo yum install -y gcc kernel-devel-$(uname -r)
DRIVER_VERSION=525.60.13
curl -fSsl -O "https://us.download.nvidia.com/tesla/$DRIVER_VERSION/NVIDIA-Linux-x86_64-$DRIVER_VERSION.run"
chmod +x NVIDIA-Linux-x86_64-$DRIVER_VERSION.run
sudo CC=/usr/bin/gcc10-cc ./NVIDIA-Linux-x86_64-$DRIVER_VERSION.run --silent
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo yum install -y nvidia-container-toolkit

# Install runsc and containerd-shim-runsc-v1
ARCH=$(uname -m)
URL=https://storage.googleapis.com/gvisor/releases/release/latest/$ARCH
wget $URL/runsc $URL/containerd-shim-runsc-v1
chmod a+rx runsc containerd-shim-runsc-v1
sudo mv runsc containerd-shim-runsc-v1 /usr/bin

# Update `/etc/containerd/config.toml` to match the one above

# Update `/etc/containerd/runsc.toml` to match the one above

# Restart containerd
sudo systemctl restart containerd

# Deploy the NVIDIA Plugin daemonset 
# (update the affinity to only be scheduled to nodes with GPUs)
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml

# Create the runtime classes
cat <<EOF | kubectl apply -f -  
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: nvidia              
handler: nvidia
EOF         

cat <<EOF | kubectl apply -f -  
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: gvisor              
handler: runsc 
EOF    

# Create the above pod and exec to it

runsc version

runsc version release-20230904.0
spec: 1.1.0-rc.1

docker version (if using docker)

not using docker

uname

Linux ip-10-253-32-249.ec2.internal 5.10.186-179.751.amzn2.x86_64 #1 SMP Tue Aug 1 20:51:38 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

kubectl (if using Kubernetes)

Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.0", GitCommit:"4ce5a8954017644c5420bae81d72b09b735c21f0", GitTreeState:"clean", BuildDate:"2022-05-03T13:46:05Z", GoVersion:"go1.18.1", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.16-eks-2d98532", GitCommit:"af930c12e26ef9d1e8fac7e3532ff4bcc1b2b509", GitTreeState:"clean", BuildDate:"2023-07-28T16:52:47Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}


repo state (if built from source)

_No response_

runsc debug logs (if available)

_No response_
PedroRibeiro95 commented 10 months ago

Adding the logs for the container: logs.zip

ayushr2 commented 8 months ago

Thanks for the very detailed report! Apologies for the delay. nvproxy is not supported with k8s-device-plugin yet, and we haven't investigated what needs to be done to add support. We would appreciate OSS contributions!

We are currently focused on establishing support in GKE. GKE uses a different GPU+container stack. It does not use k8s-device-plugin. It instead has its own device plugin: https://github.com/GoogleCloudPlatform/container-engine-accelerators/tree/master/cmd/nvidia_gpu. This configures the container in a different way. nvproxy in GKE is still experimental, but it works! Please let me know if you want to experiment on GKE, and we can provide more detailed instructions.

To summarize, nvproxy works in the following environments:

  1. Docker: docker run --gpus=... Needs the --nvproxy-docker flag.
  2. nvidia-container-runtime in legacy mode. Needs the --nvproxy-docker flag.
  3. GKE. Does not need the --nvproxy-docker flag.
PedroRibeiro95 commented 8 months ago

Thanks for the followup @ayushr2. In the meantime I've made some progress: by just using nvproxy, bootstrapping the host node with the NVIDIA drivers, and then mounting the driver into the container using hostPath, I can run nvidia-smi successfully. However, it seems it can't fully access the GPU:

==============NVSMI LOG==============

Timestamp                                 : Mon Oct 30 15:53:01 2023
Driver Version                            : 525.60.13
CUDA Version                              : 12.0

Attached GPUs                             : 1
GPU 00000000:00:1E.0
    Product Name                          : Tesla T4
    Product Brand                         : NVIDIA
    Product Architecture                  : Turing
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : GPU access blocked by the operating system
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : GPU access blocked by the operating system
    GPU UUID                              : GPU-3ec3e89a-b2ec-68d1-bb38-3becc2cf55cd
    Minor Number                          : 0
    VBIOS Version                         : Unknown Error
    MultiGPU Board                        : No
    Board ID                              : 0x1e
    Board Part Number                     : GPU access blocked by the operating system
    GPU Part Number                       : GPU access blocked by the operating system
    Module ID                             : GPU access blocked by the operating system
    Inforom Version
        Image Version                     : GPU access blocked by the operating system
        OEM Object                        : Unknown Error
        ECC Object                        : GPU access blocked by the operating system
        Power Management Object           : Unknown Error
    GPU Operation Mode
        Current                           : GPU access blocked by the operating system
        Pending                           : GPU access blocked by the operating system
    GSP Firmware Version                  : 525.60.13
    GPU Virtualization Mode
        Virtualization Mode               : Pass-Through
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x00
        Device                            : 0x1E
        Domain                            : 0x0000
        Device Id                         : 0x1EB810DE
        Bus Id                            : 00000000:00:1E.0
        Sub System Id                     : 0x12A210DE
        GPU Link Info
            PCIe Generation
                Max                       : Unknown Error
                Current                   : Unknown Error
                Device Current            : Unknown Error
                Device Max                : Unknown Error
                Host Max                  : Unknown Error
            Link Width
                Max                       : Unknown Error
                Current                   : Unknown Error
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : GPU access blocked by the operating system
        Replay Number Rollovers           : GPU access blocked by the operating system
        Tx Throughput                     : GPU access blocked by the operating system
        Rx Throughput                     : GPU access blocked by the operating system
        Atomic Caps Inbound               : GPU access blocked by the operating system
        Atomic Caps Outbound              : GPU access blocked by the operating system
    Fan Speed                             : N/A
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 15360 MiB
        Reserved                          : 399 MiB
        Used                              : 2 MiB
        Free                              : 14957 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 2 MiB
        Free                              : 254 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : GPU access blocked by the operating system
        Average FPS                       : GPU access blocked by the operating system
        Average Latency                   : GPU access blocked by the operating system
    FBC Stats
        Active Sessions                   : GPU access blocked by the operating system
        Average FPS                       : GPU access blocked by the operating system
        Average Latency                   : GPU access blocked by the operating system
    Ecc Mode
        Current                           : GPU access blocked by the operating system
        Pending                           : GPU access blocked by the operating system
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : GPU access blocked by the operating system
        Double Bit ECC                    : GPU access blocked by the operating system
        Pending Page Blacklist            : GPU access blocked by the operating system
    Remapped Rows                         : GPU access blocked by the operating system
    Temperature
        GPU Current Temp                  : 22 C
        GPU Shutdown Temp                 : 96 C
        GPU Slowdown Temp                 : 93 C
        GPU Max Operating Temp            : 85 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 13.16 W
        Power Limit                       : 70.00 W
        Default Power Limit               : 70.00 W
        Enforced Power Limit              : 70.00 W
        Min Power Limit                   : 60.00 W
        Max Power Limit                   : 70.00 W
    Clocks
        Graphics                          : 300 MHz
        SM                                : 300 MHz
        Memory                            : 405 MHz
        Video                             : 540 MHz
    Applications Clocks
        Graphics                          : 1590 MHz
        Memory                            : 5001 MHz
    Default Applications Clocks
        Graphics                          : 585 MHz
        Memory                            : 5001 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1590 MHz
        SM                                : 1590 MHz
        Memory                            : 5001 MHz
        Video                             : 1470 MHz
    Max Customer Boost Clocks
        Graphics                          : 1590 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Fabric
        State                             : N/A
        Status                            : N/A
    Processes                             : None

I also tried running this with runtimeClassName: nvidia and the problem didn't happen, so it's definitely a gVisor issue. Unfortunately, GKE is not viable for our use case. I'll try the options you described and see if I can get it working.

ayushr2 commented 8 months ago

However, it seems it can't fully access the GPU

Yeah, I don't think it will work just yet. In GKE, the container spec defines which GPUs to expose in spec.Linux.Devices. However, in the boot logs you attached above, I could not see any such devices defined, so gVisor will not expose any devices.

My best guess is that k8s-device-plugin is creating bind mounts of /dev/nvidia* devices in the container's root filesystem and then expecting the container to be able to access that. That won't work with gVisor with any combination of our --nvproxy flags, because even though the devices exist on the host filesystem, they don't exist in our sentry's /dev filesystem (which is an in-memory filesystem).

In docker mode, the GPU devices are explicitly exposed like this. In GKE, the device files are automatically created here because spec.Linux.Devices defines it. So you could look into adding similar support for k8s-device-plugin environment.
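To make that distinction concrete, here is a rough sketch of the two shapes the OCI spec can take for the same device. The values are illustrative (not taken from your logs), and the types come from the runtime-spec Go bindings:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

func main() {
	// What I suspect k8s-device-plugin is doing: bind-mounting the host device
	// node into the container. The file exists on the host, but not in the
	// sentry's in-memory /dev, so gVisor has nothing to expose.
	bindMount := specs.Mount{
		Destination: "/dev/nvidia0",
		Source:      "/dev/nvidia0",
		Type:        "bind",
		Options:     []string{"rbind", "rw"},
	}

	// What the GKE device plugin produces instead: an explicit entry in
	// spec.Linux.Devices, which runsc uses to create the device file inside
	// the sandbox. The major/minor values here are just typical host examples.
	mode := os.FileMode(0666)
	device := specs.LinuxDevice{
		Path:     "/dev/nvidia0",
		Type:     "c",
		Major:    195,
		Minor:    0,
		FileMode: &mode,
	}

	out, _ := json.MarshalIndent(map[string]interface{}{
		"mount-style (not proxied by nvproxy)": bindMount,
		"device-style (proxied via nvproxy)":   device,
	}, "", "  ")
	fmt.Println(string(out))
}
```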

PedroRibeiro95 commented 8 months ago

Thanks for the detailed reply @ayushr2! Though I'm a bit out of my depth here, your guidance has been very helpful. I'm trying to better understand the differences for GKE; could you please point me to where the container spec/sandbox is defined? I'm not sure if it's possible to try to port that configuration over to Amazon Linux or if I should just try to add the feature directly to the gVisor code you pointed me to.

PedroRibeiro95 commented 8 months ago

I've very naively tried adding the following snippet to runsc/boot/vfs.go:createDeviceFiles:

        // Force the NVIDIA device nodes into the spec so they get created
        // inside the sandbox. Major/minor numbers are copied from the host node.
        mode := os.FileMode(0777)
        info.spec.Linux.Devices = append(info.spec.Linux.Devices, []specs.LinuxDevice{
            {
                Path:     "/dev/nvidia0",
                Type:     "c",
                Major:    195,
                Minor:    0,
                FileMode: &mode,
            },
            {
                Path:     "/dev/nvidia-modeset",
                Type:     "c",
                Major:    195,
                Minor:    254,
                FileMode: &mode,
            },
            {
                // nvidia-uvm's major number is dynamically allocated on the
                // host, so 245 is specific to this node.
                Path:     "/dev/nvidia-uvm",
                Type:     "c",
                Major:    245,
                Minor:    0,
                FileMode: &mode,
            },
            {
                Path:     "/dev/nvidia-uvm-tools",
                Type:     "c",
                Major:    245,
                Minor:    1,
                FileMode: &mode,
            },
        }...)

in order to get the devices created at runtime, but it seems like even this isn't enough.

ayushr2 commented 8 months ago

You probably also want /dev/nvidiactl. You basically want to call this. Usually that is only called for --nvproxy-docker. JUST FOR TESTING, try adding a new flag --nvproxy-k8s and changing the condition on line 1221 to `if info.conf.NVProxyDocker || info.conf.NVProxyK8s { ...`.

Also note that the minor number of /dev/nvidia-uvm is different inside the sandbox, so just copying it from the host won't work.
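Roughly the kind of gating I mean is sketched below, as a standalone illustration only. The flag and helper names (nvproxyK8s, nvidiaDeviceNodes) are made up for this sketch and are not real runsc identifiers, and the major/minor numbers are host-side examples:

```go
package main

import (
	"fmt"
	"os"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// nvidiaDeviceNodes lists the device files the application expects, including
// /dev/nvidiactl, which the snippet above was missing. The major/minor values
// are what a typical host shows; as noted, nvidia-uvm's numbers inside the
// sandbox won't match the host's.
func nvidiaDeviceNodes() []specs.LinuxDevice {
	mode := os.FileMode(0666)
	return []specs.LinuxDevice{
		{Path: "/dev/nvidiactl", Type: "c", Major: 195, Minor: 255, FileMode: &mode},
		{Path: "/dev/nvidia0", Type: "c", Major: 195, Minor: 0, FileMode: &mode},
		{Path: "/dev/nvidia-uvm", Type: "c", Major: 245, Minor: 0, FileMode: &mode},
	}
}

func main() {
	// Stand-ins for the runsc config flags; only the gating logic matters here.
	nvproxyDocker := false
	nvproxyK8s := true // the hypothetical new flag

	if nvproxyDocker || nvproxyK8s {
		for _, d := range nvidiaDeviceNodes() {
			fmt.Printf("would create %s (%s %d:%d)\n", d.Path, d.Type, d.Major, d.Minor)
		}
	}
}
```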

PedroRibeiro95 commented 8 months ago

Yeah, from reading the code and looking at the logs, it seems like gVisor automatically assigns a minor number to the device. Unfortunately, your suggestion still didn't work. I'll leave the container logs here in case you (or anyone who comes across this issue) want to use them for debugging (note that I had already added an nvproxy-automount-dev flag for the same purpose as the nvproxy-k8s flag you suggested): runsc.tar.gz

ayushr2 commented 8 months ago

Got it, thanks for working with me on this.

Just to set expectations, adding support for k8s-device-plugin is currently not on our roadmap. We are focused on maturing GPU support in GKE first. OSS contributions for GPU support in additional environments are appreciated in the meantime!

PedroRibeiro95 commented 8 months ago

No worries! In the meantime, we don't have a strict requirement for NVIDIA support with gVisor, so we can work around it. I'd love to help bring this feature in, but I'd still need to get more familiar with gVisor first. I'll help in any way I can!

github-actions[bot] commented 4 months ago

A friendly reminder that this issue had no activity for 120 days.

sfc-gh-hyu commented 3 months ago

@PedroRibeiro95 Have you done any new research on this? I've been looking into it, and it seems like it should work with the following configuration:

k8s-device-plugin has a config option called DEVICE_LIST_STRATEGY, which allows the device list to be returned as CDI. Once the kubelet receives the allocate response from the device plugin, it should populate the CDI spec file and start the container through containerd (assuming we are just using containerd). containerd will then parse the CDI devices, convert them into devices in the OCI spec file, and pass the spec to runc or runsc. runsc should then just create the Linux devices as @ayushr2 described above. (I am assuming nvidia-container-runtime is not needed in this case, since we don't need the prestart hook?)

I haven't tested any of this; everything above is purely a guess on my part, but let me know whether my reasoning makes sense.
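To illustrate the CDI-to-OCI translation I have in mind, here is a rough sketch. The struct definitions below are just local stand-ins for the real CDI spec types, and the values in the sample JSON are made up:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local stand-ins for the CDI spec types; the real definitions live in the
// CNCF container-device-interface project. Treat this as a sketch of the
// format, not a reference.
type cdiDeviceNode struct {
	Path  string `json:"path"`
	Type  string `json:"type,omitempty"`
	Major int64  `json:"major,omitempty"`
	Minor int64  `json:"minor,omitempty"`
}

type cdiContainerEdits struct {
	DeviceNodes []cdiDeviceNode `json:"deviceNodes"`
}

type cdiDevice struct {
	Name           string            `json:"name"`
	ContainerEdits cdiContainerEdits `json:"containerEdits"`
}

type cdiSpec struct {
	CDIVersion string      `json:"cdiVersion"`
	Kind       string      `json:"kind"`
	Devices    []cdiDevice `json:"devices"`
}

func main() {
	// A made-up CDI spec fragment for a single GPU; major/minor numbers are
	// examples only.
	raw := []byte(`{
	  "cdiVersion": "0.5.0",
	  "kind": "nvidia.com/gpu",
	  "devices": [{
	    "name": "gpu0",
	    "containerEdits": {
	      "deviceNodes": [
	        {"path": "/dev/nvidia0", "type": "c", "major": 195, "minor": 0},
	        {"path": "/dev/nvidiactl", "type": "c", "major": 195, "minor": 255}
	      ]
	    }
	  }]
	}`)

	var spec cdiSpec
	if err := json.Unmarshal(raw, &spec); err != nil {
		panic(err)
	}

	// If containerd applies these edits, each device node should end up as an
	// entry in the OCI spec's linux.devices list, which is what runsc reads.
	for _, d := range spec.Devices {
		for _, n := range d.ContainerEdits.DeviceNodes {
			fmt.Printf("%s=%s -> linux.devices entry %s (%s %d:%d)\n",
				spec.Kind, d.Name, n.Path, n.Type, n.Major, n.Minor)
		}
	}
}
```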

PedroRibeiro95 commented 1 month ago

Hey @sfc-gh-hyu, thanks for the detailed instructions. I haven't revisited this in the meantime as other priorities came up, but I will be testing it again very soon. I'll try what you suggested and report back with more details.