When I checked it, it was not necessary because that was the default path, but I need to double-check. I was trying on SLE15sp6, but I am getting a missing dependency when installing the rpms from https://download.nvidia.com/suse/sle15sp6. I'll check next week with SLE15sp5 and see if I can reproduce your issue.
grep nvidia /etc/containerd/config.toml
That does seem like the wrong path, unless you're using the system containerd for some reason. Our managed containerd config is at /var/lib/rancher/rke2/agent/etc/containerd/config.toml; that's where you'd want to check to confirm that the nvidia runtimes are being picked up by RKE2.
I have just checked and the docs are OK; I can see the nvidia operator working.
Closing with https://github.com/rancher/rke2-docs/pull/264.
I have added the path /var/lib/rancher/rke2/agent/etc/containerd/config.toml to the gpu-operator ClusterPolicy, but the NVIDIA driver libs/bins are still not mounted inside the GPU pod.
sles-rke2-node:~ # kubectl get crd | grep -i nvidia
clusterpolicies.nvidia.com 2024-09-30T19:17:46Z
nvidiadrivers.nvidia.com 2024-09-30T19:17:46Z
sles-rke2-node:~ #
sles-rke2-node:~ # kubectl edit clusterpolicies.nvidia.com
toolkit:
  enabled: true
  env:
  - name: CONTAINERD_CONFIG
    value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml
  - name: CONTAINERD_SOCKET
    value: /run/k3s/containerd/containerd.sock
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "true"
As soon as I replaced the file path /var/lib/rancher/rke2/agent/etc/containerd/config.toml with /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl in the gpu-operator ClusterPolicy:
toolkit:
  enabled: true
  env:
  - name: CONTAINERD_CONFIG
    value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
all NVIDIA driver libraries and binaries were mounted inside the GPU pod. See the output of mount | grep -i nvidia inside the GPU pod container in the first comment.
@manuelbuil anything referring to /etc/containerd/config.toml would be incorrect, as that is not the correct path for the RKE2-managed containerd config.
In addition to that, the correct path needs to be added to the env var values in the HelmChart here, as mentioned above.
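For illustration, a minimal sketch of what that could look like, assuming the operator is deployed through an RKE2 HelmChart manifest as in the docs (the repo, namespaces, and env values here just mirror what is discussed in this thread and are not authoritative):

apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: gpu-operator
  namespace: kube-system
spec:
  repo: https://helm.ngc.nvidia.com/nvidia
  chart: gpu-operator
  targetNamespace: gpu-operator
  createNamespace: true
  valuesContent: |-
    toolkit:
      env:
      # point the toolkit at the RKE2-managed containerd, not the system one
      - name: CONTAINERD_CONFIG
        value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml
      - name: CONTAINERD_SOCKET
        value: /run/k3s/containerd/containerd.sock
      - name: CONTAINERD_RUNTIME_CLASS
        value: nvidia
      - name: CONTAINERD_SET_AS_DEFAULT
        value: "true"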
I'm now confused for different reasons. Let's take them one at a time. First one: either our testing is not complete or there is something I can't replicate in my env. Let me explain what I do to test this: I follow the instructions in our docs: https://docs.rke2.io/advanced#deploy-nvidia-operator. And I can see the nvidia operator running:
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-8lhfc 1/1 Running 0 133m
gpu-operator-6855b5d55d-qtphk 1/1 Running 0 133m
gpu-operator-node-feature-discovery-gc-6dd8c5b44f-96gcn 1/1 Running 0 133m
gpu-operator-node-feature-discovery-master-f7b95b446-mbwld 1/1 Running 0 133m
gpu-operator-node-feature-discovery-worker-8wtrc 1/1 Running 0 133m
nvidia-container-toolkit-daemonset-bvnxb 1/1 Running 0 133m
nvidia-cuda-validator-g66q9 0/1 Completed 0 132m
nvidia-dcgm-exporter-kpx65 1/1 Running 0 133m
nvidia-device-plugin-daemonset-gqrt4 1/1 Running 0 133m
nvidia-operator-validator-5btmq 1/1 Running 0 133m
Everything seems correct. Then I deploy the test pod that uses the GPU:
- name: cuda-container
  image: nvcr.io/nvidia/k8s/cuda-sample:nbody
  args: ["nbody", "-gpu", "-benchmark"]
  resources:
    limits:
      nvidia.com/gpu: 1
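For anyone reproducing this, a self-contained manifest along those lines looks roughly like the following (a sketch: the pod name and restartPolicy are illustrative, and runtimeClassName: nvidia assumes the nvidia runtime class is present on the cluster):

apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark   # illustrative name
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia    # assumes the nvidia RuntimeClass exists
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody
    args: ["nbody", "-gpu", "-benchmark"]
    resources:
      limits:
        nvidia.com/gpu: 1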
And the logs seem correct because they detect the GPU and even do a small benchmark:
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Maxwell" with compute capability 5.2
> Compute 5.2 CUDA device: [Tesla M60]
16384 bodies, total time for 10 iterations: 37.037 ms
= 72.477 billion interactions per second
= 1449.546 single-precision GFLOP/s at 20 flops per interaction
Therefore, my guess is that the NVIDIA driver is correctly exposed to that pod; otherwise the logs would look different, wouldn't they?
@sandipt Is the test I am describing not good enough? Can you help us understand what would be a good test to verify things are correct?
I think the problem is not that it doesn't work, it's that the validation steps show some incorrect paths.
Go inside your GPU pod using kubectl exec --stdin --tty <pod-name> -- /bin/bash and run the mount | grep -i nvidia command. It should show the following nvidia binaries and libraries mounted.
sles-rke2-node:~ # kubectl exec --stdin --tty pytorch-test -- /bin/bash
root@pytorch-test:/workspace# mount | grep -i nvidia
tmpfs on /proc/driver/nvidia type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=555,inode64)
/dev/sda3 on /usr/bin/nvidia-smi type btrfs (ro,nosuid,nodev,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
/dev/sda3 on /usr/bin/nvidia-debugdump type btrfs (ro,nosuid,nodev,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
/dev/sda3 on /usr/bin/nvidia-persistenced type btrfs (ro,nosuid,nodev,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
/dev/sda3 on /usr/bin/nvidia-cuda-mps-control type btrfs (ro,nosuid,nodev,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
/dev/sda3 on /usr/bin/nvidia-cuda-mps-server type btrfs (ro,nosuid,nodev,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
/dev/sda3 on /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.560.35.03 type btrfs (ro,nosuid,nodev,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
/dev/sda3 on /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.560.35.03 type btrfs (ro,nosuid,nodev,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
/dev/sda3 on /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.560.35.03 type btrfs (ro,nosuid,nodev,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
/dev/sda3 on /usr/lib/x86_64-linux-gnu/libnvidia-gpucomp.so.560.35.03 type btrfs (ro,nosuid,nodev,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
/dev/sda3 on /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.560.35.03 type btrfs (ro,nosuid,nodev,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
/dev/sda3 on /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.560.35.03 type btrfs (ro,nosuid,nodev,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
/dev/sda3 on /usr/lib/x86_64-linux-gnu/libnvidia-pkcs11.so.560.35.03 type btrfs (ro,nosuid,nodev,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
/dev/sda3 on /usr/lib/x86_64-linux-gnu/libnvidia-pkcs11-openssl3.so.560.35.03 type btrfs (ro,nosuid,nodev,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
/dev/sda3 on /usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.560.35.03 type btrfs (ro,nosuid,nodev,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
/dev/sda3 on /usr/lib/x86_64-linux-gnu/libvdpau_nvidia.so.560.35.03 type btrfs (ro,nosuid,nodev,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
/dev/sda3 on /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.560.35.03 type btrfs (ro,nosuid,nodev,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
/dev/sda3 on /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.560.35.03 type btrfs (ro,nosuid,nodev,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
devtmpfs on /dev/nvidiactl type devtmpfs (ro,nosuid,noexec,size=4096k,nr_inodes=1048576,mode=755,inode64)
devtmpfs on /dev/nvidia-uvm type devtmpfs (ro,nosuid,noexec,size=4096k,nr_inodes=1048576,mode=755,inode64)
devtmpfs on /dev/nvidia-uvm-tools type devtmpfs (ro,nosuid,noexec,size=4096k,nr_inodes=1048576,mode=755,inode64)
devtmpfs on /dev/nvidia0 type devtmpfs (ro,nosuid,noexec,size=4096k,nr_inodes=1048576,mode=755,inode64)
proc on /proc/driver/nvidia/gpus/0000:13:00.0 type proc (ro,nosuid,nodev,noexec,relatime)
root@pytorch-test:/workspace# mount | grep -i cuda
/dev/sda3 on /usr/bin/nvidia-cuda-mps-control type btrfs (ro,nosuid,nodev,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
/dev/sda3 on /usr/bin/nvidia-cuda-mps-server type btrfs (ro,nosuid,nodev,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
/dev/sda3 on /usr/lib/x86_64-linux-gnu/libcuda.so.560.35.03 type btrfs (ro,nosuid,nodev,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
/dev/sda3 on /usr/lib/x86_64-linux-gnu/libcudadebugger.so.560.35.03 type btrfs (ro,nosuid,nodev,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots/1/snapshot)
@sandipt There is a bug in our code and we need to add the nvidia cdi runtime to k3s and rke2. I'll update the docs once the code is merged and ping you so that you can quickly test it if you have time. Thanks for helping us!
Hey @sandipt, do you use the following envs in your pod:
env:
- name: NVIDIA_VISIBLE_DEVICES
  value: all
- name: NVIDIA_DRIVER_CAPABILITIES
  value: all
and pass runtimeClassName: nvidia?
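In other words, something like this in the pod spec (the pod name, container name, and image here are just placeholders to show where the fields go):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test              # placeholder name
spec:
  runtimeClassName: nvidia
  containers:
  - name: cuda-container      # placeholder container
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
    resources:
      limits:
        nvidia.com/gpu: 1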
I have not set them myself; I think they are set by default. I see the following envs set on my pod:
NVIDIA_VISIBLE_DEVICES=GPU-860e7217-d99b-3462-35a4-14da5d0ecfd9
NVIDIA_DRIVER_CAPABILITIES=compute,utility,video
In the gpu-operator CRD clusterpolicies.nvidia.com I have the following settings for the container toolkit:
toolkit:
  enabled: true
  env:
  - name: CONTAINERD_CONFIG
    value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
  - name: CONTAINERD_SOCKET
    value: /run/k3s/containerd/containerd.sock
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "true"
Did we fix this issue/doc?
Waiting on the rke2 release to update the docs
@sandipt There is a bug in our code and we need to add the nvidia cdi runtime to k3s and rke2. I'll update the docs once the code is merged and ping you so that you can quickly test it if you have time. Thanks for helping us!
Is this code public? If yes, can you share the fix and a link to the change?
If you use the latest rke2 version, you should be able to see the mounts you were missing before. When executing grep nvidia /var/lib/rancher/rke2/agent/etc/containerd/config.toml, you should see the nvidia.cdi runtime, which is the one that provides those mounts.
Hi rke2 team, I see that NVIDIA libraries and binaries are not mounted inside the GPU pod when using the GPU operator installation method at https://docs.rke2.io/advanced#deploy-nvidia-operator.
Between the doc at https://docs.rke2.io/advanced#deploy-nvidia-operator and the doc at https://documentation.suse.com/suse-ai/1.0/html/NVIDIA-Operator-installation/index.html, the only difference I see is the file name /var/lib/rancher/rke2/agent/etc/containerd/**config.toml.tmpl** in the CONTAINERD_CONFIG env.
I did a quick test on one of my SLES 15 SP5 nodes: I replaced config.toml with config.toml.tmpl in CONTAINERD_CONFIG, and all NVIDIA libs/bins were mounted inside the containers. I think you need to test and update the doc https://docs.rke2.io/advanced#deploy-nvidia-operator for SLES OS.