k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

Issues with nvidia device plugin #4391

Closed: paulfantom closed this issue 3 years ago

paulfantom commented 3 years ago

Environmental Info:

K3s Version:

1.22.3-rc4+k3s1

Node(s) CPU architecture, OS, and Version:

Linux metal01 5.4.0-89-generic #100-Ubuntu SMP Fri Sep 24 14:50:10 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

1 control-plane node, 3 agents

Describe the bug:

nvidia-device-plugin is crashlooping with the following errors:

failed to try resolving symlinks in path "/var/log/pods/kube-system_nvidia-device-plugin-daemonset-twrj5_07b07c46-45aa-4b4d-b30d-06054a939784/nvidia-device-plugin-ctr/1.log": lstat /var/log/pods/kube-system_nvidia-device-plugin-daemonset-twrj5_07b07c46-45aa-4b4d-b30d-06054a939784/nvidia-device-plugin-ctr/1.log: no such file or directory

Steps To Reproduce:

Expected behavior:

Nvidia device plugin is not crashlooping

Actual behavior:

Nvidia plugin is crashlooping and GPU is not usable.

Additional context / logs:

I upgraded the cluster from 1.21, where the GPU was using runc v1 and everything worked fine with a custom containerd config. After upgrading and wiping the whole node, I ran into issues with NVML initialization. After following the steps described in https://github.com/k3s-io/k3s/issues/4070, I reached a state where the container cannot be started because of the log error mentioned above. Other pods on that node that use the default runtimeClass work just fine.

At this point I am not sure whether this is an issue on my side, on the nvidia-plugin side, or in k3s, so any help would be appreciated.

My deployment manifests are available at https://github.com/thaum-xyz/ankhmorpork/tree/master/base/kube-system/device-plugins

Backporting

brandond commented 1 year ago

The docs have been updated to include coverage of the runtime auto-detection behavior and a suggestion to install the other nvidia tools: https://docs.k3s.io/advanced#nvidia-container-runtime-support

I'm going to lock this issue for now; if anyone has suggestions for docs improvements please open a PR!
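
For anyone debugging a similar setup, one quick check is whether the containerd config that K3s generated on the GPU node actually contains an `nvidia` runtime entry. The following is a minimal sketch in Python, not something taken from this thread: the helper name `configured_runtimes` is hypothetical, the config path is the assumed default K3s agent location, and the regex only looks at the trailing `runtimes.<name>` part of each TOML table header because the plugin key varies between containerd versions.

```python
import re
from pathlib import Path

# Assumption: default location of the containerd config generated by a K3s agent.
K3S_CONTAINERD_CONFIG = Path("/var/lib/rancher/k3s/agent/etc/containerd/config.toml")

# Matches table headers that end in `runtimes.<name>]`, with the name either
# quoted ("nvidia") or bare (runc). Sub-tables such as `...runtimes.runc.options`
# are deliberately not matched.
_RUNTIME_HEADER = re.compile(
    r'^\s*\[[^\]]*\.runtimes\.(?:"([^"\]]+)"|([A-Za-z0-9_-]+))\]\s*$',
    re.MULTILINE,
)


def configured_runtimes(config_text: str) -> set[str]:
    """Return the runtime names declared in a containerd config (hypothetical helper)."""
    return {quoted or bare for quoted, bare in _RUNTIME_HEADER.findall(config_text)}


if __name__ == "__main__":
    if K3S_CONTAINERD_CONFIG.exists():
        runtimes = configured_runtimes(K3S_CONTAINERD_CONFIG.read_text())
        print("runtimes found:", sorted(runtimes))
        print("nvidia runtime configured:", "nvidia" in runtimes)
    else:
        print(f"{K3S_CONTAINERD_CONFIG} not found; is this a K3s agent node?")
```

If `nvidia` is missing from the result, the runtime was probably not detected on that node, and pods that request an `nvidia` runtimeClass are unlikely to start there.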