Closed paulfantom closed 3 years ago
The docs have been updated to include coverage of the runtime auto-detection behavior and a suggestion to install the other nvidia tools: https://docs.k3s.io/advanced#nvidia-container-runtime-support
I'm going to lock this issue for now; if anyone has suggestions for docs improvements please open a PR!
Environmental Info:
K3s Version:
1.22.3-rc4+k3s1
Node(s) CPU architecture, OS, and Version:
Linux metal01 5.4.0-89-generic #100-Ubuntu SMP Fri Sep 24 14:50:10 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
1 control-plane node, 3 agents
Describe the bug:
nvidia-device-plugin is crashlooping with the following errors:
Steps To Reproduce:
Expected behavior:
Nvidia device plugin is not crashlooping
Actual behavior:
Nvidia plugin is crashlooping and GPU is not usable.
Additional context / logs:
I upgraded the cluster from 1.21, where the GPU was using runc v1 and everything worked fine with a custom containerd config. After upgrading and wiping the whole node, I ran into issues with NVML initialization. After following the steps described in https://github.com/k3s-io/k3s/issues/4070, I reached a state where the container cannot be started due to the log message mentioned earlier. Other pods on that node using the default runtimeClass are working just fine.
At this point I am not sure whether this is an issue on my side, on the nvidia-plugin side, or in k3s, so any help would be appreciated.
My deployment manifests are available at https://github.com/thaum-xyz/ankhmorpork/tree/master/base/kube-system/device-plugins
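For readers following along, here is a minimal sketch of the kind of RuntimeClass setup this report refers to. It is not copied from the manifests linked above. It assumes that k3s has registered the NVIDIA runtime with containerd under the handler name `nvidia`, as the linked k3s docs describe. The pod name `gpu-test` and the image reference are hypothetical placeholders.

```yaml
# RuntimeClass pointing at the containerd runtime handler.
# Assumption: k3s auto-detected the NVIDIA runtime and named the handler "nvidia".
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
---
# Hypothetical test pod that opts into the non-default runtime.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test            # placeholder name
spec:
  runtimeClassName: nvidia  # without this, the pod uses the default runtime
  restartPolicy: Never
  containers:
    - name: gpu-test
      image: example.invalid/cuda-test:latest  # placeholder image, replace with a real CUDA image
      resources:
        limits:
          nvidia.com/gpu: 1  # resource advertised by the NVIDIA device plugin
```

Pods that do not set `runtimeClassName` keep using the default runtime, which would be consistent with the other pods on the node working normally.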
Backporting