k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

Issues with nvidia device plugin #4391

Closed paulfantom closed 3 years ago

paulfantom commented 3 years ago

Environmental Info: K3s Version:

1.22.3-rc4+k3s1

Node(s) CPU architecture, OS, and Version:

Linux metal01 5.4.0-89-generic #100-Ubuntu SMP Fri Sep 24 14:50:10 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

1 control-plane node, 3 agents

Describe the bug:

nvidia-device-plugin is crashlooping with the following errors:

failed to try resolving symlinks in path "/var/log/pods/kube-system_nvidia-device-plugin-daemonset-twrj5_07b07c46-45aa-4b4d-b30d-06054a939784/nvidia-device-plugin-ctr/1.log": lstat /var/log/pods/kube-system_nvidia-device-plugin-daemonset-twrj5_07b07c46-45aa-4b4d-b30d-06054a939784/nvidia-device-plugin-ctr/1.log: no such file or directory%

Steps To Reproduce:

Expected behavior:

Nvidia device plugin is not crashlooping

Actual behavior:

Nvidia plugin is crashlooping and GPU is not usable.

Additional context / logs:

I upgraded the cluster from 1.21, where the GPU workloads used runc v1 and everything worked fine with a custom containerd config. After the upgrade and wiping the whole node I was presented with issues regarding NVML initialization. After following what was described in https://github.com/k3s-io/k3s/issues/4070 I got to a state where the container cannot be started due to the log message mentioned earlier. Other pods on that node using the default runtimeClass are working just fine.

At the current state I am not sure if this is an issue on my side, on the nvidia-plugin side, or in k3s, so any help would be appreciated.

My deployment manifests are available at https://github.com/thaum-xyz/ankhmorpork/tree/master/base/kube-system/device-plugins


brandond commented 3 years ago

It looks like some of the logs got cleaned up and it's confusing containerd. Do you have anything that might be trying to rotate the pod logs out from under containerd?

You might try running k3s-killall.sh and then rm -rf /var/log/pods, followed by starting K3s again. This will of course terminate all running pods but might fix whatever containerd is struggling with.

paulfantom commented 3 years ago

There is nothing I can think of that would rotate logs. This is a fresh Ubuntu 20.04 instance with only k3s and the nvidia drivers installed.

I did remove everything from /var/log/pods as well as clearing the whole /var/lib/rancher/k3s/agent and /var/lib/kubelet (I actually did it a few times in different orders and steps). It did not help.

I also get other error messages from k3s which most likely relate to the issue:

Nov 04 11:21:08 metal01 k3s[930011]: E1104 11:21:08.364474  930011 pod_workers.go:836] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"nvidia-device-plugin-ctr\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=nvidia-device-plugin-ctr pod=nvidia-device-plugin-daemonset-twrj5_kube-system(07b07c46-45aa-4b4d-b30d-06054a939784)\"" pod="kube-system/nvidia-device-plugin-daemonset-twrj5" podUID=07b07c46-45aa-4b4d-b30d-06054a939784
Nov 04 11:21:08 metal01 k3s[930011]: I1104 11:21:08.573763  930011 pod_container_deletor.go:79] "Container not found in pod's containers" containerID="64d8ac27c5d1500c469dc62279c24d1b536382579973328e1fd3e96a64ed2201"
Nov 04 11:21:09 metal01 k3s[930011]: W1104 11:21:09.140176  930011 manager.go:1176] Failed to process watch event {EventType:0 Name:/kubepods/pod07b07c46-45aa-4b4d-b30d-06054a939784/ff954893be5edf196f2ccdddd950da868060bd8de2cf7aa839894d6964835b23 WatchSource:0}: task ff954893be5edf196f2ccdddd950da868060bd8de2cf7aa839894d6964835b23 not found: not found
Nov 04 11:21:09 metal01 k3s[930011]: W1104 11:21:09.140229  930011 watcher.go:95] Error while processing event ("/sys/fs/cgroup/devices/kubepods/pod07b07c46-45aa-4b4d-b30d-06054a939784/64d8ac27c5d1500c469dc62279c24d1b536382579973328e1fd3e96a64ed2201": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/devices/kubepods/pod07b07c46-45aa-4b4d-b30d-06054a939784/64d8ac27c5d1500c469dc62279c24d1b536382579973328e1fd3e96a64ed2201: no such file or directory
Nov 04 11:21:09 metal01 k3s[930011]: W1104 11:21:09.140293  930011 watcher.go:95] Error while processing event ("/sys/fs/cgroup/memory/kubepods/pod07b07c46-45aa-4b4d-b30d-06054a939784/64d8ac27c5d1500c469dc62279c24d1b536382579973328e1fd3e96a64ed2201": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/memory/kubepods/pod07b07c46-45aa-4b4d-b30d-06054a939784/64d8ac27c5d1500c469dc62279c24d1b536382579973328e1fd3e96a64ed2201: no such file or directory
Nov 04 11:21:09 metal01 k3s[930011]: W1104 11:21:09.140318  930011 watcher.go:95] Error while processing event ("/sys/fs/cgroup/cpu,cpuacct/kubepods/pod07b07c46-45aa-4b4d-b30d-06054a939784/64d8ac27c5d1500c469dc62279c24d1b536382579973328e1fd3e96a64ed2201": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/cpu,cpuacct/kubepods/pod07b07c46-45aa-4b4d-b30d-06054a939784/64d8ac27c5d1500c469dc62279c24d1b536382579973328e1fd3e96a64ed2201: no such file or directory
Nov 04 11:21:09 metal01 k3s[930011]: W1104 11:21:09.140389  930011 watcher.go:95] Error while processing event ("/sys/fs/cgroup/pids/kubepods/pod07b07c46-45aa-4b4d-b30d-06054a939784/64d8ac27c5d1500c469dc62279c24d1b536382579973328e1fd3e96a64ed2201": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/pids/kubepods/pod07b07c46-45aa-4b4d-b30d-06054a939784/64d8ac27c5d1500c469dc62279c24d1b536382579973328e1fd3e96a64ed2201: no such file or directory
Nov 04 11:21:09 metal01 k3s[930011]: W1104 11:21:09.140460  930011 watcher.go:95] Error while processing event ("/sys/fs/cgroup/blkio/kubepods/pod07b07c46-45aa-4b4d-b30d-06054a939784/64d8ac27c5d1500c469dc62279c24d1b536382579973328e1fd3e96a64ed2201": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/blkio/kubepods/pod07b07c46-45aa-4b4d-b30d-06054a939784/64d8ac27c5d1500c469dc62279c24d1b536382579973328e1fd3e96a64ed2201: no such file or directory
dweomer commented 3 years ago

Stopping k3s does not stop the pods that are running. This is by design. What those logs are telling you is that you have deleted locations out from under running (or at least, created) pods. As suggested at https://github.com/k3s-io/k3s/issues/4391#issuecomment-960513374, to stop everything that can be started by k3s you should run k3s-killall.sh. But since it appears that you have deleted locations that pods expect to exist you would likely be better off invoking k3s-uninstall.sh and starting over.

paulfantom commented 3 years ago

> Stopping k3s does not stop the pods that are running.

I know :) Nodes were drained before stopping k3s.

> But since it appears that you have deleted locations that pods expect to exist you would likely be better off invoking k3s-uninstall.sh and starting over.

That's what I did and that's why the instance is fresh as I reinstalled the whole node.


Just to be clear, this is not my first rodeo with Kubernetes and I tried multiple options before writing this issue. The "have you tried turning it off and on again" solution was the first thing I did :) The full chain of events on my side:

1. Upgrade from 1.21 to 1.22, resulting in all pods failing to start due to a missing containerd-shim binary. This is because I was using the nvidia runtime as the default with runc v1.
2. A few tests to figure out what was going on (deleting directories, restarting k3s, etc.). This is when I found https://github.com/k3s-io/k3s/issues/4070
3. Removing the custom containerd config.toml.tmpl and using the default configuration shipped with k3s.
4. Node drained, k3s restarted. All containers starting up apart from the ones using the nvidia runtime, due to the issue described here.
5. Testing a few different configurations of the nvidia-device-plugin pod, but the issue described here seems to be preventing the pod from starting up.
6. Node teardown to discard issues related to stale configurations. Installing only k3s and the nvidia drivers on the new node.
7. Issue still persists.

Right now my gut is telling me that this may be something in the nvidia runtime itself and in the integration of this runtime with k3s. I tried running the pod with the default runtimeClassName and it works just fine (albeit without GPU access). However, setting runtimeClassName: nvidia and recreating the pod leads to the errors regarding log messages and cgroups.

What is surprising to me is that in 1.21 everything works just fine, while 1.22 breaks completely for workloads needing an nvidia GPU.

brandond commented 3 years ago

Hmm, I can't see why it would be required, but we did drop the v1 runtime in 1.22 since it's been deprecated for a while: https://github.com/k3s-io/k3s/pull/3903

The Nvidia plugin should work fine without it unless for some reason you had configured the legacy runtime type in your custom containerd toml?

https://github.com/k3s-io/k3s/issues/3105#issuecomment-906672797

hlacikd commented 3 years ago

I am experiencing exactly the same issue. In 1.21.6 everything works correctly; 1.22.3 adds automatic nvidia-container-runtime detection, but every deployment requesting the nvidia runtimeClass is crashlooping:

 - gopro:deployment/gopro-vcr: container samba in error: &ContainerStateWaiting{Reason:CreateContainerError,Message:failed to get sandbox container task: no running task found: task d751f121e2ec5bee9b43b4c9698d43d31d7cb6cc68ccc59ae0c0b72200b16890 not found: not found,}
    - gopro:pod/gopro-vcr-6c4ccfb587-x98k6: container samba in error: &ContainerStateWaiting{Reason:CreateContainerError,Message:failed to get sandbox container task: no running task found: task d751f121e2ec5bee9b43b4c9698d43d31d7cb6cc68ccc59ae0c0b72200b16890 not found: not found,}
 - gopro:deployment/gopro-vcr: container samba is backing off waiting to restart
    - gopro:pod/gopro-vcr-6c4ccfb587-x98k6: container samba is backing off waiting to restart
      > [gopro-vcr-6c4ccfb587-x98k6 samba] failed to try resolving symlinks in path "/var/log/pods/gopro_gopro-vcr-6c4ccfb587-x98k6_a1325b13-67ab-401a-8919-2bb207641fc0/samba/1.log": lstat /var/log/pods/gopro_gopro-vcr-6c4ccfb587-x98k6_a1325b13-67ab-401a-8919-2bb207641fc0/samba/1.log: no such file or directory
 - gopro:deployment/gopro-vcr failed. Error: container samba is backing off waiting to restart.
brandond commented 3 years ago

@hlacik can you attach the logs from k3s starting up on your node (specifically the runtime detection bit), along with the containerd configuration toml that it is generating?

hlacikd commented 3 years ago

@brandond

jtsna   Ready    control-plane,master   5h39m   v1.22.3+k3s1
jtsnb   Ready    <none>                 5h38m   v1.22.3+k3s1

config.toml

root@jtsna-2111:/var/lib/rancher/k3s/agent/etc/containerd# cat config.toml 

[plugins.opt]
  path = "/var/lib/rancher/k3s/agent/containerd"

[plugins.cri]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  sandbox_image = "rancher/mirrored-pause:3.1"

[plugins.cri.containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true

[plugins.cri.cni]
  bin_dir = "/var/lib/rancher/k3s/data/86a8c46cd5fe617d1c1c90d80222fa4b7e04e7da9b3caace8af4daf90fc5a699/bin"
  conf_dir = "/var/lib/rancher/k3s/agent/etc/cni/net.d"

[plugins.cri.containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins.cri.containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

k3s.service log

-- Logs begin at Wed 2020-04-01 19:23:42 CEST, end at Thu 2021-11-11 22:29:56 CET. --
Nov 11 18:23:25 jtsna-2111 systemd[1]: Starting Lightweight Kubernetes...
Nov 11 18:23:25 jtsna-2111 sh[4246]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Nov 11 18:23:25 jtsna-2111 sh[4258]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Nov 11 18:23:28 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:28+01:00" level=info msg="Starting k3s v1.22.3+k3s1 (61a2aab2)"
Nov 11 18:23:28 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:28+01:00" level=info msg="Cluster bootstrap already complete"
Nov 11 18:23:28 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:28+01:00" level=info msg="Configuring sqlite3 database connection pooling: maxIdleConns=2, maxOpenConns=0, connMaxLifetime=0s"
Nov 11 18:23:28 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:28+01:00" level=info msg="Configuring database table schema and indexes, this may take a moment..."
Nov 11 18:23:28 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:28+01:00" level=info msg="Database tables and indexes are up to date"
Nov 11 18:23:28 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:28+01:00" level=info msg="Kine available at unix://kine.sock"
Nov 11 18:23:28 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:28+01:00" level=info msg="Running kube-apiserver --advertise-port=6443 --allow-privileged=true --anonymous-auth=false --api-audiences=https://kubernetes.default.svc.carpc.local>
Nov 11 18:23:28 jtsna-2111 k3s[4297]: Flag --insecure-port has been deprecated, This flag has no effect now and will be removed in v1.24.
Nov 11 18:23:28 jtsna-2111 k3s[4297]: I1111 18:23:28.588795    4297 server.go:581] external host was not specified, using 172.16.15.8
Nov 11 18:23:28 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:28+01:00" level=info msg="Running kube-scheduler --authentication-kubeconfig=/var/lib/rancher/k3s/server/cred/scheduler.kubeconfig --authorization-kubeconfig=/var/lib/rancher/k>
Nov 11 18:23:28 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:28+01:00" level=info msg="Waiting for API server to become available"
Nov 11 18:23:28 jtsna-2111 k3s[4297]: I1111 18:23:28.640314    4297 server.go:175] Version: v1.22.3+k3s1
Nov 11 18:23:28 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:28+01:00" level=info msg="Running kube-controller-manager --allocate-node-cidrs=true --authentication-kubeconfig=/var/lib/rancher/k3s/server/cred/controller.kubeconfig --author>
Nov 11 18:23:28 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:28+01:00" level=info msg="Running cloud-controller-manager --allocate-node-cidrs=true --authentication-kubeconfig=/var/lib/rancher/k3s/server/cred/cloud-controller.kubeconfig ->
Nov 11 18:23:28 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:28+01:00" level=info msg="Node token is available at /var/lib/rancher/k3s/server/token"
Nov 11 18:23:28 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:28+01:00" level=info msg="To join node to cluster: k3s agent -s https://172.16.15.8:6443 -t ${NODE_TOKEN}"
Nov 11 18:23:28 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:28+01:00" level=info msg="Wrote kubeconfig /etc/rancher/k3s/k3s.yaml"
Nov 11 18:23:28 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:28+01:00" level=info msg="Run: k3s kubectl"
Nov 11 18:23:28 jtsna-2111 k3s[4297]: I1111 18:23:28.681567    4297 shared_informer.go:240] Waiting for caches to sync for node_authorizer
Nov 11 18:23:28 jtsna-2111 k3s[4297]: I1111 18:23:28.818950    4297 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesB>
Nov 11 18:23:28 jtsna-2111 k3s[4297]: I1111 18:23:28.819022    4297 plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimRe>
Nov 11 18:23:28 jtsna-2111 k3s[4297]: I1111 18:23:28.822841    4297 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesB>
Nov 11 18:23:28 jtsna-2111 k3s[4297]: I1111 18:23:28.822893    4297 plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimRe>
Nov 11 18:23:28 jtsna-2111 k3s[4297]: W1111 18:23:28.897222    4297 genericapiserver.go:455] Skipping API apiextensions.k8s.io/v1beta1 because it has no resources.
Nov 11 18:23:28 jtsna-2111 k3s[4297]: I1111 18:23:28.899801    4297 instance.go:278] Using reconciler: lease
Nov 11 18:23:28 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:28+01:00" level=info msg="certificate CN=jtsna signed by CN=k3s-server-ca@1636645919: notBefore=2021-11-11 15:51:59 +0000 UTC notAfter=2022-11-11 17:23:28 +0000 UTC"
Nov 11 18:23:28 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:28+01:00" level=info msg="certificate CN=system:node:jtsna,O=system:nodes signed by CN=k3s-client-ca@1636645919: notBefore=2021-11-11 15:51:59 +0000 UTC notAfter=2022-11-11 17:>
Nov 11 18:23:29 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:29+01:00" level=info msg="Module overlay was already loaded"
Nov 11 18:23:29 jtsna-2111 k3s[4297]: I1111 18:23:29.044782    4297 rest.go:130] the default service ipfamily for this cluster is: IPv4
Nov 11 18:23:29 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:29+01:00" level=info msg="Module br_netfilter was already loaded"
Nov 11 18:23:29 jtsna-2111 k3s[4297]: W1111 18:23:29.089015    4297 sysinfo.go:203] Nodes topology is not available, providing CPU topology
Nov 11 18:23:29 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:29+01:00" level=info msg="Set sysctl 'net/netfilter/nf_conntrack_max' to 131072"
Nov 11 18:23:29 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:29+01:00" level=info msg="Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400"
Nov 11 18:23:29 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:29+01:00" level=info msg="Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600"
Nov 11 18:23:29 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:29+01:00" level=info msg="Set sysctl 'net/ipv4/conf/all/forwarding' to 1"
Nov 11 18:23:29 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:29+01:00" level=info msg="Found nvidia container runtime at /usr/bin/nvidia-container-runtime"
Nov 11 18:23:29 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:29+01:00" level=info msg="Logging containerd to /var/lib/rancher/k3s/agent/containerd/containerd.log"
Nov 11 18:23:29 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:29+01:00" level=info msg="Running containerd -c /var/lib/rancher/k3s/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root >
Nov 11 18:23:29 jtsna-2111 k3s[4297]: W1111 18:23:29.850324    4297 genericapiserver.go:455] Skipping API authentication.k8s.io/v1beta1 because it has no resources.
Nov 11 18:23:29 jtsna-2111 k3s[4297]: W1111 18:23:29.855986    4297 genericapiserver.go:455] Skipping API authorization.k8s.io/v1beta1 because it has no resources.
Nov 11 18:23:29 jtsna-2111 k3s[4297]: W1111 18:23:29.944179    4297 genericapiserver.go:455] Skipping API certificates.k8s.io/v1beta1 because it has no resources.
Nov 11 18:23:29 jtsna-2111 k3s[4297]: W1111 18:23:29.953629    4297 genericapiserver.go:455] Skipping API coordination.k8s.io/v1beta1 because it has no resources.
Nov 11 18:23:29 jtsna-2111 k3s[4297]: W1111 18:23:29.448543    4297 genericapiserver.go:455] Skipping API networking.k8s.io/v1beta1 because it has no resources.
Nov 11 18:23:29 jtsna-2111 k3s[4297]: W1111 18:23:29.460480    4297 genericapiserver.go:455] Skipping API node.k8s.io/v1alpha1 because it has no resources.
Nov 11 18:23:29 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:29+01:00" level=info msg="Waiting for containerd startup: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /run/k3s/cont>
Nov 11 18:23:29 jtsna-2111 k3s[4297]: W1111 18:23:29.618490    4297 genericapiserver.go:455] Skipping API rbac.authorization.k8s.io/v1beta1 because it has no resources.
Nov 11 18:23:29 jtsna-2111 k3s[4297]: W1111 18:23:29.618534    4297 genericapiserver.go:455] Skipping API rbac.authorization.k8s.io/v1alpha1 because it has no resources.
Nov 11 18:23:29 jtsna-2111 k3s[4297]: W1111 18:23:29.624428    4297 genericapiserver.go:455] Skipping API scheduling.k8s.io/v1beta1 because it has no resources.
Nov 11 18:23:29 jtsna-2111 k3s[4297]: W1111 18:23:29.624468    4297 genericapiserver.go:455] Skipping API scheduling.k8s.io/v1alpha1 because it has no resources.
Nov 11 18:23:29 jtsna-2111 k3s[4297]: W1111 18:23:29.640436    4297 genericapiserver.go:455] Skipping API storage.k8s.io/v1alpha1 because it has no resources.
Nov 11 18:23:29 jtsna-2111 k3s[4297]: W1111 18:23:29.648652    4297 genericapiserver.go:455] Skipping API flowcontrol.apiserver.k8s.io/v1alpha1 because it has no resources.
Nov 11 18:23:29 jtsna-2111 k3s[4297]: W1111 18:23:29.669939    4297 genericapiserver.go:455] Skipping API apps/v1beta2 because it has no resources.
Nov 11 18:23:29 jtsna-2111 k3s[4297]: W1111 18:23:29.670005    4297 genericapiserver.go:455] Skipping API apps/v1beta1 because it has no resources.
Nov 11 18:23:29 jtsna-2111 k3s[4297]: W1111 18:23:29.677042    4297 genericapiserver.go:455] Skipping API admissionregistration.k8s.io/v1beta1 because it has no resources.
Nov 11 18:23:29 jtsna-2111 k3s[4297]: I1111 18:23:29.690538    4297 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesB>
Nov 11 18:23:29 jtsna-2111 k3s[4297]: I1111 18:23:29.690589    4297 plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimRe>
Nov 11 18:23:29 jtsna-2111 k3s[4297]: W1111 18:23:29.705227    4297 genericapiserver.go:455] Skipping API apiregistration.k8s.io/v1beta1 because it has no resources.
Nov 11 18:23:30 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:30+01:00" level=error msg="runtime core not ready"
Nov 11 18:23:30 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:30+01:00" level=info msg="Containerd is now running"
Nov 11 18:23:30 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:30+01:00" level=info msg="Connecting to proxy" url="wss://127.0.0.1:6443/v1-k3s/connect"
Nov 11 18:23:30 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:30+01:00" level=info msg="Handling backend connection request [jtsna]"
Nov 11 18:23:30 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:30+01:00" level=info msg="Running kubelet --address=0.0.0.0 --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --cgroup-driver=cgroupfs --c>
Nov 11 18:23:30 jtsna-2111 k3s[4297]: Flag --cloud-provider has been deprecated, will be removed in 1.23, in favor of removing cloud provider code from Kubelet.
Nov 11 18:23:30 jtsna-2111 k3s[4297]: Flag --cni-bin-dir has been deprecated, will be removed along with dockershim.
Nov 11 18:23:30 jtsna-2111 k3s[4297]: Flag --cni-conf-dir has been deprecated, will be removed along with dockershim.
Nov 11 18:23:30 jtsna-2111 k3s[4297]: Flag --containerd has been deprecated, This is a cadvisor flag that was mistakenly registered with the Kubelet. Due to legacy concerns, it will follow the standard CLI deprecation timeline before bei>
Nov 11 18:23:30 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:30+01:00" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:6443/v1-k3s/readyz: 500 Internal Server Error"
Nov 11 18:23:30 jtsna-2111 k3s[4297]: I1111 18:23:30.692377    4297 server.go:436] "Kubelet version" kubeletVersion="v1.22.3+k3s1"
Nov 11 18:23:30 jtsna-2111 k3s[4297]: I1111 18:23:30.757884    4297 dynamic_cafile_content.go:155] "Starting controller" name="client-ca-bundle::/var/lib/rancher/k3s/agent/client-ca.crt"
Nov 11 18:23:35 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:35+01:00" level=error msg="runtime core not ready"
Nov 11 18:23:35 jtsna-2111 k3s[4297]: time="2021-11-11T18:23:35+01:00" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:6443/v1-k3s/readyz: 500 Internal Server Error"
Nov 11 18:23:35 jtsna-2111 k3s[4297]: W1111 18:23:35.790140    4297 sysinfo.go:203] Nodes topology is not available, providing CPU topology
Nov 11 18:23:35 jtsna-2111 k3s[4297]: I1111 18:23:35.798286    4297 server.go:687] "--cgroups-per-qos enabled, but --cgroup-root was not specified.  defaulting to /"
Nov 11 18:23:35 jtsna-2111 k3s[4297]: I1111 18:23:35.800477    4297 container_manager_linux.go:280] "Container manager verified user specified cgroup-root exists" cgroupRoot=[]
Nov 11 18:23:35 jtsna-2111 k3s[4297]: I1111 18:23:35.800713    4297 container_manager_linux.go:285] "Creating Container Manager object based on Node Config" nodeConfig={RuntimeCgroupsName: SystemCgroupsName: KubeletCgroupsName: Container>
Nov 11 18:23:35 jtsna-2111 k3s[4297]: I1111 18:23:35.804549    4297 topology_manager.go:133] "Creating topology manager with policy per scope" topologyPolicyName="none" topologyScopeName="container"
Nov 11 18:23:35 jtsna-2111 k3s[4297]: I1111 18:23:35.804611    4297 container_manager_linux.go:320] "Creating device plugin manager" devicePluginEnabled=true
Nov 11 18:23:35 jtsna-2111 k3s[4297]: I1111 18:23:35.805290    4297 state_mem.go:36] "Initialized new in-memory state store"
Nov 11 18:23:35 jtsna-2111 k3s[4297]: I1111 18:23:35.807245    4297 kubelet.go:418] "Attempting to sync node with API server"
Nov 11 18:23:35 jtsna-2111 k3s[4297]: I1111 18:23:35.807356    4297 kubelet.go:279] "Adding static pod path" path="/var/lib/rancher/k3s/agent/pod-manifests"
Nov 11 18:23:35 jtsna-2111 k3s[4297]: I1111 18:23:35.808649    4297 kubelet.go:290] "Adding apiserver pod source"
Nov 11 18:23:35 jtsna-2111 k3s[4297]: I1111 18:23:35.811299    4297 apiserver.go:42] "Waiting for node sync before watching apiserver pods"
Nov 11 18:23:35 jtsna-2111 k3s[4297]: I1111 18:23:35.823719    4297 kuberuntime_manager.go:245] "Container runtime initialized" containerRuntime="containerd" version="v1.5.7-k3s2" apiVersion="v1alpha2"
Nov 11 18:23:35 jtsna-2111 k3s[4297]: I1111 18:23:35.831055    4297 server.go:1213] "Started kubelet"
Nov 11 18:23:35 jtsna-2111 k3s[4297]: I1111 18:23:35.833142    4297 server.go:149] "Starting to listen" address="0.0.0.0" port=10250
Nov 11 18:23:35 jtsna-2111 k3s[4297]: I1111 18:23:35.835980    4297 server.go:409] "Adding debug handlers to kubelet server"
Nov 11 18:23:35 jtsna-2111 k3s[4297]: I1111 18:23:35.842247    4297 secure_serving.go:266] Serving securely on 127.0.0.1:6444
hlacikd commented 3 years ago

@brandond The generated config.toml seems fine to me; this is what I was using on 1.21.6, where I added it manually via config.toml.tmpl.

root@jtsnb-2111:~# cat /var/lib/rancher/k3s/agent/etc/containerd/config.toml

[plugins.opt]
  path = "/var/lib/rancher/k3s/agent/containerd"

[plugins.cri]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  sandbox_image = "rancher/pause:3.1"

[plugins.cri.containerd]
  disable_snapshot_annotations = true
  snapshotter = "overlayfs"

[plugins.cri.cni]
  bin_dir = "/var/lib/rancher/k3s/data/e265ce840ebe0eaaebfc0eba8cac0a94057c6bccadc5a194b2db1b07e65f63a0/bin"
  conf_dir = "/var/lib/rancher/k3s/agent/etc/cni/net.d"

[plugins.cri.containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

# BEGIN nvidia-container-runtime
[plugins.cri.containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

  [plugins.cri.containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
# END nvidia-container-runtime

and it was working. It seems identical except for the quotes around the runtime name, which I don't think make any difference.

I also want to note that this is the same OS (Ubuntu 20.04 on arm64). I removed 1.21.5 via k3s-uninstall.sh and installed a fresh 1.22.3, so I can confirm it has nothing to do with OS configuration/packages.

This is the RuntimeClass with handler nvidia:

apiVersion: node.k8s.io/v1 # RuntimeClass is defined in the node.k8s.io API group
kind: RuntimeClass
metadata:
  name: nvidia # The name the RuntimeClass will be referenced by
  # RuntimeClass is a non-namespaced resource
handler: nvidia # The name of the corresponding CRI configuration

which I am using in deployments when I want to use the nvidia-container-runtime:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vcr
spec:
  selector:
    matchLabels:
      app: vcr
  template:
    metadata:
      labels:
        app: vcr
    spec:
      imagePullSecrets:
        - name: registry-pcr-docker
      terminationGracePeriodSeconds: 10

      runtimeClassName: nvidia
      containers:
      - name: vcr-0
        image: vcr
        args:
          - --rhost=dev-basler
          - --key=basler:0
          - --chunk_duration=60
        volumeMounts:
          - mountPath: /videos
            name: videos
          - mountPath: /tmp/argus_socket
            name: argus

      volumes:
        - name: videos
          persistentVolumeClaim:
            claimName: videos
        - name: argus
          hostPath:
            path: /tmp/argus_socket
paulfantom commented 3 years ago

Small log update: I am getting the following events on nvidia-device-plugin v0.10.0 pod start (using the nvidia runtimeClass):

  Normal   Scheduled       95m                   default-scheduler  Successfully assigned kube-system/nvidia-device-plugin-daemonset-h4v5c to metal01
  Normal   Pulling         92m                   kubelet            Pulling image "nvidia/k8s-device-plugin:v0.10.0"
  Normal   Pulled          88m                   kubelet            Successfully pulled image "nvidia/k8s-device-plugin:v0.10.0" in 3m35.07358951s
  Warning  Failed          88m                   kubelet            Error: failed to get sandbox container task: no running task found: container not created: not found
  Warning  Failed          88m                   kubelet            Error: failed to create containerd task: failed to create shim: OCI runtime create failed: container_linux.go:364: creating new parent process caused: container_linux.go:2005: running lstat on namespace path "/proc/4076575/ns/ipc" caused: lstat /proc/4076575/ns/ipc: no such file or directory: unknown
  Normal   Pulled          88m (x2 over 88m)     kubelet            Container image "nvidia/k8s-device-plugin:v0.10.0" already present on machine
  Normal   Created         88m (x2 over 88m)     kubelet            Created container nvidia-device-plugin-ctr
  Warning  Failed          88m                   kubelet            Error: sandbox container "437bb0ac4e63c34e8a678754a1ac4dd71d72cf2ab27a5555aed5b46f193f849b" is not running
  Warning  BackOff         88m (x7 over 88m)     kubelet            Back-off restarting failed container
  Normal   SandboxChanged  88m (x9 over 88m)     kubelet            Pod sandbox changed, it will be killed and re-created.
  Warning  FailedSync      83m                   kubelet            error determining status: rpc error: code = NotFound desc = an error occurred when try to find sandbox: not found
  Normal   Pulled          83m                   kubelet            Container image "nvidia/k8s-device-plugin:v0.10.0" already present on machine
  Normal   Created         83m                   kubelet            Created container nvidia-device-plugin-ctr
  Warning  Failed          83m                   kubelet            Error: sandbox container "ea086ddd278fac26f2d55d82f0f4cafb41e7582796a50d199c1971896b0a886b" is not running
  Normal   SandboxChanged  78m (x278 over 83m)   kubelet            Pod sandbox changed, it will be killed and re-created.
  Warning  BackOff         73m (x523 over 83m)   kubelet            Back-off restarting failed container
  Normal   Created         70m                   kubelet            Created container nvidia-device-plugin-ctr
  Warning  Failed          70m                   kubelet            Error: failed to create containerd task: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:402: getting the final child's pid from pipe caused: EOF: unknown
  Normal   Pulled          70m (x2 over 70m)     kubelet            Container image "nvidia/k8s-device-plugin:v0.10.0" already present on machine
  Warning  BackOff         10m (x2792 over 70m)  kubelet            Back-off restarting failed container
  Normal   SandboxChanged  50s (x3385 over 70m)  kubelet            Pod sandbox changed, it will be killed and re-created.
ctso commented 3 years ago

I can confirm I'm having the same issue here. Works fine on v1.21.6, but does not work on v1.22.3.

brandond commented 3 years ago

I suspect that perhaps the nvidia device plugin isn't compatible with containerd 1.5?

brandond commented 3 years ago

@kralicky have you tried this out at all?

kralicky commented 3 years ago

I have seen the exact errors @paulfantom has and I believe this is related to the clone3/seccomp updates that are in the latest containerd. The workaround for now is to make all pods which use the nvidia container runtime privileged. It is possible that this is an issue on nvidia's end but I am not 100% sure if that is the case.
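
For reference, a minimal sketch of that workaround applied to a device-plugin DaemonSet like the one in this thread might look as follows. It is trimmed to the relevant fields; the labels, image tag, and volume paths are illustrative and should match your actual manifest:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds      # illustrative label; keep whatever your manifest uses
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      runtimeClassName: nvidia           # the nvidia RuntimeClass discussed in this issue
      containers:
        - name: nvidia-device-plugin-ctr
          image: nvidia/k8s-device-plugin:v0.10.0
          securityContext:
            privileged: true             # workaround for the clone3/seccomp failure
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins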

paulfantom commented 3 years ago

I can confirm that the workaround suggested by @kralicky is working. :+1:

elezar commented 3 years ago

@kralicky we recently saw the clone3/seccomp update issue on updated docker packages on Ubuntu (see nvidia-container-runtime#157). We have published updated packages for the NVIDIA Container Toolkit (including the nvidia-container-runtime) to our experimental package repositories and will be promoting these to stable in the near future.

As an alternative to running the containers as privileged, you could update the nvidia-container-toolkit to at least 1.6.0-rc.2.

paulfantom commented 3 years ago

I can confirm that with 1.6.0~rc.3-1 the issue is gone. As such, I am closing this bug report. Thank you everyone for your feedback and for helping me solve this issue! :100: :+1:

FischerLGLN commented 2 years ago

Just in case someone tries this with k3s v1.23.8+k3s2:

After installing the driver and nvidia-container-toolkit, put only this inside the config.toml:

sudo mkdir /var/lib/rancher/k3s/agent/etc/containerd
sudo vim /var/lib/rancher/k3s/agent/etc/containerd/config.toml

(Copied from https://github.com/NVIDIA/k8s-device-plugin#configure-containerd)

version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

And of course add the nvidia-device-plugin:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.2/nvidia-device-plugin.yml

Add a RuntimeClass:

apiVersion: node.k8s.io/v1 # RuntimeClass is defined in the node.k8s.io API group
kind: RuntimeClass
metadata:
  name: nvidia # The name the RuntimeClass will be referenced by
  # RuntimeClass is a non-namespaced resource
handler: nvidia # The name of the corresponding CRI configuration

Add a gpu pod:

apiVersion: v1 
kind: Pod 
metadata: 
 name: gpu-pod 
 namespace: gpus 
spec: 
 restartPolicy: OnFailure 
 runtimeClassName: nvidia 
 containers: 
   - name: cuda-container 
     image: nvidia/cuda:11.0-base 
     command: ["nvidia-smi"] 
     resources: 
       limits: 
         nvidia.com/gpu: 1 # requesting 1 GPU
brandond commented 2 years ago

Providing a containerd config template is only necessary if you want to change the default runtime. If you're using runtimeClassName, all you should need to do is install the runtime package for your OS, then restart K3s.

FischerLGLN commented 2 years ago

@brandond I've now tried to run the pod without the runtimeClassName, using the containerd config template above.

The pod goes into CrashLoopBackOff with:

Error: failed to create containerd task: failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown

I would assume that:

version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

this setting makes nvidia the default runtime and is enough, but somehow

 runtimeClassName: nvidia 

does more, like setting the correct binary path for nvidia-smi

hansaya commented 2 years ago

I spent a lot of time on this and I finally managed to get it to work. @FischerLGLN got close, but the issue is that you are not supposed to modify /var/lib/rancher/k3s/agent/etc/containerd/config.toml; you are supposed to use the corresponding config.toml.tmpl, according to https://rancher.com/docs/k3s/latest/en/advanced/

This is what I did to get it to work

# Install gpg key from nvidia
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
    && curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add - \
    && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install Drivers. Use the latest drivers!
apt-get update && apt-get install -y nvidia-driver-515-server nvidia-container-toolkit nvidia-modprobe
reboot

# Check whether GPU recognized. You might have to restart the node after the driver installation to get this working
nvidia-smi

# Download template from k3d project
sudo wget https://k3d.io/v5.4.1/usage/advanced/cuda/config.toml.tmpl -O /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl

# Install nvidia plugin. This is optional. You can simply pass the env variables below to give pods GPU access. However, this is a nice way to debug whether you have access to the GPUs in containerd
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.2/nvidia-device-plugin.yml

Try running the nvidia plugin and check the logs. If you see the following on the GPU node, then you have to modify the .tmpl file further:

2022/07/25 18:14:19 Initializing NVML.
2022/07/25 18:14:19 Failed to initialize NVML: could not load NVML library.
2022/07/25 18:14:19 If this is a GPU node, did you set the docker default runtime to `nvidia`

This used to work previously, but with K3s v1.23+ I had issues. You will have to modify /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl and add:

[plugins.cri.containerd.runtimes.runc.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

and modify

[plugins.cri.containerd.runtimes.runc]
  runtime_type = "io.containerd.runtime.v1.linux"

to

[plugins.cri.containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

This should get everything running. Then either use the nvidia plugin to define resources, or, if you, like me, want to share the GPU across multiple pods, just add these env variables to your pod:

  NVIDIA_VISIBLE_DEVICES: all
  NVIDIA_DRIVER_CAPABILITIES: all
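
Spelled out as a pod spec, that sharing approach might look roughly like the following sketch. The pod name and image are placeholders; it assumes the nvidia RuntimeClass created earlier in this thread, and it deliberately omits an nvidia.com/gpu resource request so the GPU is not exclusively reserved by the device plugin:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-shared-example              # placeholder name
spec:
  runtimeClassName: nvidia              # RuntimeClass with handler "nvidia"
  containers:
    - name: app
      image: nvidia/cuda:11.0-base      # image used elsewhere in this thread; substitute your own
      command: ["nvidia-smi"]
      env:                              # no nvidia.com/gpu limit, so multiple pods can see the GPU
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "all"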
brandond commented 2 years ago

Modifying the containerd config template is not necessary. K3s will automatically add runtimes to the containerd config if the nvidia binaries are present on the node when k3s is started. All you need to do is use the RuntimeClass and Pod specs shown in https://github.com/k3s-io/k3s/issues/4391#issuecomment-1181707242

hansaya commented 2 years ago

@brandond Hmm, I'm happy to test again, but I can guarantee you that it doesn't work out of the box.

brandond commented 2 years ago

Which part of it doesn't work? The runtime binary detection and addition of runtimes to the containerd config, or your runtimeclass/pod spec making use of it?

hansaya commented 2 years ago

Adding the runtimes to the containerd config. I had the RuntimeClass and the pod configured in a fresh installation. The pod started, but without the actual GPU exposed, until I added the config.toml.tmpl file.

brandond commented 2 years ago

The current code checks /usr/bin and /usr/local/nvidia/toolkit for nvidia-container-runtime and nvidia-container-runtime-experimental binaries. I can confirm that k3s finds and adds runtimes for these if they are present. Note that it does NOT change the default runtime and does NOT add a RuntimeClass for you; it is up to you to create one with the correct name and reference it from your pod.

Can you verify the version of k3s you're using, and that you're using the expected binary paths?

[root@centos01 ~]# ln -s /usr/bin/true /usr/bin/nvidia-container-runtime

[root@centos01 ~]# curl -ksL get.k3s.io | sh -
[INFO]  Finding release for channel stable
[INFO]  Using v1.24.3+k3s1 as release
[INFO]  Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.24.3+k3s1/sha256sum-amd64.txt
[INFO]  Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.24.3+k3s1/k3s
[INFO]  Verifying binary download
[INFO]  Installing k3s to /usr/local/bin/k3s
[INFO]  Creating /usr/local/bin/kubectl symlink to k3s
[INFO]  Creating /usr/local/bin/crictl symlink to k3s
[INFO]  Creating /usr/local/bin/ctr symlink to k3s
[INFO]  Creating killall script /usr/local/bin/k3s-killall.sh
[INFO]  Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO]  env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO]  systemd: Creating service file /etc/systemd/system/k3s.service
[INFO]  systemd: Enabling k3s unit
Created symlink from /etc/systemd/system/multi-user.target.wants/k3s.service to /etc/systemd/system/k3s.service.
[INFO]  systemd: Starting k3s

[root@centos01 ~]# cat /var/lib/rancher/k3s/agent/etc/containerd/config.toml
[plugins.opt]
  path = "/var/lib/rancher/k3s/agent/containerd"

[plugins.cri]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  sandbox_image = "rancher/mirrored-pause:3.6"

[plugins.cri.containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true

[plugins.cri.cni]
  bin_dir = "/var/lib/rancher/k3s/data/1d787a9b6122e3e3b24afe621daa97f895d85f2cb9cc66860ea5ff973b5c78f2/bin"
  conf_dir = "/var/lib/rancher/k3s/agent/etc/cni/net.d"

[plugins.cri.containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins.cri.containerd.runtimes.runc.options]
    SystemdCgroup = false

[plugins.cri.containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
iwaitu commented 2 years ago

> I spent a lot of time on this and I finally managed to get it to work. [...] Then either use the nvidia plugin to define resources, or, if you, like me, want to share the GPU across multiple pods, just add these env variables to your pod: NVIDIA_VISIBLE_DEVICES: all, NVIDIA_DRIVER_CAPABILITIES: all

It works for me. Thank you.

brandond commented 2 years ago

> This used to work previously, but with K3s v1.23+ I had issues. You will have to modify /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl and add:
>
> [plugins.cri.containerd.runtimes.runc.options]
>   BinaryName = "/usr/bin/nvidia-container-runtime"

Don't do this; it will change the default runtime to use the nvidia binary instead of runc. As described above, you should be creating a RuntimeClass and setting the runtimeClassName on pods that you want to use the nvidia container runtime.

flixr commented 2 years ago

@brandond are there any docs on how it should be done and since which k3s version?

brandond commented 2 years ago

Here's what I did to get this working on an Ubuntu node. You should be able to follow similar instructions to get it working on any other distro on any currently supported release of K3s:

  1. Install the nvidia-container repo on the node by following the instructions at: https://nvidia.github.io/libnvidia-container/
  2. Install the nvidia packages (not sure specifically which all are needed, but this worked for me): apt install -y nvidia-container-runtime cuda-drivers-fabricmanager-515 nvidia-headless-515-server
  3. Install K3s, or restart it if already installed: curl -ksL get.k3s.io | sh -
  4. Confirm that the nvidia container runtime has been found by k3s: grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
  5. Deploy a manifest containing the RuntimeClass, Nvidia Device Plugin and Feature Discovery DaemonSets, and a Pod that uses the GPU to run a benchmark: kubectl apply -f https://gist.githubusercontent.com/brandond/33e49bf094712f926c95d683d515ac95/raw/nvidia.yaml

Results:

root@ip-172-31-27-127:~# kubectl logs nbody-gpu-benchmark --tail=10
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Turing" with compute capability 7.5

> Compute 7.5 CUDA device: [Tesla T4]
40960 bodies, total time for 10 iterations: 91.529 ms
= 183.299 billion interactions per second
= 3665.981 single-precision GFLOP/s at 20 flops per interaction

I'm not sure how much of this we should cover in our docs though, as this is all owned by the various Nvidia projects; the only difference necessary to follow their instructions for K3s is the addition of the runtimeClass, since we don't replace the default.

flixr commented 2 years ago

Thanks. What I still don't understand is why you recommend changing the "upstream" manifests to add runtimeClassName: nvidia instead of changing the default runtime (which seems easier to me down the road).

brandond commented 2 years ago

Changing the default system runtime based on the autodetected presence of the nvidia container runtime binary is potentially more disruptive.

If the container runtime were made the default, but other packages (such as the libraries, kernel module, and so on) are not properly installed, then the node will be unable to run any pods.

Additionally, it is usually only desired to run some pods with the nvidia runtime; for all of the other pods in the system that aren't going to use GPU, the default runtime is fine. Anyone running GPU pods is already going to be deploying nvidia-specific configuration to their cluster. Asking users to add a field to the pod spec to request the nividia runtime does not seem overly burdensome.

flixr commented 2 years ago

Thanks for your help @brandond! You are right that not all pods need/use a GPU, but I think if you don't request it, it will not be used... I clearly need to check on this a bit more. I'm just looking for the "best-practice" setup, one that also requires the least amount of changes in the manifests/Helm charts, regardless of whether I'm deploying on our on-prem k3s cluster, our customers' clusters, or in the cloud.

brandond commented 2 years ago

You're correct, pods that don't request GPU won't get one, but if you change the default runtime to nvidia-container-runtime then everything will run with that, instead of with runc - which may lead to unexpected changes in behavior.

I think you should be able to inject the runtimeClass fairly easily with tools like kustomize, if you're looking at ways to do that using tooling instead of manual editing of manifests?
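
For example, a small kustomization that injects the field into an existing Deployment might look like this sketch (the resource file and Deployment name are placeholders; it assumes a reasonably recent kustomize that supports inline JSON6902 patches):

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml                     # placeholder: your existing GPU workload
patches:
  - target:
      kind: Deployment
      name: my-gpu-app                  # placeholder name
    patch: |-
      - op: add
        path: /spec/template/spec/runtimeClassName
        value: nvidia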

hansaya commented 2 years ago

Side question: is there any way to request the nvidia runtime in the pod spec and still be able to share a single GPU with multiple pods?

carlwang87 commented 2 years ago

> K3s will automatically add runtimes to the containerd config

Without a pre-installed NVIDIA Container Toolkit and GPU driver, I followed the gpu-operator (v22.9.0) installation guide on k3s (v1.24.3+k3s1) and deployed the GPU operator successfully, but when I ran the samples from https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#running-sample-gpu-applications, they failed. I had to add runtimeClassName: nvidia to the pod spec, so I wonder how these samples ran successfully without runtimeClassName: nvidia.

brandond commented 2 years ago

> I wonder how these samples ran successfully without runtimeClassName: nvidia.

They will not run as-is on K3s. You either need to explicitly specify the nvidia runtime class, or modify the containerd config template to use the nvidia container runtime for all pods.

flixr commented 2 years ago

For those who want to set the default runtime to nvidia, here is what works with k3s v1.24.6+k3s1 using containerd 1.6.8-k3s1: check that the nvidia runtime was detected as @brandond described above. If yes, get the default config.toml.tmpl from https://github.com/k3s-io/k3s/blob/master/pkg/agent/templates/templates_linux.go and change it to have

[plugins.cri.containerd]
  default_runtime_name = "nvidia"

similar to what nvidia also describes in the k3s-device-plugin docs. Here is the template I use: config.toml.tmpl

larivierec commented 2 years ago

> For those who want to set the default runtime to nvidia, here is what works with k3s v1.24.6+k3s1 using containerd 1.6.8-k3s1: check that the nvidia runtime was detected as @brandond described above. If yes, get the default config.toml.tmpl from https://github.com/k3s-io/k3s/blob/master/pkg/agent/templates/templates_linux.go and change it to have
>
> [plugins.cri.containerd]
>   default_runtime_name = "nvidia"
>
> similar to what nvidia also describes in the k3s-device-plugin docs. Here is the template I use: config.toml.tmpl

Hello, I tried this on Ubuntu 22.04 with k3s (Kubernetes 1.25) and it does not work. Containerd version is 1.6.9.

It does seem to properly identify the runtimes, though:

[plugins.opt]
  path = "/var/lib/rancher/k3s/agent/containerd"

[plugins.cri]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  sandbox_image = "rancher/mirrored-pause:3.6"

[plugins.cri.containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true

[plugins.cri.containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins.cri.containerd.runtimes.runc.options]
    SystemdCgroup = true

[plugins.cri.containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes."nvidia".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"

[plugins.cri.containerd.runtimes."nvidia-experimental"]
  runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes."nvidia-experimental".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"

Error thrown by the nvidia-device-plugin pod:

  Warning  Failed     99s (x5 over 3m13s)   kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: driver error: failed to process request: unknown
brandond commented 2 years ago

Can you try my steps documented at https://github.com/k3s-io/k3s/issues/4391#issuecomment-1233314825 instead?

larivierec commented 2 years ago

> Can you try my steps documented at #4391 (comment) instead?

So I just did a clean agent install on the GPU node, following your steps. I get the same config file generated, and instead of using the Helm chart install of nvidia-device-plugin I used your URL. Once the pod starts on the node with the GPU I receive the same error.

I tried running nvidia-smi directly under containerd with ctr; here is the output:

christopher@k8s-gpu:~$ sudo ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0-base cuda-11.0-base nvidia-smi
Thu Nov  3 22:14:02 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   37C    P8    12W / 151W |      2MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

As soon as it's in kubernetes, it doesn't seem to work

brandond commented 2 years ago

> ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0-base cuda-11.0-base nvidia-smi

That tag doesn't seem to exist: https://hub.docker.com/r/nvidia/cuda/tags?page=1&name=11.0-base

Did you mean 11.0.3-base? Even if so, that image doesn't seem to contain the nvidia-smi binary that you're trying to run:

brandond@dev01:~$ docker run --rm -it docker.io/nvidia/cuda:11.0.3-base nvidia-smi
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown.

So I'm not sure how these steps are working for you at all. Additionally, 11.0 is quite old; your command shows that your driver is actually using cuda 11.7, and 11.8 is the current release.

Can you try checking the output of the nbody-gpu-benchmark pod, as shown in my example, instead of running other tests using deprecated examples and commands?

larivierec commented 2 years ago

The image 11.0-base may have been pulled quite a while back. I just tried with 11.0.3 and it worked fine as well. Edit: I also tried with 11.8 and it also worked fine.

I tried running the nbody-gpu-benchmark; it's unschedulable because the nvidia-device-plugin pod is unable to complete.

nbody is looking for nvidia.com/gpu:

  Warning  FailedScheduling  8s    default-scheduler  0/4 nodes are available: 4 Insufficient nvidia.com/gpu. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod.
flixr commented 2 years ago

If it is of interest: I actually abandoned the approach of setting the default runtime to nvidia (although it worked) and went with @brandond's recommendation of explicitly setting runtimeClassName, since Longhorn had problems running under the nvidia runtime... Although that is extra work for my GPU workloads and I had to find workarounds to specify the runtimeClassName in Flyte, this definitely seems to be the route to go.

brandond commented 2 years ago

@larivierec

> the image 11.0-base may have been previously pulled a while back.

Hmm, that'd have to be from quite a while ago then. What version of K3s are you currently using?

> the nvidia-device-plugin pod is unable to complete.

Why not?

> I just tried with 11.0.3 and it worked fine as well. Edit: also tried with 11.8 and it also worked fine.

How is this working? Which specific image are you using? Neither docker.io/nvidia/cuda:11.0.3-base nor any of the newer versions I've tried appear to contain the nvidia-smi binary.

larivierec commented 2 years ago

> @larivierec
>
> the image 11.0-base may have been previously pulled a while back.
>
> Hmm, that'd have to be from quite a while ago then. What version of K3s are you currently using?
>
> the nvidia-device-plugin pod is unable to complete.
>
> Why not?
>
> I just tried with 11.0.3 and it worked fine as well.
>
> edit: also tried with 11.8 and also worked fine.
>
> How is this working? Which specific image are you using? Neither docker.io/nvidia/cuda:11.0.3-base nor any of the newer versions I've tried appear to contain the nvidia-smi binary.

Hmmm, with regards to 11.x I don't know; maybe I made modifications to the system containerd?

The k3s version is the latest stable 1.25 release.

The DaemonSet pod doesn't start because of the error I linked above, sadly 😓

Edit: I was also using the ubuntu22.04 images; I don't know if that makes a difference.

larivierec commented 2 years ago

@brandond Thanks for the help yesterday. It turns out the binary that was being set in the config.toml by k3s was not the one that I installed with the package manager.

  1. k3s-killall.sh
  2. delete /usr/local/nvidia
  3. restart k3s so it picks up the correct nvidia-container-runtime, which in my case was /usr/bin/nvidia-container-runtime

Cheers :beer:

kralicky commented 2 years ago

@larivierec if I remember correctly, the /usr/local/nvidia runtime is installed by the gpu operator, and will be selected over the package manager-installed runtime if it exists. If you're still using the gpu operator it might try to install its own runtime again, so look out for that.

larivierec commented 2 years ago

That would make sense. I had installed it previously; however, when I removed it a while back it didn't seem to clean itself up!

Thanks for the heads up

mxmathieu commented 1 year ago

Hello, personally it works for me with the following setup (Ubuntu 22.04, Nvidia drivers 515, and K3s 1.25):

As @brandond said, we need to create a RuntimeClass resource:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia

Then I prefer to deploy the Helm chart directly from the Nvidia repo:

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm upgrade -i nvdp nvdp/nvidia-device-plugin --namespace nvdp --create-namespace --version 0.12.3 --set=runtimeClassName=nvidia

I hope this will help.