rootless-containers / usernetes

Kubernetes without the root privileges
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2033-kubelet-in-userns-aka-rootless
Apache License 2.0

NVIDIA GPU not supported due to cgroup mountpoint not found #274

Closed cheungsuifai closed 1 year ago

cheungsuifai commented 1 year ago

I planned to test GPU availability on u7s. Because my GPU is an NVIDIA Tesla T4, I tried to deploy the NVIDIA device plugin DaemonSet. For this, I switched the containerd runtime from the original crun to NVIDIA's (v1.13).

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

But after that, the Pod failed to start with the following log:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:279: applying cgroup configuration for process caused \"mountpoint for cgroup not found\"": unknown

My guess is that with cgroup v2 enabled, /sys/fs/cgroup/devices is no longer mounted.
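One way to check that guess is to look at what is actually mounted under /sys/fs/cgroup. The following is a minimal Python sketch, not part of the original report: the function name `describe_cgroup_layout` is hypothetical, and it relies only on the fact that a cgroup v2 unified hierarchy has a `cgroup.controllers` file at its root, while cgroup v1 mounts each controller (such as `devices`) in its own directory.

```python
from pathlib import Path


def describe_cgroup_layout(root="/sys/fs/cgroup"):
    """Report which cgroup layout appears to be mounted under `root`.

    Hypothetical helper for illustration only.
    """
    root = Path(root)
    # A cgroup v2 unified hierarchy has cgroup.controllers at its root.
    unified = (root / "cgroup.controllers").is_file()
    # cgroup v1 mounts each controller separately, e.g. <root>/devices.
    # Hybrid layouts (v2 mounted under <root>/unified) are reported as v1 here.
    has_devices = (root / "devices").is_dir()
    return {"version": 2 if unified else 1, "devices_mount": has_devices}


if __name__ == "__main__":
    print(describe_cgroup_layout())
```

If this reports version 2 with no `devices` mount, a runtime that expects the v1 `devices` controller directory would have nothing to find, which would be consistent with the error above.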

So I set no-cgroups = true in /etc/nvidia-container-runtime/config.toml.

But the "mountpoint for cgroup not found" problem is still there.

AkihiroSuda commented 1 year ago

I think you should report this to https://github.com/NVIDIA/nvidia-container-runtime

cheungsuifai commented 1 year ago

> I think you should report this to https://github.com/NVIDIA/nvidia-container-runtime

Sorry to raise the issue here. I have actually gone through the nvidia-container-runtime issues and found that you are also involved in some of them. In one of those threads, it seems you set up a u7s cluster with GPU support, based on the NVIDIA container runtime with no-cgroups = true.

Is that true?