kubernetes / minikube

Run Kubernetes locally
https://minikube.sigs.k8s.io/
Apache License 2.0

rootless docker: nvidia-device-plugin is failing with crashloopbackoff #18952

Open nitishkumar71 opened 1 month ago

nitishkumar71 commented 1 month ago

What Happened?

I am trying to create a minikube cluster with an NVIDIA GPU using the docker driver. I have followed all the instructions mentioned in the docs. Using the GPU with a Docker container directly works, as shown below:

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Thu May 23 19:32:58 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1650        Off | 00000000:01:00.0 Off |                  N/A |
| N/A   46C    P8               1W /  50W |      6MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

But when I try to create a minikube cluster with GPU support, the nvidia-device-plugin-daemonset pod fails with the error below:

failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: mount error: failed to add device rules: unable to find any existing device filters attached to the cgroup: bpf_prog_query(BPF_CGROUP_DEVICE) failed: operation not permitted: unknown
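For context (an editorial note, not part of the original report): on cgroup v2, the `bpf_prog_query(BPF_CGROUP_DEVICE)` call in this error can only succeed if the relevant cgroup controllers are delegated to the unprivileged user's systemd service. A hedged way to inspect this, assuming the standard systemd user-slice layout used by rootless Docker:

```shell
# Sketch, assuming systemd and cgroup v2: list the controllers delegated
# to the current user's service. A list missing entries such as "cpu" or
# an outright missing file is consistent with the "operation not
# permitted" error above.
cat "/sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers"
```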

The command I am using to create the cluster, and its output:

minikube start --docker-opt="default-ulimit=nofile=102400:102400" --profile gputest --driver docker --container-runtime docker --gpus all --cpus=4 --memory='20g' 
πŸ˜„  [gputest] minikube v1.33.1 on Ubuntu 22.04
✨  Using the docker driver based on user configuration
πŸ“Œ  Using rootless Docker driver
πŸ‘  Starting "gputest" primary control-plane node in "gputest" cluster
🚜  Pulling base image v0.0.44 ...
πŸ”₯  Creating docker container (CPUs=4, Memory=20480MB) ...
🐳  Preparing Kubernetes v1.30.0 on Docker 26.1.1 ...
    β–ͺ opt default-ulimit=nofile=102400:102400
    β–ͺ Generating certificates and keys ...
    β–ͺ Booting up control plane ...
    β–ͺ Configuring RBAC rules ...
πŸ”—  Configuring bridge CNI (Container Networking Interface) ...
πŸ”Ž  Verifying Kubernetes components...
    β–ͺ Using image nvcr.io/nvidia/k8s-device-plugin:v0.15.0
    β–ͺ Using image gcr.io/k8s-minikube/storage-provisioner:v5
🌟  Enabled addons: nvidia-device-plugin, storage-provisioner, default-storageclass
πŸ„  Done! kubectl is now configured to use "gputest" cluster and "default" namespace by default

Attach the log file

minikube_logs.txt

Operating System

Ubuntu

Driver

Docker

medyagh commented 1 month ago

@ComradeProgrammer do you mind taking a look ?

medyagh commented 1 month ago

@nitishkumar71 I noticed you are using rootless docker, do you mind trying with normal docker? it might be that this is not supported for rootless

the "operation not permitted" error related to cgroups indicates a permissions problem. Docker needs access to manipulate cgroup settings for proper device management and isolation. this could be rootleess docker does not have permission to access that

additionally, according to our docs https://minikube.sigs.k8s.io/docs/drivers/docker/ it is recommended, when using rootless docker, to set the --container-runtime flag to "containerd"

so if you have to use rootless, you might also wanna try it with the containerd runtime instead
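A hedged sketch of that suggestion, mirroring the flags from the original invocation with only the runtime swapped (whether --gpus behaves under rootless is exactly what is in question here):

```shell
# Sketch: same cluster as before, but with the containerd runtime
# recommended for the rootless Docker driver. "gputest-containerd" is a
# hypothetical profile name to avoid clobbering the existing profile.
minikube start --profile gputest-containerd \
  --driver docker \
  --container-runtime containerd \
  --gpus all \
  --cpus=4 --memory='20g'
```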

medyagh commented 1 month ago

/triage needs-information /kind support

nitishkumar71 commented 1 month ago

@medyagh Thanks for pointing it out. Running Docker in root mode did work. GPU is only supported with the docker container runtime; trying to use it with containerd gives a proper error message.

I think the requirement of rootful Docker should be highlighted in the docs. I can send a PR for it.
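Once the rootful cluster is up, one hedged way to confirm the device plugin actually registered the GPU (an editorial sketch, not a command from the thread):

```shell
# Sketch: after the nvidia-device-plugin pod is Running, the node should
# advertise nvidia.com/gpu as an allocatable resource. The dot in the
# resource name is escaped with a backslash for custom-columns.
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```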

medyagh commented 1 month ago

thank you for confirming, we should have a guard so that if the user has rootless docker they are not able to enable the gpu in the first place, or at least give them a warning that this is not gonna work and they need Rooted Docker

xcarolan commented 1 month ago

/assign

AkihiroSuda commented 2 weeks ago

this is not gonna work and they need Rooted Docker

NVIDIA runtime is known to work with Rootless Docker: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#rootless-mode

sudo nvidia-ctk config --set nvidia-container-cli.no-cgroups --in-place

Haven't tried with minikube though
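For completeness, the surrounding steps from the NVIDIA rootless-mode guide linked above look roughly like this (a sketch taken from that guide; as noted, not verified against minikube):

```shell
# Sketch of the NVIDIA Container Toolkit rootless setup:
# 1. point the rootless Docker daemon config at the nvidia runtime
nvidia-ctk runtime configure --runtime=docker --config=$HOME/.config/docker/daemon.json
# 2. restart the per-user (rootless) Docker daemon
systemctl --user restart docker
# 3. disable cgroup handling in nvidia-container-cli, since the
#    unprivileged daemon cannot attach device cgroup filters
sudo nvidia-ctk config --set nvidia-container-cli.no-cgroups --in-place
```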