Open lmyslinski opened 1 year ago
After a deeper dive I've discovered that this is due to the device plugin being marked as system-critical, which on GKE is not allowed in namespaces other than kube-system. I've managed to deploy the pods, however as of the latest release (0.13) the driver seems to be crashing due to the lack of nvidia-smi in the path. This looks like an issue with the Docker image in the latest release; is there perhaps a previous release I could use?
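If an earlier chart release exists, it should be possible to pull it explicitly from the OCI registry; a minimal sketch, with the version number left as a placeholder rather than a confirmed release:

# Pull a specific (earlier) chart version from the OCI registry -- the version
# below is a placeholder, not a confirmed release:
helm pull oci://ghcr.io/nebuly-ai/helm-charts/nvidia-device-plugin --version <previous-version>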
Hi @lmyslinski, thank you for raising the issue!
The lack of nvidia-smi in the path generally occurs when GPU support is not enabled in the Docker container. This can be related either to the NVIDIA drivers on the host (perhaps a version mismatch) or to some issue with the nvidia-container-toolkit setup.
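A quick way to narrow that down is to run nvidia-smi in a plain GPU-enabled container directly on the node; a minimal sketch, assuming Docker and the NVIDIA container toolkit are available there and using an example CUDA image tag:

# If this fails, the host driver / nvidia-container-toolkit setup is the
# culprit rather than the device plugin itself.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi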
For L4 GPUs, GKE requires version 1.22.17-gke.5400 or later and NVIDIA driver version 525 or later (here you can find the full requirements). I see you're using v1.24.11-gke.1000, so my guess is that the problem is related to the NVIDIA drivers.
Could you please try either changing the GPU model or upgrading your GKE version?
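For reference, a rough sketch of how to check the node versions and bump the node pool if needed (cluster and pool names are placeholders, not values from this cluster):

# Show the GKE/kubelet version and OS image reported by each node:
kubectl get nodes -o wide
# Upgrade a specific node pool to a release that meets the requirements:
gcloud container clusters upgrade CLUSTER_NAME --node-pool=POOL_NAME --cluster-version=VERSION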
Hello, I believe I'm running into a similar issue. The whole system works well and can utilize the GPUs using a standard gpu-operator / device plugin install:
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --version 23.3.2
Now, I'm trying to enable MPS using your fork of the nvidia-device-plugin. I uninstall gpu-operator.
Then, I reinstall it without the standard device plugin using:
helm install -n gpu-operator gpu-operator nvidia/gpu-operator --version 23.3.2 --set devicePlugin.enabled=false
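At that point it's worth confirming the operator is no longer shipping its own plugin before installing the fork; a small sketch:

# No nvidia-device-plugin daemonset should be listed once devicePlugin is disabled:
kubectl get daemonsets -n gpu-operator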
Then, I install your device plugin:
helm install oci://ghcr.io/nebuly-ai/helm-charts/nvidia-device-plugin \
--version 0.13.0 \
--generate-name \
-n nebuly-nvidia \
--create-namespace
It installs successfully, but the set-compute-mode container fails to start with the error:
failed to create containerd task: failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown
Cluster data:
K8s Rev: {Client: v1.27.2, Server: v1.23.17+k3s1}
GPU used: Nvidia A4000
nvidia-device-plugin-0.13.0
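On a k3s cluster like this one, that particular error is often a sign that containerd isn't using the NVIDIA runtime for the container; a hedged diagnostic sketch, assuming the default k3s paths:

# Check whether k3s picked up the NVIDIA container runtime on the GPU node:
grep -A3 nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
# Check whether a matching RuntimeClass exists in the cluster:
kubectl get runtimeclass
# And rule out a missing host driver by running nvidia-smi on the node itself:
nvidia-smi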
I have the same problem as @willcray
@santurini I don't remember anything regarding this setup since I've long moved on, but I've written a detailed post covering all of the GPU Operator stuff:
https://lmyslinski.com/posts/gpu-operator-guide/
At a quick glance, if nvidia-smi is not found, that means the driver is not working or not installed.
Have you found a different solution for enabling MPS in a Kubernetes cluster?
Hi, I'm trying to set up MPS partitioning on GKE, but I can't get the k8s-device-plugin to work. The plugin gets installed correctly, but it never starts any driver pods.
Cluster data:
The node only has the following taints:
It's also properly labeled as
The regular nvidia device plugin worked just fine before I pushed it off the node with nodeSelectors on the default daemonset injected by GKE.
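For context, that change was along these lines; a rough sketch where the daemonset name, label key and value are placeholders (the GKE-managed daemonset name can differ by version):

# JSON merge patch that restricts the GKE-managed plugin to nodes carrying a
# chosen label, keeping it off the MPS nodes (names are placeholders):
kubectl -n kube-system patch daemonset nvidia-gpu-device-plugin \
  --type merge -p '{"spec":{"template":{"spec":{"nodeSelector":{"gke-default-device-plugin":"enabled"}}}}}'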
The nebuly plugin however is stuck at 0 pods:
Your documentation mentions that, in order to avoid duplicate drivers on the nodes, we can configure affinity on the pre-existing nvidia driver so that both aren't scheduled on the same nodes. I've done that for the GKE driver daemonset, but that results in a container that's always stuck in creating. Not a big deal, but I just want to confirm that this is expected. Here are the pods I currently have on the GPU node:
Is there anything I'm doing incorrectly here? AFAIK it's not possible to remove the default nvidia driver from the cluster, as it's automatically injected by GKE. Please let me know if there's anything I could do to solve this, I'd love to start using your stuff. Thanks a lot for your time.
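For reference, a quick way to see why a daemonset schedules zero pods and why a container hangs in creating; a minimal sketch, with the namespace and names left as assumptions:

# Describe the daemonset to see desired/scheduled counts and any events
# explaining why nodes were skipped:
kubectl -n nebuly-nvidia describe daemonset
# Compare against the taints and labels actually present on the GPU node:
kubectl describe node <gpu-node-name> | grep -A5 -E 'Taints|Labels'
# For the container stuck in creating, the pod events usually say why:
kubectl -n kube-system describe pod <gke-driver-pod-name>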