nebuly-ai / nos

Module to automatically maximize the utilization of GPU resources in a Kubernetes cluster through real-time dynamic partitioning and elastic quotas. Effortless optimization at its finest!
https://www.nebuly.com/
Apache License 2.0

Nebuly k8s-device-plugin not starting on GKE #36

Open lmyslinski opened 1 year ago

lmyslinski commented 1 year ago

Hi, I'm trying to set up MPS partitioning on GKE, but I can't get the k8s-device-plugin to work. The plugin gets installed correctly, but it never starts any driver pods.

Cluster data:

The node only has the following taints:

Taints:             nvidia.com/gpu=present:NoSchedule

It's also properly labeled as

nos.nebuly.com/gpu-partitioning=mps
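
For reference, the taint and label above can be set with standard kubectl commands (the node name is a placeholder; on GKE the GPU taint is usually added automatically):

kubectl taint node <node-name> nvidia.com/gpu=present:NoSchedule
kubectl label node <node-name> nos.nebuly.com/gpu-partitioning=mps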

The regular NVIDIA device plugin worked just fine before I pushed it off the node by adding nodeSelectors to the default daemonset injected by GKE.

The nebuly plugin however is stuck at 0 pods:

k get ds -n nebuly-nvidia

NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                         AGE
nvidia-device-plugin-1684693222   0         0         0       0            0           nos.nebuly.com/gpu-partitioning=mps   33m
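
A DESIRED count of 0 usually means no node matches the daemonset's node selector. A quick way to double-check that the label actually matches the selector (standard kubectl; these commands are not part of the original report):

kubectl get nodes -l nos.nebuly.com/gpu-partitioning=mps
kubectl describe ds nvidia-device-plugin-1684693222 -n nebuly-nvidia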

Your documentation mentions that, in order to avoid duplicate drivers on nodes, we can configure affinity on the pre-existing nvidia driver to avoid scheduling both on the same nodes. I've done that for the GKE driver daemonset (see the sketch below), but that results in a container that's always stuck in ContainerCreating. Not a big deal, but I just want to confirm that this is expected.
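
A sketch of the kind of node affinity rule meant here, which keeps the pre-existing plugin off MPS-partitioned nodes (the exact placement inside the daemonset's pod spec may differ):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: nos.nebuly.com/gpu-partitioning
          operator: NotIn
          values:
          - mps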

Here's what pods I currently have on the GPU node:

  Namespace                   Name                                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                       ------------  ----------  ---------------  -------------  ---
  kube-system                 fluentbit-gke-nzm72                                        100m (2%)     0 (0%)      200Mi (1%)       500Mi (3%)     23m
  kube-system                 gke-metrics-agent-ghmdm                                    8m (0%)       0 (0%)      110Mi (0%)       110Mi (0%)     23m
  kube-system                 kube-proxy-gke-xxx-gke-workspace-gpu-95e23864-6fwc    100m (2%)     0 (0%)      0 (0%)           0 (0%)         23m
  kube-system                 nvidia-gpu-device-plugin-x6l9c                             50m (1%)      0 (0%)      50Mi (0%)        50Mi (0%)      23m
  kube-system                 pdcsi-node-dbxns                                           10m (0%)      0 (0%)      20Mi (0%)        100Mi (0%)     23m

Is there anything I'm doing incorrectly here? AFAIK it's not possible to remove the default NVIDIA driver from the cluster, as it's automatically injected by GKE. Please let me know if there's anything I could do to solve this, I'd love to start using your stuff. Thanks a lot for your time.

lmyslinski commented 1 year ago

After a deeper dive I've discovered that this is due to the device plugin being marked as system-critical, which on GKE is not allowed in namespaces other than kube-system. I've managed to deploy the pods; however, as of the latest release (0.13) the driver seems to be crashing due to nvidia-smi missing from the PATH. This seems like an issue with the Docker image in the latest release; is there perhaps a previous release I could use?
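
For anyone hitting the same restriction: on GKE, pods requesting the system-node-critical or system-cluster-critical priority classes are rejected outside kube-system unless the namespace has a ResourceQuota that explicitly allows those priority classes. A minimal sketch of such a quota (the quota name and pod limit are arbitrary, and this assumes the plugin is deployed in the nebuly-nvidia namespace):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: critical-pods
  namespace: nebuly-nvidia
spec:
  hard:
    pods: 100
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - system-node-critical
      - system-cluster-critical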

Telemaco019 commented 1 year ago

Hi @lmyslinski, thank you for raising the issue!

The lack of nvidia-smi in the PATH generally occurs when GPU support is not enabled in the container. This can be related to either the NVIDIA drivers on the host (maybe a version mismatch) or some issue with the nvidia-container-toolkit setup.

For L4 GPUs, GKE requires version 1.22.17-gke.5400 or later and NVIDIA driver version 525 or later (here you can find the full requirements). I see you're using v1.24.11-gke.1000, so my guess is that the problem is related to the NVIDIA drivers.

Could you please try either changing the GPU model or upgrading your GKE version?
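
If it helps, the node's GKE version and the GPU resources it advertises can be checked with standard kubectl (the node name is a placeholder):

kubectl get nodes -o wide
kubectl describe node <node-name> | grep -i nvidia.com/gpu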

willcray commented 1 year ago

Hello, I believe I'm running into a similar issue. The whole system works well and can utilize the GPUs using a standard gpu-operator / device plugin install:

helm install -n gpu-operator gpu-operator nvidia/gpu-operator --version 23.3.2

Now, I'm trying to enable MPS using your fork of the nvidia-device-plugin. I uninstall gpu-operator.

Then, I reinstall it without the standard device plugin using:

helm install -n gpu-operator gpu-operator nvidia/gpu-operator --version 23.3.2 --set devicePlugin.enabled=false

Then, I install your device plugin:

helm install oci://ghcr.io/nebuly-ai/helm-charts/nvidia-device-plugin \
  --version 0.13.0 \
  --generate-name \
  -n nebuly-nvidia \
  --create-namespace

It installs successfully, but the set-compute-mode container fails to start with this error:

failed to create containerd task: failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown

Cluster data:

K8s Rev: {Client: v1.27.2, Server: v1.23.17+k3s1}
GPU used: Nvidia A4000
nvidia-device-plugin-0.13.0
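
In case it's useful for debugging: the error above suggests the container was not started with the NVIDIA runtime, so nvidia-smi was never injected into it. Two things worth checking on a k3s node (standard commands; the nvidia RuntimeClass is the one the GPU operator normally creates, and the config path is the default k3s location):

kubectl get runtimeclass
sudo grep -A3 nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
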
santurini commented 1 month ago

I have the same problem as @willcray

lmyslinski commented 1 month ago

> I have the same problem as @willcray

@santurini I don't remember anything regarding this setup since I've long moved on, but I've created a detailed post regarding all of the GPU Operator stuff:

https://lmyslinski.com/posts/gpu-operator-guide/

At a quick glance, if nvidia-smi is not found, that means the driver is not working or not installed.
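
One quick way to verify that, assuming a default gpu-operator install where the operator deploys the driver daemonset itself (names may differ in other setups):

kubectl get pods -n gpu-operator
kubectl exec -n gpu-operator ds/nvidia-driver-daemonset -- nvidia-smi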

santurini commented 1 month ago

Have you found a different solution for enabling MPS in a Kubernetes cluster?