grgalex / nvshare

Practical GPU Sharing Without Memory Size Constraints

GKE: Pods cannot access/detect GPU device + driver on GPU nodes #6

Open cjidboon94 opened 1 year ago

cjidboon94 commented 1 year ago

When trying to set up nvshare on GKE, installation goes fine and scheduling pods (e.g. the test pods from the README or a simple CUDA pod that runs nvidia-smi) works; nvshare.com/gpu gets consumed. However, the pods error with "nvidia-smi is not found", or, in the case of the small PyTorch pod:

    Traceback (most recent call last):
      File "/pytorch-add-small.py", line 29, in <module>
        device = torch.cuda.current_device()
      File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 479, in current_device
        _lazy_init()
      File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 214, in _lazy_init
        torch._C._cuda_init()
    RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

When scheduling the pod by requesting nvidia.com/gpu, the GPU is visible and the drivers + nvidia-smi are available.

Setup:
- GKE k8s version: 1.25.10-gke.2700
- nvidia-gpu-device-plugin: GKE's own GPU device plugin

How to reproduce:

Expectation when checking logs: GPU information, i.e. what you see when scheduling with a nvidia.com/gpu request:

    GPU 0: NVIDIA L4 (UUID: GPU-7e0c893c-3254-dfa8-db40-73942c3de761)

Actual output:

    bash: /usr/local/nvidia/bin/nvidia-smi: No such file or directory

grgalex commented 1 year ago

@cjidboon94

Unfortunately, nvshare currently strictly depends on NVIDIA's upstream K8s device plugin [1].

This is because nvshare's implementation is tightly coupled to NVIDIA's container runtime.

When I have some time next week, I will elaborate on this fully.

A short summary: nvshare-device-plugin sets the NVIDIA_VISIBLE_DEVICES environment variable (or its /dev/null mount alternative) in containers that request nvshare.com/gpu. NVIDIA's container runtime, a runc hook that runs containers on the node, reads this environment variable and mounts the necessary files (libraries, device nodes, and binaries such as nvidia-smi) into the container.
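To make that concrete, below is a minimal, hypothetical sketch of the mechanism, not nvshare's actual code. It uses the standard Kubernetes device plugin API (k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1); the type name nvsharePlugin and the "all" value are illustrative assumptions.

```go
// Hypothetical sketch (not nvshare's real implementation): it only shows how
// a device plugin's Allocate handler can inject NVIDIA_VISIBLE_DEVICES so
// that NVIDIA's runc hook later mounts the driver files into the container.
package nvshare

import (
	"context"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

type nvsharePlugin struct{}

// Allocate is called by the kubelet for each container that was granted an
// nvshare.com/gpu resource. The Envs returned here are what NVIDIA's
// container runtime hook reads to decide which GPU libraries, device nodes,
// and binaries (such as nvidia-smi) to mount into the container.
func (p *nvsharePlugin) Allocate(ctx context.Context,
	reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	resp := &pluginapi.AllocateResponse{}
	for range reqs.ContainerRequests {
		resp.ContainerResponses = append(resp.ContainerResponses,
			&pluginapi.ContainerAllocateResponse{
				Envs: map[string]string{
					// "all" is an illustrative value; a real plugin would pass
					// the UUID(s) of the GPU(s) it wants to expose. The
					// /dev/null mount alternative mentioned above would use
					// this response's Mounts field instead of Envs.
					"NVIDIA_VISIBLE_DEVICES": "all",
				},
			})
	}
	return resp, nil
}
```

The key point is that this environment variable does nothing by itself: it only works if NVIDIA's container runtime hook is installed on the node to act on it, which is exactly the piece GKE's own GPU stack does not provide in the same way.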

Without NVIDIA's device plugin, containers requesting an nvshare.com/gpu device will not see the GPU exposed at runtime and will fail.

TL;DR:

For the time being, nvidia-device-plugin [1] is a strict prerequisite for operating nvshare on Kubernetes.

[1] https://github.com/NVIDIA/k8s-device-plugin

cjidboon94 commented 1 year ago

Thanks for clarifying. I'll see if I can easily switch GKE's device plugin to NVIDIA's upstream one and then get the rest working.