grgalex / nvshare

Practical GPU Sharing Without Memory Size Constraints

GKE: Pods cannot access/detect GPU device + driver on GPU nodes #6

Open cjidboon94 opened 1 year ago

cjidboon94 commented 1 year ago

When trying to set up nvshare on GKE, installation goes fine and scheduling pods (e.g. the test pods from the README or a simple CUDA pod that runs nvidia-smi) works; nvshare.com/gpu gets consumed. However, the pods error with "nvidia-smi is not found", or, in the case of the small PyTorch pod:

    Traceback (most recent call last):
      File "/pytorch-add-small.py", line 29, in <module>
        device = torch.cuda.current_device()
      File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 479, in current_device
        _lazy_init()
      File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 214, in _lazy_init
        torch._C._cuda_init()
    RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

When scheduling the pod by requesting nvidia.com/gpu, the GPU is visible and the drivers + nvidia-smi are available.

Setup:
- GKE k8s version: 1.25.10-gke.2700
- nvidia-gpu-device-plugin: GKE's own GPU device plugin

How to reproduce:

Expectation when checking logs: GPU information, i.e. what you see when scheduling with a nvidia.com/gpu request:

    GPU 0: NVIDIA L4 (UUID: GPU-7e0c893c-3254-dfa8-db40-73942c3de761)

Actual output:

    bash: /usr/local/nvidia/bin/nvidia-smi: No such file or directory

grgalex commented 1 year ago

@cjidboon94

Unfortunately, nvshare currently strictly depends on NVIDIA's upstream K8s device plugin [1].

This is because nvshare's implementation is tightly coupled to NVIDIA's container runtime.

When I have some time next week, I will elaborate on this fully.

A short summary: nvshare-device-plugin sets the NVIDIA_VISIBLE_DEVICES environment variable (or its /dev/null mount alternative) in containers that request nvshare.com/gpu. NVIDIA's container runtime, a runc hook that runs containers on the node, reads this environment variable and mounts the necessary files (libraries, device nodes, and binaries such as nvidia-smi) into the container.
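To make that concrete, below is a minimal, hypothetical sketch of the mechanism, not nvshare's actual code. It uses the standard Kubernetes device plugin API (k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1); the type name nvsharePlugin and the "all" value are illustrative assumptions.

```go
// Hypothetical sketch (not nvshare's real implementation): it only shows how
// a device plugin's Allocate handler can inject NVIDIA_VISIBLE_DEVICES so
// that NVIDIA's runc hook later mounts the driver files into the container.
package nvshare

import (
	"context"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

type nvsharePlugin struct{}

// Allocate is called by the kubelet for each container that was granted an
// nvshare.com/gpu resource. The Envs returned here are what NVIDIA's
// container runtime hook reads to decide which GPU libraries, device nodes,
// and binaries (such as nvidia-smi) to mount into the container.
func (p *nvsharePlugin) Allocate(ctx context.Context,
	reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	resp := &pluginapi.AllocateResponse{}
	for range reqs.ContainerRequests {
		resp.ContainerResponses = append(resp.ContainerResponses,
			&pluginapi.ContainerAllocateResponse{
				Envs: map[string]string{
					// "all" is an illustrative value; a real plugin would pass
					// the UUID(s) of the GPU(s) it wants to expose. The
					// /dev/null mount alternative mentioned above would use
					// this response's Mounts field instead of Envs.
					"NVIDIA_VISIBLE_DEVICES": "all",
				},
			})
	}
	return resp, nil
}
```

The key point is that this environment variable does nothing by itself: it only works if NVIDIA's container runtime hook is installed on the node to act on it, which is exactly the piece GKE's own GPU stack does not provide in the same way.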

Without NVIDIA's device plugin, containers requesting an nvshare.com/gpu device will not see the GPU exposed at runtime and will fail.

TL;DR:

For the time being, nvidia-device-plugin [1] is a strict prerequisite for operating nvshare on Kubernetes.

[1] https://github.com/NVIDIA/k8s-device-plugin

cjidboon94 commented 1 year ago

Thanks for clarifying. I'll see if I can easily switch GKE's device plugin to NVIDIA's upstream one and then get the rest working.