Open cjidboon94 opened 1 year ago
@cjidboon94 Unfortunately, nvshare currently strictly depends on NVIDIA's upstream K8s device plugin [1]. This is because nvshare's implementation is tightly coupled with NVIDIA's container runtime. When I have some time next week, I will elaborate on this fully.

A short summary: nvshare-device-plugin sets the NVIDIA_VISIBLE_DEVICES environment variable (or its symbolic /dev/null mount alternative) in containers that request nvshare.com/gpu. NVIDIA's container runtime, a runc hook that runs the containers on the node, reads this environment variable and mounts the necessary files (libraries, device nodes, and binaries such as nvidia-smi) into the container.
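As an aside, you can trigger the same mechanism by hand to see what the runtime hook does. A minimal sketch, assuming NVIDIA's container runtime is the node's default runtime; the pod name and image are placeholders, and this bypasses the device plugin entirely, so it is only for illustration:

```yaml
# Illustration only: setting NVIDIA_VISIBLE_DEVICES by hand triggers the same
# runc hook that reacts to the variable set by a device plugin.
# Assumes NVIDIA's container runtime is the node's default runtime.
apiVersion: v1
kind: Pod
metadata:
  name: env-var-demo                # placeholder name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # example image
    command: ["nvidia-smi"]         # works only if the hook mounted the binary
    env:
    - name: NVIDIA_VISIBLE_DEVICES  # read by the runtime hook at container start
      value: "all"
```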
Without NVIDIA's device plugin, containers requesting an nvshare.com/gpu device will not see the device exposed at runtime and will fail.

For the time being, nvidia-device-plugin [1] is a strict prerequisite for operating nvshare on Kubernetes.
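For reference, the upstream plugin is normally deployed as a daemonset straight from its repository, roughly like this (the release tag below is only an example; check the plugin's README for the currently recommended version):

```bash
# Deploy NVIDIA's upstream k8s-device-plugin as a daemonset (example release tag).
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
```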
Thanks for clarifying. I'll see if I can easily switch GKE's device plugin to NVIDIA's upstream one and then get the rest working.
When trying to set up nvshare on GKE, installation goes fine, and scheduling pods (e.g. the test pods from the README, or a simple CUDA pod that runs nvidia-smi) also goes fine: the nvshare.com/gpu resource gets consumed. However, the pods then fail, either with nvidia-smi not found or, in the case of e.g. the small PyTorch pod, with:

Traceback (most recent call last):
  File "/pytorch-add-small.py", line 29, in <module>
    device = torch.cuda.current_device()
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 479, in current_device
    _lazy_init()
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 214, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
When scheduling the pod by requesting nvidia.com/gpu, the GPU is visible and the drivers + nvidia-smi are available.

Setup:
GKE k8s version: 1.25.10-gke.2700
nvidia-gpu-device-plugin: GKE's own GPU device plugin
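A quick way to compare the two cases is to exec into a pod scheduled with each resource and check what the runtime mounted. The pod names below are placeholders; the path comes from the GKE driver installer used in this report:

```bash
# Placeholders: nvidia-gpu-pod requests nvidia.com/gpu, nvshare-gpu-pod requests nvshare.com/gpu.
# Show which NVIDIA env vars are set and whether the driver binaries were mounted into each pod.
kubectl exec nvidia-gpu-pod  -- sh -c 'env | grep NVIDIA; ls /usr/local/nvidia/bin'
kubectl exec nvshare-gpu-pod -- sh -c 'env | grep NVIDIA; ls /usr/local/nvidia/bin'
```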
How to reproduce:
1. Install the nvidia-driver-installer daemonset to get the drivers onto the nodes, as per https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers
2. Install the nvshare daemonsets according to the README.
3. Add the following env var to the nvshare-device-plugin daemonset, since GKE's gpu-device-plugin does not expose this env var and nvshare-device-plugin depends on it:
4. Optional: Add the following affinity to the nvshare-device-plugin daemonset so that its pods only get scheduled on GPU nodes (one possible form is sketched at the end of this report):
5. Deploy a pod that requests an nvshare.com/gpu resource (a minimal example manifest is sketched at the end of this report):

Expectation when checking the logs: GPU information, e.g.
GPU 0: NVIDIA L4 (UUID: GPU-7e0c893c-3254-dfa8-db40-73942c3de761)
(This is what you see when scheduling with a nvidia.com/gpu request.)

Actual output:
bash: /usr/local/nvidia/bin/nvidia-smi: No such file or directory
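Illustrative sketches for steps 4 and 5 above. These are not the exact manifests from this report: the names and image are placeholders, and the affinity simply assumes GKE's standard cloud.google.com/gke-accelerator node label.

```yaml
# Sketch for step 4: node affinity fragment to merge into the nvshare-device-plugin
# DaemonSet so its pods only land on GPU nodes.
# Assumes GKE's standard cloud.google.com/gke-accelerator node label.
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
```

```yaml
# Sketch for step 5: a minimal test pod requesting an nvshare.com/gpu resource.
# Pod name and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: nvshare-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # example image
    command: ["nvidia-smi", "-L"]   # should list the GPU if the driver files were mounted
    resources:
      limits:
        nvshare.com/gpu: 1
```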