Closed nakkaya closed 3 months ago
Thanks for the report @nakkaya. I'm unable to replicate on a GKE cluster with 2x V100 and 2x T4s on the latest commit on master (c9f575cf3d127ce6387795b0400a2de85c143377).
Before I try on a k3s 2080/3090 cluster:
Do you see the nvidia.com/gpu resource in kubectl describe nodes? Can you confirm you see >1 GPU there?
Does this also happen on the latest nightly (pip uninstall skypilot skypilot-nightly; pip install -U skypilot-nightly)/master branch?
@romilbhardwaj Same issue on nightly, here are the outputs for the questions.
λ sky -c
skypilot, commit 20493fb61601ce00d612073ebad4706bbccdf487
λ sky -v
skypilot, version 1.0.0.dev20240418
λ sky show-gpus --cloud kubernetes
COMMON_GPU AVAILABLE_QUANTITIES
V2080 1
V3090 1
Hint: use -a/--all to see all accelerators (including non-common ones) and pricing.
λ kubectl get nodes -o=custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
NAME GPU
beowulf-master <none>
beowulf-nfs <none>
beowulf-node-1 <none>
beowulf-node-2 <none>
beowulf-node-3 <none>
lab-omid 1
lab-rabiyev 1
lab-workstation-1 1
lab-workstation-2 1
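For reference, the node listing above can be tallied with a short Python sketch. It only parses the text pasted in this thread and is not part of SkyPilot; `parse_gpu_column` is a hypothetical helper name introduced for illustration.

```python
# Output copied from the `kubectl get nodes -o=custom-columns=...` command above.
KUBECTL_OUTPUT = """\
NAME GPU
beowulf-master <none>
beowulf-nfs <none>
beowulf-node-1 <none>
beowulf-node-2 <none>
beowulf-node-3 <none>
lab-omid 1
lab-rabiyev 1
lab-workstation-1 1
lab-workstation-2 1
"""


def parse_gpu_column(text):
    """Return {node_name: allocatable_gpu_count}, treating <none> as 0."""
    counts = {}
    for line in text.strip().splitlines()[1:]:  # skip the header row
        name, gpu = line.split()
        counts[name] = 0 if gpu == "<none>" else int(gpu)
    return counts


counts = parse_gpu_column(KUBECTL_OUTPUT)
gpu_nodes = [name for name, count in counts.items() if count > 0]
print(f"{len(gpu_nodes)} GPU nodes, {sum(counts.values())} allocatable GPUs in total")
```

So Kubernetes itself reports four allocatable GPUs, one on each lab-* node, which matches the "4 nodes and 4 GPUs" setup described in the report.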
Thanks @nakkaya. This is expected behavior: AVAILABLE_QUANTITIES shows the max GPU quantity available on a single node. Since your GPUs are spread across nodes (one per node), each GPU type shows up with an AVAILABLE_QUANTITIES of 1.
You should still be able to run two concurrent tasks, each requiring one GPU: sky launch -c myclus1 --gpus V2080:1; sky launch -c myclus2 --gpus V2080:1.
To use multiple GPUs in the same SkyPilot cluster, you can run sky launch -c largeclus --gpus V2080:1 --num-nodes 2. Here --gpus specifies the number of GPUs per node, and --num-nodes specifies using two nodes. Is this helpful?
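To illustrate the difference between a per-node maximum and a cluster-wide total, here is a minimal Python sketch. It is not SkyPilot's actual implementation, and the mapping of GPU types to node names is an assumption: the thread only shows that four nodes have one GPU each and that two GPU types exist.

```python
from collections import defaultdict

# Hypothetical assignment of GPU types to nodes (not stated in the thread).
NODE_GPUS = [
    ("lab-omid", "V2080", 1),
    ("lab-rabiyev", "V2080", 1),
    ("lab-workstation-1", "V3090", 1),
    ("lab-workstation-2", "V3090", 1),
]


def summarize(node_gpus):
    """Return (max GPUs on any single node, total GPUs in the cluster) per GPU type."""
    per_node_max = defaultdict(int)
    cluster_total = defaultdict(int)
    for _node, gpu_type, count in node_gpus:
        per_node_max[gpu_type] = max(per_node_max[gpu_type], count)
        cluster_total[gpu_type] += count
    return dict(per_node_max), dict(cluster_total)


per_node_max, cluster_total = summarize(NODE_GPUS)
print("max on a single node:", per_node_max)
print("total across nodes:  ", cluster_total)
```

Under these assumptions the per-node maximum is 1 for both types even though two GPUs of each type exist, which is why a single-node request is capped at 1 while --num-nodes 2 can still use both.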
One idea is to change the fields a bit:
COMMON_GPU PER_NODE CLUSTER_TOTAL
V2080 1 2
V3090 1 2
Would that be better @nakkaya?
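A rough Python sketch of how such a table could be rendered, purely for illustration. `render_gpu_table` is a hypothetical helper, not an existing SkyPilot function, and the rows are the example values from the proposal above.

```python
def render_gpu_table(rows):
    """Format (gpu_type, per_node, cluster_total) rows as left-aligned columns."""
    header = ("COMMON_GPU", "PER_NODE", "CLUSTER_TOTAL")
    table = [header] + [(gpu, str(per_node), str(total)) for gpu, per_node, total in rows]
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in table) for i in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in table
    )


rendered = render_gpu_table([("V2080", 1, 2), ("V3090", 1, 2)])
print(rendered)
```

Keeping both columns side by side shows the largest single-node request and the multi-node capacity at the same time.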
I think that would better help avoid confusion.
This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.
This issue was closed because it has been stalled for 10 days with no activity.
I am using skypilot in my home lab with 4 nodes and 4 GPUs all managed by Rancher (k3s).
The GPU setup on the nodes is working; I can run non-SkyPilot GPU workloads on them fine. Each node is tagged using
When I run sky show-gpus --cloud kubernetes, only one of each GPU type is listed in the output. If I change one of the accelerator labels to some arbitrary value, it is immediately picked up by show-gpus, but duplicate values are only shown as quantity 1. Using python -m sky.utils.kubernetes.gpu_labeler to do the tagging makes no difference; the issue is the same.
Earlier SkyPilot versions worked as expected. I was able to launch 2-node clusters fine.
Version & Commit info:
sky -v: skypilot, version 0.5.0
sky -c: skypilot, commit 1e4e871398e121708d3e9809c0a98b905bf9f212