skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[k8s] Nodes with same GPU type are not detected #3448

Closed: nakkaya closed this issue 2 weeks ago

nakkaya commented 4 months ago

I am using SkyPilot in my home lab with 4 nodes and 4 GPUs, all managed by Rancher (k3s).

The GPU setup on the nodes is working; I can run non-SkyPilot GPU workloads on them fine. Each node is tagged using:

kubectl label nodes lab-workstation-1 skypilot.co/accelerator=v3090
kubectl label nodes lab-workstation-2 skypilot.co/accelerator=v3090
kubectl label nodes lab-omid skypilot.co/accelerator=v2080
kubectl label nodes lab-rabiyev skypilot.co/accelerator=v2080
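
The labels can be double-checked with something like the following (it just adds a column showing the label value per node):

kubectl get nodes -L skypilot.co/accelerator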

When I run sky show-gpus --cloud kubernetes, only one of each GPU type is listed in the output. If I change one of the accelerator labels to some arbitrary value, it is immediately picked up by show-gpus, but duplicate values are only shown as quantity 1. Using python -m sky.utils.kubernetes.gpu_labeler to do the tagging makes no difference; same issue.

COMMON_GPU  AVAILABLE_QUANTITIES  
V2080       1                     
V3090       1                     

Earlier SkyPilot versions worked as expected; I was able to launch 2-node clusters fine.

Version & Commit info:

romilbhardwaj commented 4 months ago

Thanks for the report @nakkaya. I'm unable to replicate this on a GKE cluster with 2x V100 and 2x T4 on the latest commit on master (c9f575cf3d127ce6387795b0400a2de85c143377).

Before I try on a k3s 2080/3090 cluster:

  1. What's the capacity of the nvidia.com/gpu resource in kubectl describe nodes? Can you confirm you see >1 GPU there? (See the one-liner after this list.)
  2. Can you try with the latest nightly release (pip uninstall skypilot skypilot-nightly; pip install -U skypilot-nightly) or the latest master branch?
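
For reference, per-node GPU capacity can be dumped with something like the following (just a sketch; adjust the column spec if your GPU resource name differs):

kubectl get nodes -o=custom-columns='NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu'
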
nakkaya commented 4 months ago

@romilbhardwaj Same issue on nightly; here are the outputs for your questions.

λ sky -c
skypilot, commit 20493fb61601ce00d612073ebad4706bbccdf487

λ sky -v
skypilot, version 1.0.0.dev20240418

λ sky show-gpus --cloud kubernetes
COMMON_GPU  AVAILABLE_QUANTITIES  
V2080       1                     
V3090       1                     

Hint: use -a/--all to see all accelerators (including non-common ones) and pricing.

λ kubectl get nodes -o=custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
NAME                GPU
beowulf-master      <none>
beowulf-nfs         <none>
beowulf-node-1      <none>
beowulf-node-2      <none>
beowulf-node-3      <none>
lab-omid            1
lab-rabiyev         1
lab-workstation-1   1
lab-workstation-2   1
romilbhardwaj commented 4 months ago

Thanks @nakkaya. This is expected behavior: AVAILABLE_QUANTITIES shows the maximum GPU quantity available on a single node. Since your two GPUs of each type are spread across separate nodes, each type shows up with an AVAILABLE_QUANTITY of 1.

You should still be able to run two concurrent tasks, each requiring one GPU: sky launch -c myclus1 --gpus V2080:1; sky launch -c myclus2 --gpus V2080:1.

To use multiple GPUs in the same SkyPilot cluster, you can run sky launch -c largeclus --gpus V2080:1 --num-nodes 2. Here --gpus specifies the number of GPUs per node, and --num-nodes specifies using two nodes. Is this helpful?
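
If it helps, roughly the same thing can be written as a task YAML (the file name and run command below are placeholders, just to illustrate the fields):

# multi-node.yaml: two nodes, one V2080 per node
num_nodes: 2
resources:
  cloud: kubernetes
  accelerators: V2080:1
run: |
  nvidia-smi

and then launched with sky launch -c largeclus multi-node.yaml.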

romilbhardwaj commented 4 months ago

One idea is to change the fields a bit:

COMMON_GPU  PER_NODE  CLUSTER_TOTAL
V2080       1         2
V3090       1         2

Would that be better @nakkaya?

nakkaya commented 4 months ago

> One idea is to change the fields a bit:
>
> COMMON_GPU  PER_NODE  CLUSTER_TOTAL
> V2080       1         2
> V3090       1         2
>
> Would that be better @nakkaya?

I think that would help avoid the confusion.

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open for 120 days with no activity. Remove the stale label or comment, or this will be closed in 10 days.

github-actions[bot] commented 2 weeks ago

This issue was closed because it has been stalled for 10 days with no activity.