skypilot-org / skypilot

SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[k8s] [GKE] Fail to request T4 instance #3506

Open asaiacai opened 2 months ago

asaiacai commented 2 months ago

I'm running a single T4 node on GKE. The nodes are properly labeled, as shown below, and the output of sky show-gpus --cloud kubernetes is also correct, but sky launch fails.

(sky) gcpuser@gfd-ebd1-head-evggxnxq-compute:~/skypilot$ kubectl describe nodes | grep cloud.google.com/gke-accelerator
                    cloud.google.com/gke-accelerator=nvidia-tesla-t4
                      cloud.google.com/gke-accelerator=nvidia-tesla-t4,cloud.google.com/gke-boot-disk=pd-balanced,cloud.google.com/gke-container-runtime=contain...
(sky) gcpuser@gfd-ebd1-head-evggxnxq-compute:~/skypilot$ sky show-gpus --cloud kubernetes
COMMON_GPU  AVAILABLE_QUANTITIES  
T4          1                     

Hint: use -a/--all to see all accelerators (including non-common ones) and pricing.
(sky) gcpuser@gfd-ebd1-head-evggxnxq-compute:~/skypilot$ sky launch --cloud kubernetes --gpus T4
I 05-02 08:48:17 optimizer.py:1209] No resource satisfying Kubernetes({'T4': 1}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request:
Task<name=sky-cmd>(run=<empty>)
  resources: Kubernetes({'T4': 1}).

To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

Version & Commit info:

romilbhardwaj commented 2 months ago

Thanks for the report @asaiacai - I'm unable to reproduce this on https://github.com/skypilot-org/skypilot/commit/d27e0ff83c56983920a655fbeaddc96b2758752e. Can you share a reproduction script, a bit more about how you created the cluster and the full output of sky launch --cloud kubernetes --gpus T4?

Here's how I created my cluster:

$ PROJECT_ID=$(gcloud config get-value project)
$ CLUSTER_NAME=gkeusc4

$ gcloud beta container --project "${PROJECT_ID}" clusters create "${CLUSTER_NAME}" --zone "us-central1-c" --no-enable-basic-auth --cluster-version "1.27.12-gke.1115000" --release-channel "regular" --machine-type "n1-standard-16" --accelerator "type=nvidia-tesla-t4,count=1" --image-type "COS_CONTAINERD" --disk-type "pd-balanced" --disk-size "100" --metadata disable-legacy-endpoints=true --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" --num-nodes "3" --logging=SYSTEM,WORKLOAD --monitoring=SYSTEM --enable-ip-alias --network "projects/${PROJECT_ID}/global/networks/default" --subnetwork "projects/${PROJECT_ID}/regions/us-central1/subnetworks/default" --no-enable-intra-node-visibility --default-max-pods-per-node "110" --security-posture=standard --workload-vulnerability-scanning=disabled --no-enable-master-authorized-networks --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver --enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 --enable-managed-prometheus --enable-shielded-nodes --node-locations "us-central1-c"

$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml

$ kubectl describe nodes | grep cloud.google.com/gke-accelerator
                    cloud.google.com/gke-accelerator=nvidia-tesla-t4

$ sky show-gpus --cloud kubernetes
COMMON_GPU  AVAILABLE_QUANTITIES
T4          1

$ sky launch --cloud kubernetes --gpus T4
# My cluster ran as expected.

asaiacai commented 2 months ago

I have an existing GKE cluster, cluster-1, to which I added a new node pool with one T4 instance. I shouldn't need to purge ~/.sky, right?

$ PROJECT_ID=$(gcloud config get-value project)
$ CLUSTER_NAME=cluster-1

$ gcloud beta container node-pools create t4-nodepool  --cluster=${CLUSTER_NAME}  --zone=us-central1-c  --node-locations=us-central1-c     --num-nodes=1     --total-min-nodes=1     --total-max-nodes=1     --reservation-affinity=none     --no-enable-autorepair     --location-policy=ANY   --machine-type=n1-standard-2     --accelerator type=nvidia-tesla-t4,count=1,gpu-driver-version=latest
Note: Machines with GPUs have certain limitations which may affect your workflow. Learn more at https://cloud.google.com/kubernetes-engine/docs/how-to/gpus
Note: Starting in GKE 1.30, if you don't specify a driver version, GKE installs the default GPU driver for your node's GKE version.
Creating node pool t4-nodepool...done.                                                                                             
Created [https://container.googleapis.com/v1beta1/projects/trainy-test/zones/us-central1-c/clusters/cluster-1/nodePools/t4-nodepool].
NAME         MACHINE_TYPE   DISK_SIZE_GB  NODE_VERSION
t4-nodepool  n1-standard-2  100           1.28.7-gke.1026000

$ kubectl describe nodes | grep cloud.google.com/gke-accelerator
                    cloud.google.com/gke-accelerator=nvidia-tesla-t4
                      cloud.google.com/gke-accelerator=nvidia-tesla-t4,cloud.google.com/gke-boot-disk=pd-balanced,cloud.google.com/gke-container-runtime=contain...

$ sky show-gpus --cloud kubernetes
COMMON_GPU  AVAILABLE_QUANTITIES  
T4          1                     

Hint: use -a/--all to see all accelerators (including non-common ones) and pricing.
$ sky launch --cloud kubernetes --gpus T4
I 05-02 18:08:39 optimizer.py:1209] No resource satisfying Kubernetes({'T4': 1}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request:
Task<name=sky-cmd>(run=<empty>)
  resources: Kubernetes({'T4': 1}).

To fix: relax or change the resource requirements.

$ sky -c
skypilot, commit d27e0ff83c56983920a655fbeaddc96b2758752e

romilbhardwaj commented 2 months ago

Ah, it looks like your instance does not have enough memory to satisfy the default resource request of 2 CPUs and 8GB memory. Note that some CPU millicores and memory go to k8s components, so an n1-standard-2 with 2 CPUs and 7.5GB memory cannot fit the default resources requested by SkyPilot.

This is surfaced in debug logs (export SKYPILOT_DEBUG=1):

D 05-02 13:40:51 kubernetes.py:344] Instance type 2CPU--8GB--1T4 does not fit in the Kubernetes cluster. Reason: GPU nodes with T4 do not have enough CPU and/or memory. Maximum resources found on a single node: 2.0 CPUs, 7.3G Memory

Explicitly specifying a lower CPU/mem request (e.g., sky launch --cloud kubernetes --gpus T4 --cpus 1 --memory 2) should work.
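
To make the arithmetic concrete, here is a minimal sketch of the comparison described above. It is an illustration only, not SkyPilot's actual fit check: the function name unmet_requirements is hypothetical, and the node figures are taken from the debug log line quoted above.

```python
# Illustrative only: NOT SkyPilot's actual fit check.
# unmet_requirements is a hypothetical helper written for this example.


def unmet_requirements(req_cpus, req_mem_gb, node_cpus, node_mem_gb):
    """Return the names of the resources the node cannot satisfy."""
    unmet = []
    if req_cpus > node_cpus:
        unmet.append("cpu")
    if req_mem_gb > node_mem_gb:
        unmet.append("memory")
    return unmet


# Largest T4 node according to the debug log: 2.0 CPUs, 7.3G memory.
NODE_CPUS, NODE_MEM_GB = 2.0, 7.3

# Default request (2 CPUs, 8GB) versus the reduced request from the
# suggested workaround (--cpus 1 --memory 2).
default_request = unmet_requirements(2, 8, NODE_CPUS, NODE_MEM_GB)
reduced_request = unmet_requirements(1, 2, NODE_CPUS, NODE_MEM_GB)

print("default 2 CPU / 8 GB:", default_request or "fits")
print("reduced 1 CPU / 2 GB:", reduced_request or "fits")
```

The default request fails only on memory (8 > 7.3), while the CPU request of 2 still fits, which is why lowering --memory is the part that matters here.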

A TODO for us is to improve the log messages: perhaps the resources: Kubernetes({'T4': 1}) line should also show the CPUs and memory requested. Leaving the issue open so we can fix the logging. Thanks for the report!

romilbhardwaj commented 2 months ago

Bumping the priority for this - another user ran into issues with SkyPilot being unable to use resources on k8s and had to set SKYPILOT_DEBUG=1 to surface the error. This should be logged at the info level.
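
As a rough illustration of why the reason stayed hidden, the sketch below uses Python's standard logging module, not SkyPilot's own logging setup. The logger name fit_check_demo is hypothetical. With the logger at the info level, a debug record is dropped, while the same message logged at info reaches the output.

```python
import io
import logging

# Capture log output in memory so the effect is easy to inspect.
stream = io.StringIO()
logger = logging.getLogger("fit_check_demo")  # hypothetical logger name
logger.propagate = False
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.INFO)  # typical default: debug records are dropped

reason = "GPU nodes with T4 do not have enough CPU and/or memory."

logger.debug(reason)  # filtered out at the info level
hidden_output = stream.getvalue()

logger.info(reason)  # emitted at the info level
visible_output = stream.getvalue()

print("after debug():", repr(hidden_output))
print("after info(): ", repr(visible_output))
```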