Open asaiacai opened 2 months ago
Thanks for the report @asaiacai - I'm unable to reproduce this on https://github.com/skypilot-org/skypilot/commit/d27e0ff83c56983920a655fbeaddc96b2758752e. Can you share a reproduction script, a bit more about how you created the cluster, and the full output of sky launch --cloud kubernetes --gpus T4?
Here's how I created my cluster:
$ PROJECT_ID=$(gcloud config get-value project)
$ CLUSTER_NAME=gkeusc4
$ gcloud beta container --project "${PROJECT_ID}" clusters create "${CLUSTER_NAME}" --zone "us-central1-c" --no-enable-basic-auth --cluster-version "1.27.12-gke.1115000" --release-channel "regular" --machine-type "n1-standard-16" --accelerator "type=nvidia-tesla-t4,count=1" --image-type "COS_CONTAINERD" --disk-type "pd-balanced" --disk-size "100" --metadata disable-legacy-endpoints=true --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" --num-nodes "3" --logging=SYSTEM,WORKLOAD --monitoring=SYSTEM --enable-ip-alias --network "projects/${PROJECT_ID}/global/networks/default" --subnetwork "projects/${PROJECT_ID}/regions/us-central1/subnetworks/default" --no-enable-intra-node-visibility --default-max-pods-per-node "110" --security-posture=standard --workload-vulnerability-scanning=disabled --no-enable-master-authorized-networks --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver --enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 --enable-managed-prometheus --enable-shielded-nodes --node-locations "us-central1-c"
$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml
$ kubectl describe nodes | grep cloud.google.com/gke-accelerator
cloud.google.com/gke-accelerator=nvidia-tesla-t4
$ sky show-gpus --cloud kubernetes
COMMON_GPU AVAILABLE_QUANTITIES
T4 1
$ sky launch --cloud kubernetes --gpus T4
# My cluster ran as expected.
I have an existing GKE cluster, cluster-1, in which I created a new node pool with one T4 instance. I shouldn't need to purge ~/.sky, right?
$ PROJECT_ID=$(gcloud config get-value project)
$ CLUSTER_NAME=cluster-1
$ gcloud beta container node-pools create t4-nodepool --cluster=${CLUSTER_NAME} --zone=us-central1-c --node-locations=us-central1-c --num-nodes=1 --total-min-nodes=1 --total-max-nodes=1 --reservation-affinity=none --no-enable-autorepair --location-policy=ANY --machine-type=n1-standard-2 --accelerator type=nvidia-tesla-t4,count=1,gpu-driver-version=latest
Note: Machines with GPUs have certain limitations which may affect your workflow. Learn more at https://cloud.google.com/kubernetes-engine/docs/how-to/gpus
Note: Starting in GKE 1.30, if you don't specify a driver version, GKE installs the default GPU driver for your node's GKE version.
Creating node pool t4-nodepool...done.
Created [https://container.googleapis.com/v1beta1/projects/trainy-test/zones/us-central1-c/clusters/cluster-1/nodePools/t4-nodepool].
NAME MACHINE_TYPE DISK_SIZE_GB NODE_VERSION
t4-nodepool n1-standard-2 100 1.28.7-gke.1026000
$ kubectl describe nodes | grep cloud.google.com/gke-accelerator
cloud.google.com/gke-accelerator=nvidia-tesla-t4
cloud.google.com/gke-accelerator=nvidia-tesla-t4,cloud.google.com/gke-boot-disk=pd-balanced,cloud.google.com/gke-container-runtime=contain...
$ sky show-gpus --cloud kubernetes
COMMON_GPU AVAILABLE_QUANTITIES
T4 1
Hint: use -a/--all to see all accelerators (including non-common ones) and pricing.
$ sky launch --cloud kubernetes --gpus T4
I 05-02 18:08:39 optimizer.py:1209] No resource satisfying Kubernetes({'T4': 1}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request:
Task<name=sky-cmd>(run=<empty>)
resources: Kubernetes({'T4': 1}).
To fix: relax or change the resource requirements.
$ sky -c
skypilot, commit d27e0ff83c56983920a655fbeaddc96b2758752e
Ah, looks like your instance does not have enough memory to satisfy the default resource request of 2 CPUs and 8 GB of memory. Note that some CPU millicores and memory go to Kubernetes system components, so an n1-standard-2 with 2 CPUs and 7.5 GB of memory cannot fit the default resources requested by SkyPilot.
This is surfaced in debug logs (export SKYPILOT_DEBUG=1):
D 05-02 13:40:51 kubernetes.py:344] Instance type 2CPU--8GB--1T4 does not fit in the Kubernetes cluster. Reason: GPU nodes with T4 do not have enough CPU and/or memory. Maximum resources found on a single node: 2.0 CPUs, 7.3G Memory
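The check behind that debug line can be sketched as follows. This is an illustrative Python sketch, not SkyPilot's actual code; the function name, signature, and the ~7.3 GB allocatable figure are taken from the log above:

```python
# Minimal sketch of a node-fitting check: a request fits only if both the CPU
# and memory requests are within the node's *allocatable* resources, which are
# smaller than the machine's raw capacity because some CPU millicores and
# memory are reserved for Kubernetes system components.

def fits_on_node(req_cpus: float, req_mem_gb: float,
                 alloc_cpus: float, alloc_mem_gb: float) -> bool:
    """Return True if the requested CPUs/memory fit on a single node."""
    return req_cpus <= alloc_cpus and req_mem_gb <= alloc_mem_gb

# n1-standard-2 GPU node: 2.0 allocatable CPUs, ~7.3 GB allocatable memory
# (out of 2 vCPUs / 7.5 GB total).
print(fits_on_node(2, 8, 2.0, 7.3))  # default 2 CPU / 8 GB request: False
print(fits_on_node(1, 2, 2.0, 7.3))  # --cpus 1 --memory 2: True
```

This is why the node "passes" the GPU check (it does have a T4) but still fails scheduling: the memory request alone exceeds what is allocatable.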
Explicitly specifying a lower CPU/memory request (e.g., sky launch --cloud kubernetes --gpus T4 --cpus 1 --memory 2) should work.
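If you prefer pinning this in a task YAML rather than on the CLI, a resources spec along these lines should be equivalent (a sketch following SkyPilot's task YAML syntax; the `+` suffix means "at least"):

```yaml
# task.yaml - request a T4 with reduced CPU/memory so it fits on n1-standard-2
resources:
  cloud: kubernetes
  accelerators: T4:1
  cpus: 1+
  memory: 2+
```

Then launch with `sky launch task.yaml`.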
TODO for us is to improve the log messages - perhaps resources: Kubernetes({'T4': 1}) should have shown the CPUs and memory requested. Leaving the issue open for us to fix the logging. Thanks for the report!
Bumping the priority on this - another user ran into SkyPilot being unable to use resources on k8s and had to set SKYPILOT_DEBUG=1 to surface the error. This should be logged at the info level.
I'm running a single T4 node on GKE. The nodes are properly labeled as shown below, and sky show-gpus --cloud kubernetes is also correct, but launching fails.

Version & commit info:
$ sky -v
skypilot, version 1.0.0-dev0
$ sky -c
skypilot, commit 889adce65602b76e31f60534ce25c264bad7cb83