skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[GCP] TPU v4 fails to launch #3840

Open romilbhardwaj opened 1 month ago

romilbhardwaj commented 1 month ago
$ sky launch -c test task.yaml --cloud gcp --gpus tpu-v4-16

I 08-16 17:27:27 provisioner.py:65] Launching on GCP us-central2 (us-central2-b)
W 08-16 17:27:37 instance_utils.py:112] Got return code 'CREATION_FAILED' in us-central2-b: 'Cloud TPU received a bad request. the accelerator v4-128 was not found in zone us-central2-b [EID: 0x945ef875d10e8e13]'
D 08-16 17:27:37 provisioner.py:171] Failed to provision 'test' on GCP (us-central2-b).

The same failure happens for other TPU v4 slice sizes (8, 16, 32, ...) too.
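Note that the command above requests `tpu-v4-16`, while the error from Cloud TPU names `v4-128`. A minimal Python sketch for checking that mismatch in other runs follows. The helper names `requested_tpu` and `reported_tpu` are hypothetical (they are not part of SkyPilot), and the regular expressions assume the exact command and message wording shown in this issue. This only surfaces the discrepancy; it does not establish the root cause.

```python
import re


def requested_tpu(command: str):
    """Return (generation, size) from a `--gpus tpu-<gen>-<size>` flag, or None."""
    m = re.search(r"--gpus\s+tpu-(v\d+[a-z]*)-(\d+)", command)
    return (m.group(1), int(m.group(2))) if m else None


def reported_tpu(log_line: str):
    """Return (generation, size) named in the 'accelerator ... was not found' error, or None."""
    m = re.search(r"the accelerator (v\d+[a-z]*)-(\d+) was not found", log_line)
    return (m.group(1), int(m.group(2))) if m else None


# Command and log line copied from this issue.
command = "sky launch -c test task.yaml --cloud gcp --gpus tpu-v4-16"
log_line = (
    "W 08-16 17:27:37 instance_utils.py:112] Got return code 'CREATION_FAILED' "
    "in us-central2-b: 'Cloud TPU received a bad request. the accelerator "
    "v4-128 was not found in zone us-central2-b [EID: 0x945ef875d10e8e13]'"
)

req = requested_tpu(command)
rep = reported_tpu(log_line)
if req is not None and rep is not None and req != rep:
    print(f"Mismatch: requested {req[0]}-{req[1]}, but the API was asked for {rep[0]}-{rep[1]}")
```

Running the same check against logs for the other slice sizes might show whether the requested and reported sizes differ in a consistent way.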

Version & Commit info:

Michaelvll commented 1 week ago

Just a note, it seems v5 is working fine.