skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.75k stars, 501 forks

Cluster unable to get resource on Azure A10 Instance #3310

Closed binarycrayon closed 7 months ago

binarycrayon commented 7 months ago

Resources requested:

resources:
  cloud: azure
  ports: 8080
  accelerators: A10:1
  region: westus2

I was able to provision the instance, but the launch then blocked at: INFO: Waiting for task resources on 1 node. This will block if the cluster is full.

======== Autoscaler status: 2024-03-13 21:41:27.656076 ========
Node status
---------------------------------------------------------------
Active:
 1 ray.head.default
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/1.0 A10
 0.0/6.0 CPU
 0B/31.73GiB memory
 0B/15.87GiB object_store_memory

Demands:
 {'CPU': 0.5, 'A10': 1.0, 'GPU': 1.0} * 1 (STRICT_SPREAD): 1+ pending placement groups
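One way to read the status above: the pending placement group asks for a resource named GPU, while the Usage section lists only A10, CPU, and memory. The sketch below just compares the two dictionaries by hand. It is not Ray's scheduler logic, `unmet_resources` is a hypothetical helper, and the numbers are copied from the output above (memory entries omitted).

```python
# Totals from the "Usage" section of the autoscaler status (memory omitted).
cluster_total = {"A10": 1.0, "CPU": 6.0}

# The bundle from the "Demands" section.
demand = {"CPU": 0.5, "A10": 1.0, "GPU": 1.0}


def unmet_resources(demand, available):
    """Return the entries of `demand` that `available` cannot cover.

    Hypothetical helper for illustration only; a resource name that is
    absent from `available` is treated as having zero capacity.
    """
    return {
        name: amount
        for name, amount in demand.items()
        if available.get(name, 0.0) < amount
    }


missing = unmet_resources(demand, cluster_total)
print(missing)  # {'GPU': 1.0}
```

Under this reading, the node has enough A10 and CPU capacity but advertises no GPU resource at all, which would be consistent with the placement group staying pending even though the cluster is idle.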

Michaelvll commented 7 months ago

Thank you for reporting this issue @binarycrayon! We just pushed a fix for this in #3313. Could you help test if it works with A10 GPUs on Azure, as we don't have the quota for A10 on Azure? : )

If you would like to test it out, you can install the fix from that PR with:

pip uninstall skypilot skypilot-nightly
pip install git+https://github.com/skypilot-org/skypilot.git@bcac2d764ae5e5fcac8fd64549888573a0b1d39a

binarycrayon commented 7 months ago

Yes, confirmed the fix worked. Thanks so much for the quick fix!

I 03-14 20:22:51 cloud_vm_ray_backend.py:4237] Creating a new cluster: 'dialogue-choice-gemma-2b' [1x Azure(Standard_NV6ads_A10_v5, {'A10': 1}, ports=['8080'])].
I 03-14 20:22:51 cloud_vm_ray_backend.py:4237] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 03-14 20:22:57 cloud_vm_ray_backend.py:1364] To view detailed progress: tail -n100 -f /home/../sky_logs/sky-2024-03-14-20-22-48-834635/provision.log
I 03-14 20:22:58 cloud_vm_ray_backend.py:1754] Launching on Azure westus2
I 03-14 20:25:28 log_utils.py:45] Head node is up.
I 03-14 20:28:16 cloud_vm_ray_backend.py:1602] Successfully provisioned or found existing VM.
I 03-14 20:28:20 cloud_vm_ray_backend.py:3076] Running setup on 1 node.