skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.75k stars 501 forks

SkyPilot doesn't actually wait on capacity blocks as the docs say it does #4155

Open zaptrem opened 1 week ago

zaptrem commented 1 week ago

The docs at https://skypilot.readthedocs.io/en/latest/reservations/reservations.html say:

If you have a capacity block with a starting time in the future, you can run sky jobs launch --region us-east-1 --gpus H100:8 task.yaml to let SkyPilot automatically wait until the starting time is reached. Namely, you don’t have to wake up at 4:30am PDT to launch your job on a newly available capacity block.

However, this doesn't actually happen. The AWS utils filter for blocks that have already started, and I can't find any logic elsewhere that would wait for a reservation to start. That's unfortunate, since it's 1:30am, before the block is due to start, and given the docs I was expecting not to have to stay up :(
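To make the distinction concrete, here is a minimal sketch of the difference between filtering for blocks that have already started and computing how long to wait for a future one. It is not SkyPilot's code: the CapacityBlock record and both helper functions are hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class CapacityBlock:
    # Hypothetical record; the field names are illustrative only.
    block_id: str
    start: datetime
    end: datetime


def active_blocks(blocks, now):
    """Keep only blocks that have already started and have not yet ended."""
    return [b for b in blocks if b.start <= now < b.end]


def seconds_until_next_block(blocks, now):
    """Seconds until a block is usable: 0.0 if one is active, None if none is active or upcoming."""
    if active_blocks(blocks, now):
        return 0.0
    upcoming = [b.start for b in blocks if b.start > now]
    if not upcoming:
        return None
    return (min(upcoming) - now).total_seconds()


now = datetime(2024, 10, 1, 8, 30, tzinfo=timezone.utc)
block = CapacityBlock("cb-example", now + timedelta(hours=3), now + timedelta(hours=27))
print(active_blocks([block], now))             # the future block is filtered out
print(seconds_until_next_block([block], now))  # time left before it starts
```

A filter like active_blocks alone makes a future block invisible; something would still have to either sleep for the computed interval or keep retrying until the block shows up.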

Additionally, instances in the catalog that don't have prices (e.g., H200s on AWS) appear to be treated as if they don't exist unless you modify the catalog to add a price.

Michaelvll commented 1 week ago

Could you share the configs you are setting and the command you are running? It is supposed to wait and take the capacity block once it becomes available. Could you also share the output of sky jobs logs --controller for the job you were running?

zaptrem commented 6 days ago

Could you share the part of the code that does this waiting? I gave up and switched to EKS but am interested anyway. Thanks!

Michaelvll commented 5 days ago

You could either use --retry-until-up to wait for the resources, or use sky jobs launch to have SkyPilot wait for the resources. You may want to specify the region of your reservation to limit the retries to that region.

The code that waits for the resources: https://github.com/skypilot-org/skypilot/blob/master/sky/jobs/recovery_strategy.py#L127-L144

Here, retry_until_up is True by default.
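As a rough picture of what retrying until the resources come up means, here is a minimal sketch of such a loop. It is not the code in recovery_strategy.py: launch_with_retry, try_launch, gap_seconds, and max_attempts are hypothetical names chosen for illustration, and the real implementation may differ in how it backs off and what it treats as a failure.

```python
import time


def launch_with_retry(try_launch, retry_until_up=True, gap_seconds=60.0,
                      max_attempts=None, sleep=time.sleep):
    """Call try_launch() until it returns True.

    Returns the number of attempts used, or None if the launch gave up.
    All parameter names here are illustrative assumptions.
    """
    attempts = 0
    while True:
        attempts += 1
        if try_launch():
            return attempts
        if not retry_until_up:
            # Single-shot mode: give up after the first failure.
            return None
        if max_attempts is not None and attempts >= max_attempts:
            return None
        # Wait before trying again, e.g. until a reservation becomes usable.
        sleep(gap_seconds)


# Example: the resource becomes available on the third attempt.
results = iter([False, False, True])
waits = []
print(launch_with_retry(lambda: next(results), sleep=waits.append))
print(waits)
```

With a loop of this shape, a capacity block that has not started yet simply looks like unavailable capacity, and the launch succeeds on the first retry after the block's start time. Restricting the region, as suggested above, keeps each retry from wandering to other regions.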