Open zaptrem opened 1 week ago
Could you share the configs you are setting and the command you are running? It is supposed to wait and take the capacity block once it becomes available. Could you also share the output of `sky jobs logs --controller` for the job you were running?
Could you share the part of the code that does this waiting? I gave up and switched to EKS but am interested anyway. Thanks!
You could either use `--retry-until-up` to wait for the resources, or use `sky jobs launch` to have SkyPilot wait for the resources. You may want to specify the region of your reservation to limit retries to that region.
The code that waits for the resources: https://github.com/skypilot-org/skypilot/blob/master/sky/jobs/recovery_strategy.py#L127-L144
Here `retry_until_up` is True by default.
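For anyone skimming, here is a minimal sketch of what "retry until up" behavior generally looks like. It is not the code from `recovery_strategy.py`; `launch_with_retry`, `try_launch`, `gap_seconds`, and `max_attempts` are hypothetical names used only for illustration.

```python
import time
from typing import Callable, Optional


def launch_with_retry(
    try_launch: Callable[[], bool],
    retry_until_up: bool = True,
    gap_seconds: float = 60.0,
    max_attempts: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call try_launch() until it succeeds (hypothetical helper, not SkyPilot's).

    try_launch returns True when the resources were provisioned and False
    when capacity was unavailable. With retry_until_up=False only one
    attempt is made.
    """
    attempt = 0
    while True:
        attempt += 1
        if try_launch():
            return True
        if not retry_until_up:
            return False
        if max_attempts is not None and attempt >= max_attempts:
            return False
        # Wait before asking the provider again.
        sleep(gap_seconds)
```

Note that a loop like this only helps if each attempt can actually see the reservation; if the provisioning step ignores blocks that have not started yet, every attempt before the start time fails in the same way.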
https://skypilot.readthedocs.io/en/latest/reservations/reservations.html Docs say
However, this doesn't actually happen. In aws utils it filters by blocks that have already started, and I can't find any logic elsewhere that would actually wait on reservations to start. Unfortunate since it's 1:30am before the block was gonna start and given the docs I was expecting not to have to stay up :(
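To make the gap concrete, here is a rough sketch of the difference between filtering to blocks that have already started and computing how long to wait for the next one. This is my own illustration, not SkyPilot code; `CapacityBlock`, `active_blocks`, and `seconds_until_next_block` are made-up names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class CapacityBlock:
    """Hypothetical record for a reserved capacity block."""
    block_id: str
    start: datetime
    end: datetime


def active_blocks(blocks: List[CapacityBlock], now: datetime) -> List[CapacityBlock]:
    # The kind of filter described above: only blocks that have already started.
    return [b for b in blocks if b.start <= now < b.end]


def seconds_until_next_block(
    blocks: List[CapacityBlock], now: datetime
) -> Optional[float]:
    """Return 0.0 if a block is active, the wait in seconds until the
    earliest upcoming block starts, or None if there is nothing to wait for."""
    if active_blocks(blocks, now):
        return 0.0
    upcoming = [b.start for b in blocks if b.start > now]
    if not upcoming:
        return None
    return (min(upcoming) - now).total_seconds()


block = CapacityBlock(
    "cb-example",  # placeholder ID
    start=datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc),
    end=datetime(2024, 1, 2, 2, 0, tzinfo=timezone.utc),
)
now = datetime(2024, 1, 1, 1, 30, tzinfo=timezone.utc)
print(active_blocks([block], now))             # empty: the block has not started
print(seconds_until_next_block([block], now))  # 1800.0 seconds until it starts
```

With only the first function, a launch attempted half an hour early sees no usable reservation; something like the second function is what I expected to be driving the wait.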
Additionally, it appears to pretend instances in the catalog that don't have prices (e.g., H200s on AWS) simply don't exist unless you modify the catalog to add a price.