skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

sky spot failover doesn't work on GCP #831

Closed infwinston closed 2 years ago

infwinston commented 2 years ago

When provisioning a spot 8xA100 instance that is unavailable, Sky fails immediately instead of failing over to other regions. The reason is that Sky handles only the GCP error code ZONE_RESOURCE_POOL_EXHAUSTED and not ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS, which is also a valid code according to GCP's documentation. PR https://github.com/sky-proj/sky/pull/829 also includes a fix for this.
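The gap described above can be sketched as follows. This is a minimal illustration, not the actual code in cloud_vm_ray_backend.py: the function name, its signature, and the `blocked_zones` set are hypothetical, and only the two error codes are taken from this issue. The idea is to treat both codes as "zone out of capacity" and blocklist the zone so that failover can continue, instead of hitting `assert False`.

```python
from typing import Dict, Set

# Error codes meaning the zone has no capacity for the request.
# Both codes are taken from this issue; the constant name is hypothetical.
ZONE_EXHAUSTED_CODES = frozenset({
    'ZONE_RESOURCE_POOL_EXHAUSTED',
    'ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS',
})


def update_blocklist_on_gcp_error(error: Dict[str, str], zone: str,
                                  blocked_zones: Set[str]) -> bool:
    """Blocklist `zone` if `error` signals capacity exhaustion.

    Returns True when the error was recognized (the caller can move on
    to the next zone or region), and False otherwise.
    """
    if error.get('code') in ZONE_EXHAUSTED_CODES:
        blocked_zones.add(zone)
        return True
    return False


# Usage with the error dict shape shown in the traceback.
blocked: Set[str] = set()
err = {
    'code': 'ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS',
    'message': 'The zone does not have enough resources available.',
}
handled = update_blocklist_on_gcp_error(err, 'us-central1-a', blocked)
```

Keeping the codes in one set means a newly discovered code is a one-line addition rather than a new branch.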

(sky) weichiang@blaze:~/repos/sky$ sky launch --use-spot examples/misc/A100.yml                                                                                                           
Task from YAML spec: examples/misc/A100.yml                                                                                                                                               
Launching a new cluster. Proceed? [Y/n]:                                                                                                                                                  
I 05-11 22:37:21 optimizer.py:617] Estimated cost: ~$25.3/hr                                                                                                                              
I 05-11 22:37:21 optimizer.py:632]                                                                                                                                                        
I 05-11 22:37:21 optimizer.py:632] TASK      BEST_RESOURCE                                                                                                                                
I 05-11 22:37:21 optimizer.py:632] gcp-a100  GCP(a2-highgpu-8g[Spot], {'A100': 8})                                                                                                        
I 05-11 22:37:21 optimizer.py:632]                                                                                                                                                        
I 05-11 22:37:21 cloud_vm_ray_backend.py:1239] Creating a new cluster: "sky-390c-weichiang" [1x GCP(a2-highgpu-8g[Spot], {'A100': 8})].                                                   
I 05-11 22:37:21 cloud_vm_ray_backend.py:1239] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.                                      
I 05-11 22:37:21 cloud_vm_ray_backend.py:768] To view detailed progress: tail -n100 -f /home/eecs/weichiang/sky_logs/sky-2022-05-11-22-37-21-443605/provision.log                         
I 05-11 22:37:23 cloud_vm_ray_backend.py:948] Launching on GCP us-central1 (us-central1-a)                                                                                                
W 05-11 22:37:44 cloud_vm_ray_backend.py:435] Got ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS in us-central1-a (message: The zone 'projects/intercloud-320520/zones/us-central1-a' does not have enough resources available to fulfill the request.  '(resource type:compute)'.)
Traceback (most recent call last):                                                                                                                                                        
  File "/home/eecs/weichiang/repos/sky/sky/execution.py", line 130, in _execute                                                                                                           
    handle = backend.provision(task,                                                                                                                                                      
  File "/home/eecs/weichiang/repos/sky/sky/backends/cloud_vm_ray_backend.py", line 1304, in provision                                                                                     
    config_dict = provisioner.provision_with_retries(                                                                                                                                     
  File "/home/eecs/weichiang/repos/sky/sky/backends/cloud_vm_ray_backend.py", line 1046, in provision_with_retries                                                                        
    config_dict = self._retry_region_zones(                                                                         
  File "/home/eecs/weichiang/repos/sky/sky/backends/cloud_vm_ray_backend.py", line 859, in _retry_region_zones                                                                            
    self._update_blocklist_on_error(to_provision.cloud, region,                                                                 
  File "/home/eecs/weichiang/repos/sky/sky/backends/cloud_vm_ray_backend.py", line 559, in _update_blocklist_on_error               
    return self._update_blocklist_on_gcp_error(region, zones, stdout,                                                  
  File "/home/eecs/weichiang/repos/sky/sky/backends/cloud_vm_ray_backend.py", line 455, in _update_blocklist_on_gcp_error           
    assert False, error                                                                                                
AssertionError: {'code': 'ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS', 'message': "The zone 'projects/intercloud-320520/zones/us-central1-a' does not have enough resources available to fulfill the request.  '(resource type:compute)'."}

infwinston commented 2 years ago

Got another new error code, UNSUPPORTED_OPERATION, while provisioning a spot VM:

(sky-a675-weichiang pid=180987) I 05-12 07:01:34 cloud_vm_ray_backend.py:951] Launching on GCP us-central1 (us-central1-a)
(sky-a675-weichiang pid=180987) W 05-12 07:02:05 cloud_vm_ray_backend.py:435] Got UNSUPPORTED_OPERATION in us-central1-a (message: Instance failed to start due to preemption.)
...
(sky-a675-weichiang pid=180987) AssertionError: {'code': 'UNSUPPORTED_OPERATION', 'message': 'Instance failed to start due to preemption.'}

Michaelvll commented 2 years ago

Thank you for capturing those new errors! Our exception list is maintained manually, and we have not fully tested spot before, since spot instances are not very usable without recovery. Please feel free to add those error codes to our error list.

infwinston commented 2 years ago

Sure, I'll add the errors! For UNSUPPORTED_OPERATION I'd blame Google Cloud, as there seems to be no documentation on it anywhere. It's impossible for us to anticipate it before we actually hit it.
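Since undocumented codes can only be discovered by hitting them, one possible way to keep failover alive is a logged fallback for unknown codes. The sketch below is an assumption, not what the project does: the function, constant, and logger names are hypothetical, and treating every unknown code as a zone-level failure is a design choice that may not suit errors unrelated to capacity.

```python
import logging
from typing import Dict, Set

logger = logging.getLogger(__name__)

# Codes seen in this issue; the constant name is hypothetical.
KNOWN_ZONE_FAILURE_CODES = frozenset({
    'ZONE_RESOURCE_POOL_EXHAUSTED',
    'ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS',
    'UNSUPPORTED_OPERATION',
})


def handle_gcp_provision_error(error: Dict[str, str], zone: str,
                               blocked_zones: Set[str]) -> bool:
    """Blocklist `zone` for any provisioning error, logging unknown codes.

    Returns True if the code was already known, False if it was new.
    Unknown codes are logged instead of triggering an assertion, so the
    caller can still try the next zone.
    """
    code = error.get('code', '<missing>')
    known = code in KNOWN_ZONE_FAILURE_CODES
    if not known:
        logger.warning(
            'Unrecognized GCP error code %s in %s (message: %s); '
            'blocklisting the zone and continuing.',
            code, zone, error.get('message', ''))
    blocked_zones.add(zone)
    return known
```

The warning keeps new codes visible in the logs, so they can still be added to the manually maintained list once they have been observed.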