Closed infwinston closed 2 years ago
Got another new error code, UNSUPPORTED_OPERATION, while provisioning a spot VM:
(sky-a675-weichiang pid=180987) I 05-12 07:01:34 cloud_vm_ray_backend.py:951] Launching on GCP us-central1 (us-central1-a)
(sky-a675-weichiang pid=180987) W 05-12 07:02:05 cloud_vm_ray_backend.py:435] Got UNSUPPORTED_OPERATION in us-central1-a (message: Instance failed to start due to preemption.)
...
(sky-a675-weichiang pid=180987) AssertionError: {'code': 'UNSUPPORTED_OPERATION', 'message': 'Instance failed to start due to preemption.'}
Thank you for capturing those new errors! Our exception list is manually maintained, and we have not fully tested spot instances yet, since they are not very usable without recovery. Please feel free to add those error codes to our error list.
Sure, I'll add the error! For UNSUPPORTED_OPERATION, I'd blame Google Cloud, as there seems to be no documentation on it anywhere. It's impossible for us to find out about it before we actually hit it.
When provisioning a spot 8xA100 instance that is unavailable, Sky failed immediately and didn't fail over to other regions. The reason is that Sky only handles the GCP return code ZONE_RESOURCE_POOL_EXHAUSTED but not ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS, which is also valid according to this doc. PR https://github.com/sky-proj/sky/pull/829 also fixes this.
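For illustration, here is a rough sketch of the kind of check being discussed: treat every known capacity-related code as a signal to move on to the next zone, and fail loudly only on codes that are not in the list. This is not Sky's actual code. The names ZONE_FAILOVER_CODES, should_try_next_zone, and launch_with_failover are hypothetical, and treating UNSUPPORTED_OPERATION as a failover signal is an assumption based on the preemption message in the log above.

```python
# Hypothetical sketch only; the names below are not from the Sky codebase.

# GCP error codes that (by assumption) mean "no capacity here, try elsewhere".
ZONE_FAILOVER_CODES = frozenset({
    'ZONE_RESOURCE_POOL_EXHAUSTED',
    'ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS',
    # Seen with the message "Instance failed to start due to preemption."
    'UNSUPPORTED_OPERATION',
})


def should_try_next_zone(error):
    """Return True if the error dict carries a known failover code."""
    return error.get('code') in ZONE_FAILOVER_CODES


def launch_with_failover(zones, launch_fn):
    """Try each zone in order.

    launch_fn(zone) is assumed to return a list of error dicts of the form
    {'code': ..., 'message': ...}, or an empty list on success.
    Returns the zone that succeeded, or None if every zone was exhausted.
    """
    for zone in zones:
        errors = launch_fn(zone)
        if not errors:
            return zone
        unknown = [e for e in errors if not should_try_next_zone(e)]
        if unknown:
            # An unrecognized code should surface instead of being skipped.
            raise RuntimeError(
                f'Unhandled provisioning errors in {zone}: {unknown}')
    return None
```

Keeping the codes in a single set means a newly discovered code needs only a one-line addition, although the list still has to be maintained by hand as new codes turn up.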