skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.82k stars 513 forks source link

[Core] Expired credentials causes unexpected failure of `sky launch` #4373

Open Michaelvll opened 6 days ago

Michaelvll commented 6 days ago

Many cloud credentials have expiration, e.g. AWS SSO, or kubeconfig with gcloud authentication. Expiration of these credentials means, although the cloud is enabled in the cache, sky launch will go to those cloud and try to launch the cluster on those cloud causing unexpected failure and not automatically failover, e.g.:

Exception: ApiException: (401)
Reason: Unauthorized
HTTP response headers: HTTPHeaderDict({'Audit-Id': '7f9b7f8a-d526-44fc-ace8-5c12093e441f', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Fri, 15 Nov 2024 20:40:38 GMT', 'Content-Length': '129'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}

Version & Commit info:

zpoint commented 2 days ago

Related issue