skypilot-org / skypilot

SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.2k stars 426 forks

[UX/GCP] Explicit error when GCP reauth is set #3393

Open Michaelvll opened 3 months ago

Michaelvll commented 3 months ago

When GCP reauth is set, the user's credentials can expire periodically. sky launch should fail fast with an explicit error when the credentials have expired.
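As an illustration of the fail-fast behavior requested above, here is a minimal sketch of a pre-launch credential check. It is not SkyPilot's implementation: the names `GcpReauthRequiredError`, `fail_fast_on_expired_credentials`, and `check_gcp_credentials` are hypothetical, and the sketch assumes the `google-auth` package is installed and Application Default Credentials are in use. It forces a token refresh up front and converts a `google.auth.exceptions.RefreshError` into a short, actionable error.

```python
from typing import Callable, Tuple, Type


class GcpReauthRequiredError(RuntimeError):
    """Hypothetical error type: the GCP credentials need reauthentication."""


def fail_fast_on_expired_credentials(
        refresh: Callable[[], None],
        refresh_errors: Tuple[Type[BaseException], ...]) -> None:
    """Run `refresh` once; turn a refresh failure into an explicit error."""
    try:
        refresh()
    except refresh_errors as e:
        raise GcpReauthRequiredError(
            'GCP credentials have expired. Run '
            '`gcloud auth application-default login` to reauthenticate, '
            'then retry.') from e


def check_gcp_credentials() -> None:
    """Check Application Default Credentials before any provisioning starts."""
    # Imported lazily so this module loads even without google-auth installed.
    import google.auth
    import google.auth.exceptions
    import google.auth.transport.requests

    def refresh() -> None:
        # Load the default credentials and force a token refresh now,
        # instead of discovering the expiry in the middle of provisioning.
        credentials, _project = google.auth.default()
        credentials.refresh(google.auth.transport.requests.Request())

    fail_fast_on_expired_credentials(
        refresh, (google.auth.exceptions.RefreshError,))
```

Calling `check_gcp_credentials()` before a launch, and again before each new replica is provisioned, would surface the expiry as one clear error instead of repeated provisioning failures. Missing credentials (`DefaultCredentialsError`) are deliberately left unhandled in this sketch.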

concretevitamin commented 1 month ago

A user ran into this GCP reauth error during a service spinning up new replicas:

However, when it tries to load another replica it keeps fails in provision.

sky-service-7238  45  2  -  -  -  FAILED_PROVISION  -  -
sky-service-7238  46  2  -  -  -  FAILED_PROVISION  -  -
sky-service-7238  47  2  -  -  -  FAILED_PROVISION  -  -
sky-service-7238  48  2  -  -  -  FAILED_PROVISION  -  -
sky-service-7238  49  2  -  -  -  PROVISIONING      -  -

When I see the log it looks like it is failing because of the login. From log of sky-service-7238 49:

google.auth.exceptions.RefreshError: Reauthentication is needed. Please run gcloud auth application-default login to reauthenticate.

Namely, the initial spin-up of the service succeeded because the reauth timeout had not yet been reached; later, autoscaling or replica recovery triggered new provisioning, and by then the timeout had expired.

Suggested workaround is to

We should probably find a more elegant way to handle this.

Michaelvll commented 4 weeks ago

This has happened to other users as well. We need to prioritize this.

If the user is GCP only, a simpler workaround is to: