Open Michaelvll opened 3 months ago
A user ran into this GCP reauth error while a service was spinning up new replicas:
However, when it tries to load another replica it keeps failing in provision.

    sky-service-7238 45 2 - - - FAILED_PROVISION - -
    sky-service-7238 46 2 - - - FAILED_PROVISION - -
    sky-service-7238 47 2 - - - FAILED_PROVISION - -
    sky-service-7238 48 2 - - - FAILED_PROVISION - -
    sky-service-7238 49 2 - - - PROVISIONING - -

When I see the log it looks like it is failing because of the login. From log of sky-service-7238 49:

    google.auth.exceptions.RefreshError: Reauthentication is needed. Please run gcloud auth application-default login to reauthenticate.
That is, the initial spin-up of the service succeeded because the reauth timeout had not yet been reached; later, autoscaling or recovery of replicas triggered new provisioning after the timeout had expired, so those provisioning attempts failed.
Suggested workaround is to
Should probably find a more elegant way to handle this.
This has occurred to some other users as well. We need to prioritize this.
If the user only uses GCP, a simpler workaround is to set:
gcp.remote_identity: SERVICE_ACCOUNT
in ~/.sky/config.yaml
When GCP reauth is enabled, the user's credentials can expire periodically, and we should error out quickly in
sky launch
if the credentials have expired.
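
A rough sketch of what such a fail-fast check could look like, run before any provisioning starts. This is not SkyPilot's actual implementation: `check_gcp_credentials` and `CredentialExpiredError` are hypothetical names, and the sketch assumes that `gcloud auth application-default print-access-token` exits non-zero when the application-default credentials need reauthentication.

```python
import subprocess
from typing import Callable

# Real gcloud subcommand. Assumption: it exits non-zero when the
# application-default credentials can no longer be refreshed.
_CHECK_CMD = ["gcloud", "auth", "application-default", "print-access-token"]


class CredentialExpiredError(RuntimeError):
    """Hypothetical error type for this sketch; not a SkyPilot class."""


def check_gcp_credentials(
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> None:
    """Raise CredentialExpiredError if a fresh access token cannot be obtained.

    `run` is injectable so the check can be exercised without gcloud installed.
    """
    try:
        result = run(
            _CHECK_CMD,
            stdin=subprocess.DEVNULL,  # never block on an interactive prompt
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired) as e:
        raise CredentialExpiredError(f"Could not run gcloud: {e}") from e
    if result.returncode != 0:
        raise CredentialExpiredError(
            "GCP credentials need reauthentication. Run "
            "`gcloud auth application-default login` and retry.\n"
            f"gcloud stderr: {result.stderr.strip()}"
        )
```

Calling `check_gcp_credentials()` at the start of a launch would surface the problem in seconds instead of after a failed provisioning attempt. It would not help replicas that are provisioned long after the initial launch, which is the case the service account workaround addresses.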