skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[AWS] Credential retry for rotation is not effective #4275

Open Michaelvll opened 2 weeks ago

Michaelvll commented 2 weeks ago

During AWS credential rotation, the AWS client may temporarily be unable to find the credentials. Our retry logic is not effective in this case, because the client only checks the credentials when an operation is called, not when the client is created.

https://github.com/skypilot-org/skypilot/blob/654ed4a2b7623693be5c2a1b9cfc44348f275463/sky/adaptors/aws.py#L99-L119

To reproduce:

  1. Remove the AWS credentials.
  2. Run the following in a Python shell (aws here is the sky.adaptors.aws module):

>>> aws.client('ec2', region_name='us-east-1')
<botocore.client.EC2 object at 0x7f1eaa79e920>
>>> sts = aws.client('sts')
>>> sts.get_caller_identity()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/botocore/client.py", line 565, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/botocore/client.py", line 1001, in _make_api_call
    http, parsed_response = self._make_request(
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/botocore/client.py", line 1027, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/botocore/endpoint.py", line 119, in make_request
    return self._send_request(request_dict, operation_model)
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/botocore/endpoint.py", line 198, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/botocore/endpoint.py", line 134, in create_request
    self._event_emitter.emit(
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/botocore/hooks.py", line 412, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/botocore/hooks.py", line 256, in emit
    return self._emit(event_name, kwargs)
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/botocore/hooks.py", line 239, in _emit
    response = handler(**kwargs)
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/botocore/signers.py", line 105, in handler
    return self.sign(operation_name, request)
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/botocore/signers.py", line 199, in sign
    auth.add_auth(request)
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/botocore/auth.py", line 418, in add_auth
    raise NoCredentialsError()
botocore.exceptions.NoCredentialsError: Unable to locate credentials

This can cause the jobs controller to fail to refresh or terminate clusters.
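One possible direction, as a minimal sketch rather than a proposed patch: since the error only surfaces when an operation is called, the retry could wrap the operation call instead of the client creation. The helper name `call_with_credential_retry` and its parameters are hypothetical and do not exist in SkyPilot. The sketch also assumes the client should be re-created on each attempt, and it does not account for any client caching in the adaptor.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar('T')


def call_with_credential_retry(
        operation: Callable[[], T],
        retry_on: Tuple[Type[BaseException], ...],
        max_attempts: int = 5,
        backoff_seconds: float = 1.0,
        sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `operation`, retrying when it raises one of `retry_on`.

    Hypothetical helper: the retry wraps the API call itself, because
    that is where missing credentials are detected.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: surface the original error.
                raise
            # Linear backoff to give the rotation time to finish.
            sleep(backoff_seconds * attempt)
    raise ValueError('max_attempts must be at least 1')


# Example usage (assumes botocore is installed and `aws` is
# sky.adaptors.aws; the client is created inside the retried call):
#
#   from botocore.exceptions import NoCredentialsError
#   identity = call_with_credential_retry(
#       lambda: aws.client('sts').get_caller_identity(),
#       retry_on=(NoCredentialsError,))
```

The `sleep` parameter is only there so the backoff can be replaced in tests; callers would normally leave it at the default.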

Version & Commit info: