During AWS credential rotation, the AWS client may be unable to find the credentials, and our retry logic does not help, because the client only checks for credentials when an operation is called, not when the client is created.
>>> aws.client('ec2', region_name='us-east-1')
<botocore.client.EC2 object at 0x7f1eaa79e920>
>>> sts = aws.client('sts')
>>> sts.get_caller_identity()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/botocore/client.py", line 565, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/botocore/client.py", line 1001, in _make_api_call
    http, parsed_response = self._make_request(
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/botocore/client.py", line 1027, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/botocore/endpoint.py", line 119, in make_request
    return self._send_request(request_dict, operation_model)
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/botocore/endpoint.py", line 198, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/botocore/endpoint.py", line 134, in create_request
    self._event_emitter.emit(
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/botocore/hooks.py", line 412, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/botocore/hooks.py", line 256, in emit
    return self._emit(event_name, kwargs)
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/botocore/hooks.py", line 239, in _emit
    response = handler(**kwargs)
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/botocore/signers.py", line 105, in handler
    return self.sign(operation_name, request)
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/botocore/signers.py", line 199, in sign
    auth.add_auth(request)
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/botocore/auth.py", line 418, in add_auth
    raise NoCredentialsError()
botocore.exceptions.NoCredentialsError: Unable to locate credentials
This can cause the jobs controller to fail to refresh or terminate clusters.
The retry logic in question:
https://github.com/skypilot-org/skypilot/blob/654ed4a2b7623693be5c2a1b9cfc44348f275463/sky/adaptors/aws.py#L99-L119
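One possible direction, as a rough sketch rather than a proposed patch: since the traceback shows `NoCredentialsError` being raised when the request is signed, a retry would need to wrap the API call itself. The helper name `call_with_credential_retry` and its parameters are hypothetical and do not exist in SkyPilot. The fallback exception class is only there so the snippet runs without botocore installed.

```python
import time
from typing import Callable, TypeVar

try:
    # The exception type raised in the traceback above.
    from botocore.exceptions import NoCredentialsError
except ImportError:
    # Stand-in so this sketch runs without botocore installed.
    class NoCredentialsError(Exception):
        """Placeholder for botocore.exceptions.NoCredentialsError."""

T = TypeVar('T')


def call_with_credential_retry(operation: Callable[[], T],
                               max_attempts: int = 5,
                               initial_backoff_seconds: float = 1.0) -> T:
    """Hypothetical helper: retry an API call while credentials are missing.

    `operation` is a zero-argument callable that performs the actual API
    call, so the credential lookup happens inside the retry loop.
    """
    if max_attempts < 1:
        raise ValueError('max_attempts must be at least 1')
    backoff = initial_backoff_seconds
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except NoCredentialsError:
            if attempt == max_attempts:
                # Out of attempts: surface the original error.
                raise
            time.sleep(backoff)
            backoff *= 2  # exponential backoff between attempts


# Intended usage (assumes `sts` was created as in the session above):
#   identity = call_with_credential_retry(lambda: sts.get_caller_identity())
```

Whether a few seconds of backoff is enough to outlast a credential rotation is an assumption that would need checking against how the rotation is actually performed.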
To reproduce, see the Python session above: creating the client succeeds, and the first API call raises NoCredentialsError.
Version & Commit info:
sky -v: PLEASE_FILL_IN
sky -c: PLEASE_FILL_IN