uswitch / kiam

Integrate AWS IAM with Kubernetes
Apache License 2.0

Failure to refresh credentials with botocore #270

Closed. Sytten closed this issue 5 years ago

Sytten commented 5 years ago

Hello! We have been using an EKS cluster with kiam for some time now and it's been working fine (we use the strategy of tagging some hosts as masters to run the server part of kiam). Now I have a Python application that uses Kinesis, and I keep hitting this weird credential refresh failure:

botocore.exceptions.CredentialRetrievalError: Error when retrieving credentials from iam-role: Credential refresh failed, response did not contain: access_key, secret_key, token, expiry_time

I followed https://github.com/uswitch/kiam/issues/96#issuecomment-417315210 and increased the timeouts and retries, but I still hit those errors and am short on ideas on how to solve this problem. I enabled debug logging on kiam and I see that this new program is quite spammy, sometimes producing multiple messages per second like:

kiam-server-r6pdx kiam-server {"level":"info","msg":"found role","pod.iam.role":"my-service","pod.ip":"10.0.0.219","time":"2019-07-19T20:13:30Z"}

Maybe this indicates that botocore is trying to refresh creds all the time and overloads the kiam server?

Sytten commented 5 years ago

Update: The application seems to request new credentials on every request to AWS. Digging deeper into botocore, I found that `_advisory_refresh_timeout` is set to 15m, which is the same as the default timeout set for kiam. This tells me that the errors are probably due to the server getting too many requests from my service and starting to lag behind. I am not sure whether I should increase the timeout in kiam or try to decrease the refresh timeout in botocore. I think I am on the right track though.
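To illustrate the check described above, here is a minimal sketch of an "advisory refresh" comparison. It is not botocore's actual code: the function name `refresh_needed` and the sample timestamps are assumptions made for illustration, and only the 15-minute window comes from the comment above.

```python
from datetime import datetime, timedelta, timezone

# 15 minutes: the advisory refresh window mentioned in the comment above.
ADVISORY_REFRESH_WINDOW = timedelta(minutes=15)


def refresh_needed(expiry_time, now, window=ADVISORY_REFRESH_WINDOW):
    """Return True when the credentials expire within the refresh window.

    Hypothetical helper: it mimics the idea of the check, not botocore's code.
    """
    return (expiry_time - now) < window


# Illustrative timestamps: credentials issued, then used one second later.
issued = datetime(2019, 7, 19, 20, 13, 30, tzinfo=timezone.utc)
checked = issued + timedelta(seconds=1)

short_lived = issued + timedelta(minutes=15)  # 15-minute credentials
long_lived = issued + timedelta(hours=1)      # 1-hour credentials

print(refresh_needed(short_lived, checked))  # True: already inside the window
print(refresh_needed(long_lived, checked))   # False: 59m59s left
```

Under these assumptions, credentials that live only 15 minutes are inside the refresh window from the first call onward, which would explain a credential request on every AWS call.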

Sytten commented 5 years ago

For reference, I solved the issue by increasing session-refresh to 20 minutes and session-duration to 1 hour. The default behavior of botocore is to try to refresh the credentials 15 minutes before they expire. Because the credentials last 15 minutes by default in kiam, the service was asking for new credentials on every request to AWS and overloaded the agent/server to the point where some requests would time out, causing the application to crash.

Now with these new values, the application works as expected and only makes one request per hour. Increasing session-refresh is also important: leaving it at the 5-minute default would cause the same issue as before for 10 minutes (T-15m to T-5m), because the kiam server would keep returning the same set of credentials while botocore would consider them "almost expired" and try to renew them on every request.
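The interaction between the two kiam settings and the client's 15-minute window can be checked with a rough simulation. This is a simplified model, not kiam or botocore code: it assumes one AWS call per minute, a server that keeps returning its cached credentials until they are within session-refresh of expiry, and a client that asks for credentials whenever fewer than 15 minutes remain. The function name `count_fetches` is hypothetical.

```python
def count_fetches(duration_min, refresh_min, total_min, advisory_min=15):
    """Count credential requests a client sends over `total_min` minutes.

    Simplified model (assumption): one AWS call per minute; the server
    replaces its cached credentials only when they are within
    `refresh_min` minutes of expiry.
    """
    fetches = 0
    server_expiry = None  # expiry of the credentials cached by the server
    client_expiry = None  # expiry of the credentials held by the client
    for t in range(total_min):
        # Client side: refresh when inside the advisory window.
        if client_expiry is None or client_expiry - t < advisory_min:
            fetches += 1
            # Server side: issue new credentials only near expiry.
            if server_expiry is None or server_expiry - t < refresh_min:
                server_expiry = t + duration_min
            client_expiry = server_expiry
    return fetches


# Two hours of traffic at one AWS call per minute.
print(count_fetches(15, 5, 120))   # 120: defaults, a fetch on every call
print(count_fetches(60, 5, 120))   # 23: bursts during the T-15m to T-5m gap
print(count_fetches(60, 20, 120))  # 3: the settings described above
```

In this model the third configuration fetches roughly once per 46 minutes rather than exactly once per hour, because the client starts asking 15 minutes before expiry and the server then hands out fresh credentials immediately.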

I still suggest following the advice in https://github.com/uswitch/kiam/issues/96#issuecomment-417315210 and increasing at least the retry attempts to 5, and maybe increasing the timeout too, to avoid problems even if kiam has a hiccup. I still have an issue when upgrading kiam where the service is likely to die, but it doesn't happen too often so I can live with it.
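As a sketch of what raising the retries and timeout could look like on the client side, assuming the linked advice refers to botocore's metadata-service settings: `AWS_METADATA_SERVICE_NUM_ATTEMPTS` and `AWS_METADATA_SERVICE_TIMEOUT` are environment variables that botocore reads, while the timeout value of 5 seconds is only an illustrative choice, not a value from this thread.

```python
import os

# Set these before the first boto3/botocore session is created, for
# example at the top of the application's entry point or in the pod spec.
os.environ["AWS_METADATA_SERVICE_NUM_ATTEMPTS"] = "5"  # retry attempts, as suggested above
os.environ["AWS_METADATA_SERVICE_TIMEOUT"] = "5"       # seconds; illustrative value

# Any boto3 session created after this point would pick the values up
# when it fetches credentials from the metadata endpoint.
print(os.environ["AWS_METADATA_SERVICE_NUM_ATTEMPTS"])
print(os.environ["AWS_METADATA_SERVICE_TIMEOUT"])
```

Setting the same variables in the container's environment avoids depending on import order inside the application.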

Since I am probably not the only one who has faced this issue, I strongly suggest that the devs increase the defaults to something higher @pingles @Joseph-Irving

prajwalgowda-uc commented 1 year ago

@Sytten I am facing the same issue. I tried to set those variables like this:

```python
import boto3
import botocore.session

custom_session = botocore.session.Session()
custom_session.set_config_variable('session-duration', 3600)
custom_session.set_config_variable('session-refresh', 1200)

session = boto3.Session(botocore_session=custom_session)
s3_client = session.client(service_name='s3')
```

It didn't work. Can you tell me how you solved the issue?