Don't remove credentials during temporary issues

jenkinsci / aws-secrets-manager-credentials-provider-plugin

AWS Secrets Manager Credentials Provider for Jenkins

https://plugins.jenkins.io/aws-secrets-manager-credentials-provider/

MIT License


Don't remove credentials during temporary issues #318

Open LeonPatmore opened 3 months ago

LeonPatmore commented 3 months ago

What feature do you want to see added?

Hello, I am using this Jenkins plugin to sync secrets from Secrets Manager. Sometimes we get a temporary error when trying to sync the secrets, such as:

WARNING
i.j.p.c.s.AwsCredentialsProvider#getCredentials: Could not list credentials in Secrets Manager: message=[Rate exceeded (Service: AWSSecretsManager; Status Code: 400; Error Code: ThrottlingException; Proxy: null)]

When this happens, it seems that the secrets that should come from Secrets Manager are no longer accessible to our jobs. They fail with:

ERROR: Could not find credentials entry with ID `<secret>`

Would it be possible to keep the cached secrets during a failed refresh event (assuming the refresh failed due to a temporary issue)? This way, temporary issues would not impact our jobs.
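To make the request concrete, here is a minimal sketch of the "keep the last good value on a temporary failure" behaviour. It is not the plugin's code: `StaleOnErrorCache`, `TransientError`, and the `loader` callable are hypothetical names used only for illustration, and the 300-second default is just a placeholder TTL.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure such as ThrottlingException."""


class StaleOnErrorCache:
    """Caches a loaded value for a TTL and keeps the last good value on transient failures."""

    def __init__(self, loader, ttl_seconds=300.0, clock=time.monotonic):
        self._loader = loader        # hypothetical: callable that fetches the secret list
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the behaviour can be tested
        self._value = None
        self._loaded_at = None       # None means nothing has been loaded yet

    def get(self):
        now = self._clock()
        if self._loaded_at is not None and now - self._loaded_at < self._ttl:
            return self._value       # still fresh, no call needed
        try:
            self._value = self._loader()
            self._loaded_at = now
        except TransientError:
            if self._loaded_at is None:
                raise                # nothing cached yet, so the error must surface
            # Otherwise keep serving the stale value; the next call retries the loader.
        return self._value
```

With something like this, a throttled refresh would leave the previously listed credentials visible to jobs, and the next lookup would simply try the refresh again.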

Cheers

Upstream changes

No response

Are you interested in contributing this feature?

No response

chriskilding commented 3 months ago

Hi Leon, a couple of details about how the caching works:

The list of credential names is looked up once, when a job first wants to get a credential, and then cached for 5 minutes. (Effectively the ListSecrets call in getCredentials is cached.)
The credential value itself is always looked up live (this is the GetSecretValue call), because the sensitive credential material must never be cached.
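As a rough illustration of the two points above (names cached for 5 minutes, values always fetched live), the lookup could be sketched as follows. This is only a model of the behaviour described here, not the plugin's implementation: `CredentialLookup`, `list_names`, and `get_value` are hypothetical names standing in for the ListSecrets and GetSecretValue calls.

```python
import time

CACHE_TTL_SECONDS = 5 * 60  # the 5-minute window mentioned above


class CredentialLookup:
    """Caches the set of credential names, but never the credential values."""

    def __init__(self, list_names, get_value, clock=time.monotonic):
        self._list_names = list_names  # hypothetical stand-in for ListSecrets
        self._get_value = get_value    # hypothetical stand-in for GetSecretValue
        self._clock = clock
        self._names = None
        self._expires_at = 0.0

    def names(self):
        now = self._clock()
        if self._names is None or now >= self._expires_at:
            # Only this branch performs the (rate-limited) listing call.
            self._names = frozenset(self._list_names())
            self._expires_at = now + CACHE_TTL_SECONDS
        return self._names

    def resolve(self, credential_id):
        if credential_id not in self.names():
            raise LookupError(f"no credential named {credential_id!r}")
        # The sensitive value is fetched on every call and never stored.
        return self._get_value(credential_id)
```

In this model, many jobs resolving credentials within the same 5 minutes trigger one listing call but one value lookup each.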

From what you posted it looks like you encountered the AWS API error when the list of credential names was fetched. Since there is already a caching strategy for that part, the only thing I can think of is that you should check that the cache has not been turned off in your plugin configuration.

LeonPatmore commented 2 months ago

Understood, thanks for the reply. Out of interest @chriskilding, do you have any recommendations for dealing with high-throughput jobs that rely on secrets? The issue is that we have quite a few jobs running multiple times a minute, and this is putting pressure on the AWS Secrets Manager limits.

The only thing I can think of right now is to manually copy the secret into a local Jenkins secret. In an ideal world we would have something that caches the value of the secret so that we can avoid many lookups.

chriskilding commented 2 months ago

Hi Leon,

One thing that may help is that, because AWS appreciate that caching secret values is unwise, they have a much higher rate limit for the GetSecretValue call compared to the ListSecrets call.

From this guide: https://docs.aws.amazon.com/secretsmanager/latest/userguide/reference_limits.html we see the following rate limits:

GetSecretValue + DescribeSecret combined: 10,000 per second in each supported Region

ListSecrets: 100 per second in each supported Region

Based on these values, I'd say that if you are hitting the ListSecrets rate limit, you're either:

running the plugin with the cache turned off - in which case please turn it on!
running many, many Jenkins clusters, or at least Jenkins controllers (since only the Jenkins controller node performs the ListSecrets call)
running Jenkins in an AWS account where a lot of other services or scripts are also calling ListSecrets at the same time (so its 'noisy neighbours' are crowding it out)
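A back-of-the-envelope calculation shows why the cached case should sit far below the ListSecrets limit. It is only a sketch under stated assumptions: each controller refreshes its list once per 5-minute cache window, and `pages_per_listing` is a hypothetical parameter for how many paginated ListSecrets requests one full listing takes.

```python
LIST_SECRETS_LIMIT_PER_SECOND = 100  # ListSecrets limit quoted above
CACHE_TTL_SECONDS = 5 * 60           # the plugin's 5-minute name cache


def list_calls_per_second(controllers, pages_per_listing=1,
                          cache_ttl_seconds=CACHE_TTL_SECONDS):
    """Average ListSecrets request rate if every controller refreshes once per TTL."""
    return controllers * pages_per_listing / cache_ttl_seconds


def controllers_to_reach_limit(pages_per_listing=1,
                               cache_ttl_seconds=CACHE_TTL_SECONDS):
    """Number of controllers whose combined average rate would equal the limit."""
    return LIST_SECRETS_LIMIT_PER_SECOND * cache_ttl_seconds / pages_per_listing


# One controller, one page per listing: about 0.0033 requests per second.
single_controller_rate = list_calls_per_second(1)

# Under the same assumptions, 30,000 controllers would be needed to average 100 per second.
controllers_needed = controllers_to_reach_limit()
```

These are averages; bursts, a disabled cache, or other callers in the same account and Region would change the picture, which is what the three cases above describe.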

Would you be able to share some details about how you're running Jenkins, just to see if there's anything else we can do?