kubernetes-sigs / secrets-store-csi-driver

Secrets Store CSI driver for Kubernetes secrets - Integrates secrets stores with Kubernetes via a CSI volume.
https://secrets-store-csi-driver.sigs.k8s.io/
Apache License 2.0
1.26k stars 293 forks

KeyVault 429 TooManyRequests led to infinite loop in reconciler workqueue #1483

Open JasonXD-CS opened 6 months ago

JasonXD-CS commented 6 months ago

What steps did you take and what happened: We have an Azure Key Vault whose average request rate is below the Key Vault throttling limit, and we recently ran into an outage after we started using the CSI driver with auto rotation at a 3-hour interval. A regular scale-up at peak hours triggered Key Vault throttling, and the throttling continued for hours until we disabled auto rotation. As a result, no new pods could start because they were stuck in the ContainerCreating state, and we had to revert to KeyVaultAgent.

After investigation we discovered a few issues with the reconciler implementation:

  1. The current auto rotation design is inefficient and does not scale because it rotates secrets per pod, which makes a lot of unnecessary requests. For example, if a deployment runs 1000 replicas and each pod mounts 10 secrets, rotation makes 10,000 requests per interval instead of the 10 that per-deployment rotation would need.

  2. According to this post Understanding how many calls are made to KeyVault?

    "In case of rotation, the list of all existing pods is generated and the contents in the mount are rotated linearly. A workqueue is created in the driver, all 10k pods added to queue and then the queue is processed"

However, the workqueue does not handle 429 responses and has no exponential backoff.
https://github.com/kubernetes-sigs/secrets-store-csi-driver/blob/bf86dbf98ad3e32a0f55e52d6a411abd6784f7fb/pkg/rotation/reconciler.go#L401-L405 Each task in the queue knows nothing about 429 throttling, so the queue just keeps processing requests without backoff. That gives Key Vault no time to recover, because the requests keep coming. This is amplified when thousands of nodes are pulling from the same Key Vault.

  3. A failed request is requeued after a fixed 10-second delay with no maxRequeueLimit. That creates an infinite loop if Key Vault cannot recover from throttling: https://github.com/kubernetes-sigs/secrets-store-csi-driver/blob/bf86dbf98ad3e32a0f55e52d6a411abd6784f7fb/pkg/rotation/reconciler.go#L242C1-L243C1

What did you expect to happen:

  1. Key Vault throttling (429 TooManyRequests) should be properly handled in the reconciler.
  2. Rotation request processing should use exponential backoff.

Anything else you would like to add: Regarding point 1, the current design is not scalable. Would love to hear from the team what the plan is for optimizing this going forward.

Also, does the polling interval start counting only after all rotation requests are processed, or as soon as List SPCPC is invoked? If it starts as soon as List SPCPC is invoked and the tasks are added to the workqueue, does that mean a new iteration adds more rotation tasks even though the previous iteration hasn't finished them, so the queue just keeps piling up?
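On the pile-up question: if the driver's workqueue behaves like client-go's, a key that is already waiting in the queue is deduplicated on Add, which bounds the backlog per pod even when iterations overlap. A minimal set-backed queue illustrates the semantics (illustrative sketch only, not the driver's actual queue):

```go
package main

import "fmt"

// dedupQueue mimics the deduplication semantics of client-go's workqueue:
// adding a key that is already waiting is a no-op, so repeated rotation
// iterations cannot grow the queue without bound for the same pod.
type dedupQueue struct {
	items   []string
	present map[string]bool
}

func newDedupQueue() *dedupQueue {
	return &dedupQueue{present: map[string]bool{}}
}

// Add enqueues key unless it is already waiting.
func (q *dedupQueue) Add(key string) {
	if q.present[key] {
		return // already queued; drop the duplicate
	}
	q.present[key] = true
	q.items = append(q.items, key)
}

func (q *dedupQueue) Len() int { return len(q.items) }

func main() {
	q := newDedupQueue()
	// Two overlapping rotation iterations enqueue the same three pods.
	for i := 0; i < 2; i++ {
		for _, pod := range []string{"pod-a", "pod-b", "pod-c"} {
			q.Add(pod)
		}
	}
	fmt.Println(q.Len()) // stays 3, not 6
}
```

Even with deduplication, though, a key that is popped, fails, and is requeued still loops as described in point 3, so dedup alone does not solve the throttling problem.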

Which provider are you using: Azure Key Vault.

Environment:

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  * After 90d of inactivity, lifecycle/stale is applied
  * After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  * After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  * Mark this issue as fresh with /remove-lifecycle stale
  * Close this issue with /close
  * Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Gil-Tohar-Forter commented 3 months ago

Can this be prioritized? It doesn't make sense that a request is made to the vault for each pod; it should really be one per deployment, since all the pods for a given deployment use the same secret.

With large-scale services, secret rotation becomes an unusable feature for fear of 429s from the secrets provider, or of simply crashing it.

aramase commented 3 months ago

/remove-lifecycle stale

hakman commented 1 month ago

/lifecycle frozen