hashicorp / vault-secrets-operator

The Vault Secrets Operator (VSO) allows Pods to consume Vault secrets natively from Kubernetes Secrets.
https://hashicorp.com

Do not Trigger RolloutRestart with every secret change #644

Open koolhandluke opened 7 months ago

koolhandluke commented 7 months ago

RolloutRestart is a great feature of the operator. However, the current implementation can cause outages or service degradation if several secrets are rotated for a given application within a short interval. Each rotation triggers its own rolling restart, setting off a series of "thrashing" restarts.

Could we add a feature to batch restarts for a given target? This would ensure that only one rolling restart is executed within a configured period, eliminating unnecessary pod restarts.

E.g., the following would perform at most one rolling restart per hour for the "vso-db-demo" deployment:

rolloutRestartTargets:

benashz commented 7 months ago

@koolhandluke - that's a great suggestion. We have had some internal discussion around its implementation, but currently have no concrete commitment to add it.

koolhandluke commented 7 months ago

Hi @benashz - thanks. I was thinking one approach would be to use some kind of cache.

The cache period would be the batch deploy interval.

It could work like so:

  1. Secret 1 is updated for deployment A. HandleRolloutRestarts() is triggered. Instead of triggering the restart for the target, it puts the job in the cache.
  2. Secret 2 is updated for deployment A. HandleRolloutRestarts() is triggered. It finds the restart entry for that same target in the cache and exits.
  3. When the entry for deployment A expires after the set interval, the RolloutRestart is executed.

Bonus points for doing it within a given deployment window (not true CD, but that is the world some folks live in).

thiago-juro commented 7 months ago

I am also interested in this feature. In our case, we have a couple of micro-services fetching the same secret. Changing the secret triggers a restart of all of these micro-services at once, which can cause downtime.