Do not Trigger RolloutRestart with every secret change

hashicorp / vault-secrets-operator

The Vault Secrets Operator (VSO) allows Pods to consume Vault secrets natively from Kubernetes Secrets.

https://hashicorp.com


Do not Trigger RolloutRestart with every secret change #644

Open koolhandluke opened 7 months ago

koolhandluke commented 7 months ago

The RolloutRestart feature is a great feature of the operator. However, the current implementation can cause outages or service degradation if several secrets are rotated for a given application within a short interval. This would set off a series of "thrashing" restarts.

Could we add a feature to batch restarts for a given target? That would ensure at most one rolling restart is executed within a given period, eliminating unnecessary pod restarts.

E.g., the following would allow at most one rolling restart per hour for the "vso-db-demo" deployment:

    rolloutRestartTargets:
      - kind: Deployment
        name: vso-db-demo
        deployInterval: "60m"
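To make the proposed setting concrete, here is a minimal Go sketch of how an interval string such as "60m" might be validated. The RestartTarget type, its DeployInterval field, and the Interval method are hypothetical names used for illustration only; deployInterval is the field proposed in this issue, not part of the operator's current API.

```go
package main

import (
	"fmt"
	"time"
)

// RestartTarget loosely mirrors one rolloutRestartTargets entry.
// DeployInterval is the HYPOTHETICAL field proposed in this issue.
type RestartTarget struct {
	Kind           string
	Name           string
	DeployInterval string // e.g. "60m"; empty means no batching
}

// Interval parses DeployInterval into a time.Duration.
// An empty value yields 0 (restart on every change, as today).
func (rt RestartTarget) Interval() (time.Duration, error) {
	if rt.DeployInterval == "" {
		return 0, nil
	}
	d, err := time.ParseDuration(rt.DeployInterval)
	if err != nil {
		return 0, fmt.Errorf("invalid deployInterval %q for %s/%s: %w",
			rt.DeployInterval, rt.Kind, rt.Name, err)
	}
	if d < 0 {
		return 0, fmt.Errorf("deployInterval %q for %s/%s must not be negative",
			rt.DeployInterval, rt.Kind, rt.Name)
	}
	return d, nil
}
```

With this sketch, a target declared with deployInterval "60m" would resolve to a one-hour window, and a malformed value would be rejected up front rather than silently disabling batching.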

benashz commented 7 months ago

@koolhandluke - that's a great suggestion. We have had some internal discussion around its implementation, but currently have no concrete commitment to add it.

koolhandluke commented 7 months ago

Hi @benashz, thanks. I was thinking one approach would be to use some kind of cache.

The cache period would be the batch deploy interval.

It could work like so:

Secret 1 is updated for deployment A and HandleRolloutRestarts() is triggered. Instead of restarting the target immediately, it puts the restart job in the cache.
Secret 2 is updated for deployment A and HandleRolloutRestarts() is triggered. It finds the restart entry for that same target already in the cache and exits.
When the entry for deployment A expires after the set interval, the RolloutRestart is executed.
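The flow above could be sketched roughly as follows in Go. This is only an illustration of the coalescing idea, not the operator's code: RestartBatcher, NewRestartBatcher, Request, and the restart callback are hypothetical names, and the callback stands in for whatever actually performs the rollout restart.

```go
package main

import (
	"sync"
	"time"
)

// RestartBatcher coalesces restart requests per target (HYPOTHETICAL type).
// The first request for a target schedules one restart after interval;
// further requests for that target within the window are absorbed.
type RestartBatcher struct {
	mu       sync.Mutex
	interval time.Duration
	pending  map[string]struct{} // targets with a restart already scheduled
	restart  func(target string) // stand-in for the real rollout restart
}

// NewRestartBatcher builds a batcher with the given window and callback.
func NewRestartBatcher(interval time.Duration, restart func(string)) *RestartBatcher {
	return &RestartBatcher{
		interval: interval,
		pending:  make(map[string]struct{}),
		restart:  restart,
	}
}

// Request records that target needs a restart. It returns true if this
// call scheduled a new restart, false if one was already pending.
func (b *RestartBatcher) Request(target string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if _, ok := b.pending[target]; ok {
		return false // entry found in the cache: nothing more to do
	}
	b.pending[target] = struct{}{}
	time.AfterFunc(b.interval, func() {
		// The entry expires: clear it first, then run the single restart.
		b.mu.Lock()
		delete(b.pending, target)
		b.mu.Unlock()
		b.restart(target)
	})
	return true
}
```

One caveat with this approach: the pending entries live only in memory, so a restart that is scheduled but not yet executed would be lost if the operator itself restarts during the window. A real implementation would probably need to persist or re-derive that state.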

Bonus points for doing it within a given deployment window (not true CD, but that is the world some folks live in).

thiago-juro commented 7 months ago

I am also interested in this feature. In our case, we have a couple of microservices fetching the same secret. Changing the secret triggers a restart of all of them at once, which can cause downtime.