spiffe / spire

The SPIFFE Runtime Environment
https://spiffe.io
Apache License 2.0
1.77k stars 467 forks source link

Entry Update and Event Creation Can Possibly Drift. Consider periodic full-db reconciliation of cache #4973

Open stevend-uber opened 5 months ago

stevend-uber commented 5 months ago

In the event that there occurs drift between in-mem event-based cache and entries/agents tables, we should consider a configurable periodic method that does a full reconciliation of the in-mem cache with the db.

faisal-memon commented 5 months ago

I believe kubernetes does something similar to this with reconciling controllers

evan2645 commented 5 months ago

Thanks for opening this @stevend-uber, definitely sounds like there's something we need to fix here. Are you able to share more information about what you bumped into, if anything?

stevend-uber commented 5 months ago

In the 1.8.7 release of the events-based cache, we saw that the pruned entries weren't being updated in the cache. And although that is a "one time temporary bug", cache drift can occur for any number of reasons. Defensively, a configurable reconciliation loop should prevent such from drifting too much.

faisal-memon commented 5 months ago

Please assign this to me.

edwbuck commented 5 months ago

@stevend-uber in 1.8.7 release, there were bugs in the events-based cache, because the implementation never considered the mix of enabled/non-enabled systems that Uber ran. Those items were quickly fixed, prior to the 1.9.0 release.

Do you have evidence of a specific failure since those fixes were applied? It seems to me that doing this "just in case" an event is missed falls into the same category as doing this "just in case" you missed reading a vanilla registration entry, as they are both read from the same database.

If there is a know failure, we need to capture it and fix it. If there is no known failure, then it is not clear why this code would be desired.

To clarify further. If no failure scenario is observed, this could potentially induce one, for the systems that cannot easily reconcile their entire database into hydration within one time slice of cache updates. If we offer this feature, please consider having it disabled by default.

rturner3 commented 5 months ago

Just to share some more context, Uber has seen a bug with the event cache feature for a very small percentage of entries. That issue motivated this suggestion. Reported this bug as a separate issue: https://github.com/spiffe/spire/issues/5021

edwbuck commented 5 months ago

mentioning #4985 as this issue may be blocker. If fixes to #5021 likely would make this effort unneeded, please indicate it here.

edwbuck commented 4 months ago

@rturner3 I believe that the fix under issue #5021 closes the means by which the cache and the database drifted at Uber. @faisal-memon has a branch that he's about to make into a merge request, it only has one small item he was working on when we talked last.

Please consider what you would like to do with this Issue if #5021 covers the cache synchronization errors at Uber.