Proposal for an alternative algorithm for events-based cache event tracking

spiffe / spire

The SPIFFE Runtime Environment

https://spiffe.io

Apache License 2.0

1.81k stars 478 forks source link

Proposal for an alternative algorithm for events-based cache event tracking #5624

Open azdagron opened 2 weeks ago

azdagron commented 2 weeks ago

The experimental events-based cache relies on tracking change events for registration entries and attested nodes. The current algorithm relies on the monotonically increasing property of the event ID. Previous assumptions around arrival order of events and event ID increment stepping have contributed to a series of fixes to the current algorithm that have complicated the codebase (e.g. skipped event discovery and tracking) and contributed to pessimistic database query load, when one of the primary goals for the events-based cache is to reduce database load in the steady state. Organizations whose deployments tickle the conditions above are not realizing as much benefit from the events-based cache as should be possible.

I propose the following algorithm change to the events based cache:

SPIRE Server retains a list of events named processed_events, which is initially empty.
Every cache poll cycle:
1. all events over the last N minutes are retrieved, based on the event CreatedAt column, into a list recent_events
2. a diff is performed against processed_events and recent_events
  - events in recent_events but not inprocessed_events (i.e. new events) are processed and added to processed_events
  - events in processed_events but not in recent_events (i.e. old events) are removed from processed_events
  - events in both processed_events and recent_events (i.e. already processed events) are ignored

We can handle failures to talk to the database in a given poll cycle by either:

polling for max(N, last time we successfully polled), or
rehydrating the entire cache if it has been longer than some threshold since the last poll

This algorithm assumes a few things:

Number of events in the last N minutes is manageably sized: Seems reasonable if we keep N small, e.g. 3 minutes
No mutating transaction lives longer than N minutes: As far as I know, the only mutating operation that has the potential for a long-lived transaction is that which prunes registration entries, which executes a single statement to delete all registration entries older than some threshold. When the registration entry set is large, this operation can take some time. We can likely reduce the per-transaction time by doing the deletion in small batches and more frequently.
Event lookup by timestamp is efficient: The created_at column is not currently indexed, but can be.

PROS:

No more coupling with the event ID values. The value is now a simple lookup and only has to be unique across the N minute window.
Reduced database load for deployments that currently tickle large amounts skipped events in the current algorithm.
Reaches the goal of low-to-no database load when there is no churn over registration entries/agent
Code is much simpler.

CONS:

Database load is not immediately reduced when there is registration entry/agent churn since we'll still poll for events for the N minute window, but will fully reduce after N minutes.

I have prototyped the above algorithm and so far have not witnessed split brain in the cache though more testing is required.

amoore877 commented 2 weeks ago

I think it'd be good to get some perf numbers between the past, current, and proposed solution.

Is this something you need spire adopters in the community or in the wild to help collect data on?

I have prototyped the above algorithm and so far have not witnessed split brain in the cache though more testing is required.

Besides perf, reliability was also my biggest concern reading the proposal, so it's encouraging that so far signs are good :) If moving forward we should ensure having several unit and integration tests to try to tease out edge cases where locking is or is not saving us from split brain or other out-of-sync scenarios.

Code is much simpler.

My only higher preference than simple code is no code at all :) Definitely a worthy goal to strive to.

azdagron commented 2 weeks ago

Is this something you need spire adopters in the community or in the wild to help collect data on?

Absolutely. I do have a local benchmark set up to evaluate reliability and perf on a simulated load (e.g. something that churns entries at some rate and occasionally pauses and asserts that the cache in two or more SPIRE servers are in sync). Real life numbers are even better.

Outside of perf, it would be invaluable to know whether or not the first two of three assumptions hold in your deployment. If you have any of the following metrics, it would be super helpful:

Anything surrounding rate of change for registration entries/agents across a typical 24 hour period to help us understand how many events we'd be polling every poll cycle. For example, if you had a evenly distributed rate of 10 registration entries changed per second (e.g. churn of 800k workloads a day), that would be roughly 3000 events over a 5 minute period. Outliers are of particular interest to understand the maximum.
How long transactions take to commit in your database (for all but the prune transaction, this could be synthesized via RPC metrics). Outliers are again of particular interest to determine a reasonable value for N.

edwbuck commented 2 weeks ago

This algorithm has a number of issues:

The controller manager routinely deals with Kubernetes systems that have large numbers of Pods that appear and disappear. This means that systems that use it are routinely seeing groups of Pods arrive and shutdown. This is especially so in data processing, where Apache Spark has been moved off of traditional YARN frameworks into Kubernetes Pods.

We cannot assure that transactions live under N minutes, when N is a number less that the transaction time out length.
Using timestamps seems like a step forward, but they were originally proposed and abandoned, due to the additional complexity they impose.

A. Timestamps are not unique, they collide.

B. Timestamps are subject to the systems timezone settings.

C. Timestamps can be indexed, but they are not "more" or "less" indexed than a primary key.

D. Timestamps are captured at transaction creation time, whether using SQL functions like NOW() or variables like CURRENT_TIMESTAMP, so when manipulating a batch of records, simple construction tends to get identical timestamps.

E. Even if one uses "fancy" SQL to capture the row creation time, the timestamp is the time of the row being captured into the transaction data structure, not the time of the row being committed to the database. Additionally, row creation time is a SQL extension, and not available across the three databases we support.
1. While scanning a few minutes into the past would certainly capture short-lived transactions, it cannot capture long lived transactions that exceed whatever setting we use for N.
  
  A. This is of concern, as the entire reason for scanning non-detected Event Ids was due to the one failure scenario that created this effort, at Uber (at scale) there were records in the database that were not reflected in the event cache. I'm very grateful to Uber for doing the deep dive to discover this, as they attempted to determine why the cache lacked the settings necessary to issue that SVIDs involved.
B. Uber's findings were reported to Faisal and myself as a rare scenario. We were not certain how rare, but it was on the order of a handful of transactions per quarter.

C. The current climate among the maintainers insisted on a fully correct solution, the db event processing having suffered from "too many cooks in the kitchen" style design, leading to prior instability issues which didn't foster confidence in the current solution.

D. Uber's discovery of entries in the database that didn't exist in the cache strongly implies that the deletion transaction wasn't the transaction that ran late. In their scenario, the process that wasn't receiving SVIDs existed, so it's entries wouldn't have been part of a deletion transaction.

I don't think that the current system is complete, and there's been a lot of questioning that has delayed the work deploying it. However, all the problems are performance based, and two solutions should fix this:

Implementing some form of back off polling. It really doesn't matter in what form, as long as every item is not polled every polling cycle. My prior arguments for the previously rejected solution at hand was that is was available. It was being compared to solutions that didn't yet fully exist.
Implementing some form of batch querying. Gorm supports this, and it would reduce the processing by having a single query handle 200 or more items simultaneously without opening 200 different querying contexts.

However, if the maintainers wish to discard the currently observed to be error free solution for a new one, it's the maintainer's call. However, this does seem a bit reactionary, as if it were directly in response to not accepting one of many back off algorithms. I'd rather see a sub-optimal but functional back off algorithm accepted over a new means of scanning the database that clearly would miss records that arrive after a few minute delay.

For a volunteer effort, this also seems like re-inventing the solution by rewriting from scratch, an approach that is known to have costs and risks that often are under estimated. In many ways, this rolls back the clock on the current solution nearly six months, when these ideas were deemed insufficient to actually capture all changes.

edwbuck commented 2 weeks ago

* How long transactions take to commit in your database (for all but the prune transaction, this could be synthesized via RPC metrics). _Outliers are again of particular interest to determine a reasonable value for `N`._

There's no perfect answer to this, as we have very few observed items in the field that led to the failure this system was designed to fix. Or, it is equally possible that people didn't inform me of details which were the input into the current design. So, N in the current design is "the entire duration a transaction can exist". A practical approach might permit N to be much smaller, but a smaller N comes with increasing risks that the cache and the database get out of sync and remain out of sync for long periods of time (days, or longer).

amoore877 commented 2 weeks ago

Providing some numbers:

agreed with earlier comment that there may be a bit of difficulty around specificity in time-to-commit from an RPC level and how to handle that worst case.
- a tangential metric I can give is that replication to all read replicas is measured in millis. but this doesn't help for the longer transaction times.
- p99 DB transaction latency is sub-second. but again, there could be outliers.
typical day agent entry changes are so infrequent to not be considered in our case. however during a mass failover event we might see a 10k-20k burst.
registration entry changes in a typical day are <5k distributed evenly. however during a mass failover event we aim to handle a 200k-1M burst (which we recognize results in initially degraded latencies, but it is expected to eventually "even out")

azdagron commented 2 weeks ago

registration entry changes in a typical day are <5k distributed evenly. however during a mass failover event we aim to handle a 200k-1M burst (which we recognize results in initially degraded latencies, but it is expected to eventually "even out")

Hmm. I think that failover event blows up the first assumption "Number of events in the last N minutes is manageably sized" in a crippling way. In previous discussions I assumed that 1M registration changes were more or less distributed over a one day period (which is a reasonable ~3000 events every 5 minutes). If those registration changes are going to come in as fast as they can in the failover scenario, that number is no longer reasonable.

edwbuck commented 2 weeks ago

@amoore877 I see a lot of focus on p99 numbers. The main issue is that anything that's below p99 works, but the failures are not statistically going to be captured by p99.

A single long-lived database transaction will be enough to cause cache synchronization issues. If we focus on p99, then that means we only need 100 successful transactions to ignore the failing one, and our cache will still be out-of-sync.

For this reason, I don't see this as an area where statistical analysis takes precedence over a very old, traditional concept of "code correctness". We had a system where one event outside of the p99 can break the system, and in the past we realized that saying it worked 99% of the time didn't buy sympathy for the one time it led to a running server that didn't reflect its backing database.

That's why we treat every database change that might exist as something to consider until it cannot exit. It is an extreme point of view, but one that ensures that the cache will always be in-sync.

Should we stop polling the "might newly exist" changes at the same frequency we poll the entire database for new records, we can reduce the impact of polling the "might newly exist" records a lot. The previously rejected submission reduced it by 1/60th, but that was an arbitrary setting. While I argued extensively for that solution (it had a lot of nice features, which led to its approach), any solution that polls with less frequency would be sufficient.

With the proposed approach in this Issue, any transaction that managed to live past the setting for N would forever be missed, and the cache would be out of sync until one of the following occurred:

The server is manually restarted.
A full cache reload is scheduled.
A follow up change on the same record was recorded.

The first scenario is undesirable, because there's a lot that goes on in a server restart, including denying services to the other unaffected SVIDs.

The second scenario is undesirable, because nobody is going to make the call on what the frequency of "having the right data in the cache" is. Previous calls for this went unanswered. Late in our analysis, it was shared with me that Uber, the reporting entity, was updating their cache once a second. In many minds, any delays sill create failures on the Pod or process, leading to quick reloads. However, the faster the full reload, the more we will see the benefits of the db event approach disappear. At five seconds, it would be more efficient for the db event approach to simply not exist (but that also comes with 10,000 times more database CPU overhead).

The third scenario basically is indeterminable. One can't surmise when the record might be updated.

For this reason, I would feel more comfortable with a embarrassingly bad polling back off algorithm than re-writing the framework from scratch.

I put too much of my social capital in attempting to get the previously rejected back off algorithm accepted. I regret this, and to a degree, I partially see this solution as a wholesale call to disregard all lessons learned for a "simpler time".

Please do not be swayed by p99, because it means that after 100 cache updates, it's okay if your cache is out of sync once.

I recommend that we choose any back off algorithm, even one known to be embarrassingly bad, instead of rewriting the currently existing solution which (to my knowledge) has been running error free for multiple releases.

azdagron commented 2 weeks ago

FWIW, this failover event would generate 1M skipped events to track for 12 hours (current transaction timeout value). Even with the backoff algorithm that has not yet been accepted, that would be roughly 16K queries a second for 12 hours.

edwbuck commented 2 weeks ago

@azdagron I'm ok with polling once per hour, as long as there is one poll that is guaranteed to be beyond the limit of the transaction timeout. That said, I fully understand that the people who are going to be impacted really are the people that have the authority to set the limit.

Unfortunately, there's no limit I've ever received. For this reason, I created a polling system that could be configured to poll at any limit, in any manner (some recommended exponential backoff, some indicated that a maximum limit would be desired, and linear backoff seemed better suited to the problem).

For this reason the backoff system I previously proposed was configurable, which made it far more complex than a non-configurable system. The decision to make it configurable I now see as a mistake.

I recommend that we implement your previous priority queue backoff, which I don't see as optimal, but at least adding it to a correct polling approach is preferable to possibly creating a new approach that would miss database changes. Of course, your approach uses random numbers, making it very difficult to test deterministically, but even that's better than "let's start all over again" which would likely recreate the issues that led to the solution we currently have.