Open azdagron opened 2 weeks ago
I think it'd be good to get some perf numbers between the past, current, and proposed solution.
Is this something you need spire adopters in the community or in the wild to help collect data on?
I have prototyped the above algorithm and so far have not witnessed split brain in the cache though more testing is required.
Besides perf, reliability was also my biggest concern reading the proposal, so it's encouraging that so far signs are good :) If moving forward we should ensure having several unit and integration tests to try to tease out edge cases where locking is or is not saving us from split brain or other out-of-sync scenarios.
Code is much simpler.
My only higher preference than simple code is no code at all :) Definitely a worthy goal to strive to.
Is this something you need spire adopters in the community or in the wild to help collect data on?
Absolutely. I do have a local benchmark set up to evaluate reliability and perf on a simulated load (e.g. something that churns entries at some rate and occasionally pauses and asserts that the cache in two or more SPIRE servers are in sync). Real life numbers are even better.
Outside of perf, it would be invaluable to know whether or not the first two of three assumptions hold in your deployment. If you have any of the following metrics, it would be super helpful:
N
.This algorithm has a number of issues:
We cannot assure that transactions live under N minutes, when N is a number less that the transaction time out length.
Using timestamps seems like a step forward, but they were originally proposed and abandoned, due to the additional complexity they impose.
A. Timestamps are not unique, they collide.
B. Timestamps are subject to the systems timezone settings.
C. Timestamps can be indexed, but they are not "more" or "less" indexed than a primary key.
D. Timestamps are captured at transaction creation time, whether using SQL functions like NOW()
or variables like CURRENT_TIMESTAMP
, so when manipulating a batch of records, simple construction tends to get identical timestamps.
E. Even if one uses "fancy" SQL to capture the row creation time, the timestamp is the time of the row being captured into the transaction data structure, not the time of the row being committed to the database. Additionally, row creation time is a SQL extension, and not available across the three databases we support.
While scanning a few minutes into the past would certainly capture short-lived transactions, it cannot capture long lived transactions that exceed whatever setting we use for N.
A. This is of concern, as the entire reason for scanning non-detected Event Ids was due to the one failure scenario that created this effort, at Uber (at scale) there were records in the database that were not reflected in the event cache. I'm very grateful to Uber for doing the deep dive to discover this, as they attempted to determine why the cache lacked the settings necessary to issue that SVIDs involved.
B. Uber's findings were reported to Faisal and myself as a rare scenario. We were not certain how rare, but it was on the order of a handful of transactions per quarter.
C. The current climate among the maintainers insisted on a fully correct solution, the db event processing having suffered from "too many cooks in the kitchen" style design, leading to prior instability issues which didn't foster confidence in the current solution.
D. Uber's discovery of entries in the database that didn't exist in the cache strongly implies that the deletion transaction wasn't the transaction that ran late. In their scenario, the process that wasn't receiving SVIDs existed, so it's entries wouldn't have been part of a deletion transaction.
I don't think that the current system is complete, and there's been a lot of questioning that has delayed the work deploying it. However, all the problems are performance based, and two solutions should fix this:
Implementing some form of back off polling. It really doesn't matter in what form, as long as every item is not polled every polling cycle. My prior arguments for the previously rejected solution at hand was that is was available. It was being compared to solutions that didn't yet fully exist.
Implementing some form of batch querying. Gorm supports this, and it would reduce the processing by having a single query handle 200 or more items simultaneously without opening 200 different querying contexts.
However, if the maintainers wish to discard the currently observed to be error free solution for a new one, it's the maintainer's call. However, this does seem a bit reactionary, as if it were directly in response to not accepting one of many back off algorithms. I'd rather see a sub-optimal but functional back off algorithm accepted over a new means of scanning the database that clearly would miss records that arrive after a few minute delay.
For a volunteer effort, this also seems like re-inventing the solution by rewriting from scratch, an approach that is known to have costs and risks that often are under estimated. In many ways, this rolls back the clock on the current solution nearly six months, when these ideas were deemed insufficient to actually capture all changes.
* How long transactions take to commit in your database (for all but the prune transaction, this could be synthesized via RPC metrics). _Outliers are again of particular interest to determine a reasonable value for `N`._
There's no perfect answer to this, as we have very few observed items in the field that led to the failure this system was designed to fix. Or, it is equally possible that people didn't inform me of details which were the input into the current design. So, N
in the current design is "the entire duration a transaction can exist". A practical approach might permit N
to be much smaller, but a smaller N
comes with increasing risks that the cache and the database get out of sync and remain out of sync for long periods of time (days, or longer).
Providing some numbers:
registration entry changes in a typical day are <5k distributed evenly. however during a mass failover event we aim to handle a 200k-1M burst (which we recognize results in initially degraded latencies, but it is expected to eventually "even out")
Hmm. I think that failover event blows up the first assumption "Number of events in the last N minutes is manageably sized" in a crippling way. In previous discussions I assumed that 1M registration changes were more or less distributed over a one day period (which is a reasonable ~3000 events every 5 minutes). If those registration changes are going to come in as fast as they can in the failover scenario, that number is no longer reasonable.
@amoore877 I see a lot of focus on p99 numbers. The main issue is that anything that's below p99 works, but the failures are not statistically going to be captured by p99.
A single long-lived database transaction will be enough to cause cache synchronization issues. If we focus on p99, then that means we only need 100 successful transactions to ignore the failing one, and our cache will still be out-of-sync.
For this reason, I don't see this as an area where statistical analysis takes precedence over a very old, traditional concept of "code correctness". We had a system where one event outside of the p99 can break the system, and in the past we realized that saying it worked 99% of the time didn't buy sympathy for the one time it led to a running server that didn't reflect its backing database.
That's why we treat every database change that might exist as something to consider until it cannot exit. It is an extreme point of view, but one that ensures that the cache will always be in-sync.
Should we stop polling the "might newly exist" changes at the same frequency we poll the entire database for new records, we can reduce the impact of polling the "might newly exist" records a lot. The previously rejected submission reduced it by 1/60th, but that was an arbitrary setting. While I argued extensively for that solution (it had a lot of nice features, which led to its approach), any solution that polls with less frequency would be sufficient.
With the proposed approach in this Issue, any transaction that managed to live past the setting for N
would forever be missed, and the cache would be out of sync until one of the following occurred:
The first scenario is undesirable, because there's a lot that goes on in a server restart, including denying services to the other unaffected SVIDs.
The second scenario is undesirable, because nobody is going to make the call on what the frequency of "having the right data in the cache" is. Previous calls for this went unanswered. Late in our analysis, it was shared with me that Uber, the reporting entity, was updating their cache once a second. In many minds, any delays sill create failures on the Pod or process, leading to quick reloads. However, the faster the full reload, the more we will see the benefits of the db event approach disappear. At five seconds, it would be more efficient for the db event approach to simply not exist (but that also comes with 10,000 times more database CPU overhead).
The third scenario basically is indeterminable. One can't surmise when the record might be updated.
For this reason, I would feel more comfortable with a embarrassingly bad polling back off algorithm than re-writing the framework from scratch.
I put too much of my social capital in attempting to get the previously rejected back off algorithm accepted. I regret this, and to a degree, I partially see this solution as a wholesale call to disregard all lessons learned for a "simpler time".
Please do not be swayed by p99, because it means that after 100 cache updates, it's okay if your cache is out of sync once.
I recommend that we choose any back off algorithm, even one known to be embarrassingly bad, instead of rewriting the currently existing solution which (to my knowledge) has been running error free for multiple releases.
FWIW, this failover event would generate 1M skipped events to track for 12 hours (current transaction timeout value). Even with the backoff algorithm that has not yet been accepted, that would be roughly 16K queries a second for 12 hours.
@azdagron I'm ok with polling once per hour, as long as there is one poll that is guaranteed to be beyond the limit of the transaction timeout. That said, I fully understand that the people who are going to be impacted really are the people that have the authority to set the limit.
Unfortunately, there's no limit I've ever received. For this reason, I created a polling system that could be configured to poll at any limit, in any manner (some recommended exponential backoff, some indicated that a maximum limit would be desired, and linear backoff seemed better suited to the problem).
For this reason the backoff system I previously proposed was configurable, which made it far more complex than a non-configurable system. The decision to make it configurable I now see as a mistake.
I recommend that we implement your previous priority queue backoff, which I don't see as optimal, but at least adding it to a correct polling approach is preferable to possibly creating a new approach that would miss database changes. Of course, your approach uses random numbers, making it very difficult to test deterministically, but even that's better than "let's start all over again" which would likely recreate the issues that led to the solution we currently have.
The experimental events-based cache relies on tracking change events for registration entries and attested nodes. The current algorithm relies on the monotonically increasing property of the event ID. Previous assumptions around arrival order of events and event ID increment stepping have contributed to a series of fixes to the current algorithm that have complicated the codebase (e.g. skipped event discovery and tracking) and contributed to pessimistic database query load, when one of the primary goals for the events-based cache is to reduce database load in the steady state. Organizations whose deployments tickle the conditions above are not realizing as much benefit from the events-based cache as should be possible.
I propose the following algorithm change to the events based cache:
processed_events
, which is initially empty.N
minutes are retrieved, based on the event CreatedAt column, into a listrecent_events
processed_events
andrecent_events
recent_events
but not inprocessed_events
(i.e. new events) are processed and added toprocessed_events
processed_events
but not inrecent_events
(i.e. old events) are removed fromprocessed_events
processed_events
andrecent_events
(i.e. already processed events) are ignoredWe can handle failures to talk to the database in a given poll cycle by either:
N
, last time we successfully polled), orThis algorithm assumes a few things:
N
minutes is manageably sized: Seems reasonable if we keepN
small, e.g. 3 minutesN
minutes: As far as I know, the only mutating operation that has the potential for a long-lived transaction is that which prunes registration entries, which executes a single statement to delete all registration entries older than some threshold. When the registration entry set is large, this operation can take some time. We can likely reduce the per-transaction time by doing the deletion in small batches and more frequently.PROS:
N
minute window.CONS:
N
minute window, but will fully reduce afterN
minutes.I have prototyped the above algorithm and so far have not witnessed split brain in the cache though more testing is required.