Smarter Authorized Entry Caching

azdagron commented 3 years ago

In 0.12.0 we introduced a full in-memory authorized entry cache. This greatly reduced the server load in large deployments by doing a single full table scan of the entries and attested nodes just once every interval (5s) instead of the potentially deep authorized entry crawls that were happening on per-agent on every sync interval. This cache essentially made the authorized entry lookup performance consistent, which the database load linearly scaled to the number of entries and agents in the database.

However, it is not a complete solution. Each server is doing this full reload every five seconds. In large deployments, this can still equate to a giant amount of load on the shared database. Issues like #2147 have been opened to try and mitigate this database load. While these kind of band-aids might work in the short term, there is a smell here around this kind of configurable even being necessary. This kind of knob, along with the agent sync interval knob, are hard to reason about, have unintended consequences (i.e. how long it takes for an entry change to propagate to the agents and down to workloads), and have strange interplay.

We have an opportunity to do something smarter here that precludes doing a full scan every five seconds and removes the need for any of these kind of knobs.

One possibility is for each server to determine the "diff" of the cache state since the last reload and only load what is necessary. There are probably other solutions. The ultimate solution should 1) remove the need for tuning, 2) provide reasonable responsiveness to entry/agent changes, 3) have optimally reduced database load.

zmt commented 3 years ago

This reminds me of old memcache. Could we implement a distributed write-through cache?

As in something like:

maintain a cache distribution stream among all spire-server peers
when any spire-server gets a write, it streams to all its peers in parallel to writing to the DB

That would effectively relegate the DB to a recovery/re-sync mechanism. Is this crayz talk? Would this just introduce additional unwanted complexity in the form of a vector clock or similar distributed sync mechanism? Could we use something like memcached with some modifications for our desired security/consistency properties?

rturner3 commented 3 years ago

Implementing a custom distributed cache in SPIRE would be a pretty huge undertaking that I think would add more complexity than is necessary for a solution to this problem. I think there are some simpler solutions we can consider before going down that route.

rturner3 commented 3 years ago

I tried to brainstorm a few high-level possible solutions to this problem. I'd be curious to hear others' thoughts around this and whether any of the cons in the first couple options feel like "deal-breakers".

Option 1 - Server dynamically determines cache sync interval based on cache hydration time

Rather than relying on a static cache sync interval for the server, introduce new logic in the server that dynamically computes the duration it should wait until attempting the next cache hydration based on how long the previous entry cache hydration took. A simple implementation of this could be for the server to wait for the same duration that the cache sync took to complete, e.g. if the cache hydration took 10 seconds, wait 10 seconds until attempting the next cache sync.

Pros

Adapts query frequency to real world performance of database to prevent flooding it with requests - flexible for different deployment models and scaling scenarios
Minimal changes to existing design

Cons

Difficult to reason about identity propagation delay as a SPIRE operator/registrant
Schedulers and workloads often make assumptions about workload identity availability and define reasonable timeouts around how long it should take for a workload to get its identity on startup, or more generally, how long the workload should take to initialize in a healthy state. Since workload identity propagation delay is significantly more dynamic with this design, defining sane application startup timeouts is more of a guessing game since the control plane behavior is less predictable and may change over time.
Doesn't actually address the problems around heavy database load generated from the cache sync, just spaces the load spikes out
May result in excessive propagation delay

Option 2 - Server dynamically determines cache sync interval based on size of cache

Essentially the same as Option 1 except that rather than using last cache hydration duration as the input for determining the cache sync interval, select a cache sync interval based on the size of the cache, i.e. # of registrations and # of non-expired agents. The server would have a hardcoded set of sync intervals for different cache sizes and use the appropriate sync interval based on the cache size.

Below is just an example of what such a sync interval table could look like: < 10,000 registrations + agents = 5 second sync interval < 100,000 registrations + agents = 10 second sync interval < 500,000 registrations + agents = 30 second sync interval > 500,000 registrations + agents = 60 second sync interval

Pros

Query frequency is adapted based on the "size" of data, but in a more deterministic fashion than Option 1
Simple to implement

Cons

The number of instances in a server HA deployment, the resources available to the DB cluster, and overall DB performance are not directly considered in the sync interval calculation. The server uses the data size as a heuristic for these scale dimensions.
May result in increase in identity propagation delay as a deployment scales up over time, which could be unexpected to operators and integrators
Similar to Option 1, does not actually remove the high load on the database

Option 3 - Change rebuild of full cache to sync only the changes from database

The idea here is to avoid doing full table scans of the registered_entries, attested_node_entries, and node_resolver_map_entries database tables to rebuild the cache. Instead, whenever rows in those tables are added, modified, or deleted, track those events in a new event table. The events can be populated by a SQL trigger that gets added to the database by the sql datastore code.

The cache rebuild task would follow a procedure such as:

cache_build_time = current time
Read event table for all events with timestamps >= last_cache_rebuild_time and < cache_build_time
For each create event, fetch the corresponding record from the DB and add it to the cache
For each update event, fetch the corresponding record from the DB and update the existing value in the cache
For each delete event, find the corresponding item in the cache and delete it
last_cache_rebuild_time = cache_build_time

To prevent the event table from growing indefinitely, there can be a background task in SPIRE server to purge old events after some period of time, e.g. 30 days. Since the events are only internally consumed by SPIRE, they likely can be purged even more frequently.

Pros

Minimizes load on database and greatly reduces database resource requirements for higher scale environments
Cache sync interval can be static and unconfigurable again, reducing the chance of system misconfiguration

Cons

Requires schema changes
More complex implementation that will take several releases to fully realize

evan2645 commented 3 years ago

I like option 3 because it feels like a solution to the underlying problem.

I believe we already store timestamps in the DB under the covers, perhaps that can be utilized without a migration?

For deletes, one idea is to periodically get the count, and if we see a mismatch, we can iterate by entry id and purge the old ones? I think entry id is indexed so hopefully that hit wouldn't be as hard as what we're doing now.

This could be a bad idea :) but moving away from a really expensive query that we have to manage carefully feels like the ideal direction

zmt commented 3 years ago

Implementing a custom distributed cache in SPIRE would be a pretty huge undertaking that I think would add more complexity than is necessary for a solution to this problem. I think there are some simpler solutions we can consider before going down that route.

To clarify, I started my comment about "implementing a distributed write-through cache" with the idea of implementing something in SPIRE since this issue was framed in terms of optimizing current cache implementation in SPIRE. I was focused on "fixing" the lack of distributed/shared cache as a mechanism to reduce DB impact. By the end of drafting my earlier suggestion, I wound up convinced that developing an option (a plugin, perhaps?) to adapt an existing, purpose-built solution to write-through distributed cache (e.g. memcache, redis, hazelcast, ignite, etc.) would be far better than re-inventing. I should have spelled that out more clearly. Of course, careful risk analysis and proper config advice/enforcement would be required before introducing such a potential vector.

I like option 3 because it feels like a solution to the underlying problem.

I also like option 3 as an improvement to the current cache implementation. Iff the non-distributed cache implementation stops continuing to scale with optimizations like this, we should reconsider other options like I suggested above.

rturner3 commented 2 years ago

For deletes, one idea is to periodically get the count, and if we see a mismatch, we can iterate by entry id and purge the old ones? I think entry id is indexed so hopefully that hit wouldn't be as hard as what we're doing now.

This could be a bad idea :) but moving away from a really expensive query that we have to manage carefully feels like the ideal direction

I think trying to guess if entries were deleted based on the number of rows in the table wouldn't be a robust enough solution since a creation + a deletion in the same sync period could result in the same row count as the previous check. In that case, the cache wouldn't be properly updated to delete the old removed entry.

Apologies for letting this lie for so long. Just wanted to circle back - it sounds like there is a generally positive response to option 3 as a solution. Any other thoughts @evan2645 @azdagron @amartinezfayo @MarcosDY or are we good to proceed?

edwbuck commented 1 year ago

Options 1 and 2 may solve a pain point, but it still does unnecessary work of loading non-changed entries. The only difference is the policy on how frequently the cache is reloaded.

Option 3 is the only real algorithmic solution, and it would put the network response from the database to be roughly equivalent to the amount of change, instead the size of the database.

This means that future efforts on this task will be focused around capturing / detecting the changes to the database from the server's last time it was sync'd and altering the cache update to update rows within the cache instead of the entire cache. Care must be taken to appropriately detect removals, as we have no "default" cache eviction policy, so cache eviction must must be explicitly controlled.

evan2645 commented 1 year ago

Just wanted to quickly link here to an issue which might be a pre-req to some of this work, wherein we need to document which database backends are officially supported by the project: https://github.com/spiffe/spire/issues/3711

azdagron commented 1 year ago

Closing this in favor of #4498, which tracks the work remaining on this plan of action.

spiffe / spire