sigp / lighthouse

Ethereum consensus client in Rust
https://lighthouse.sigmaprime.io/
Apache License 2.0

Improve & unify parallel de-duplication caches #5112

Open michaelsproul opened 8 months ago

michaelsproul commented 8 months ago

Description

In https://github.com/sigp/lighthouse/pull/4879 I somewhat hastily introduced a new cache on tree-states to improve the performance of parallel HTTP requests. I'm now thinking that this wasn't the best way to implement it, because:

Version

Lighthouse v4.6.111-exp.

Steps to resolve

The basic idea is:

The problem is how to manage freezer states. They aren't in the regular state_cache and have very different logic for fetching, including the ability to be indexed by slot. One option would be two promise caches, where the freezer one sits in front of the diff_buffer_cache (indexed by slot). Alternatively, it may make sense to revive the historic state cache, which so far is lying dormant on the tree-states branch (replaced by the diff buffer cache).
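
For concreteness, the two-promise-cache option might be shaped roughly like this (all type names here are hypothetical placeholders, not the actual tree-states types):

```rust
// Illustrative placeholder types, not the real tree-states ones.
type Hash256 = [u8; 32];
type Slot = u64;
struct StateCache;
struct DiffBufferCache;
struct PromiseCache<K>(std::marker::PhantomData<K>);

/// Sketch of the "two promise caches" layout: a slot-indexed promise
/// cache sits in front of the diff_buffer_cache for freezer loads,
/// alongside a root-indexed one in front of the state_cache.
struct Store {
    state_cache: StateCache,
    state_promises: PromiseCache<Hash256>,
    diff_buffer_cache: DiffBufferCache,
    diff_buffer_promises: PromiseCache<Slot>,
}
```

The appeal of this shape is that each promise cache is keyed the same way as the cache it fronts, so freezer loads can de-duplicate by slot without knowing state roots.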

dknopik commented 7 months ago

I spent some time thinking about this.

Pulling the parallel state cache down into the store does not feel quite right to me, even if it only stores slots or state roots. Since the parallel state cache (at least in its current implementation) evicts in arbitrary order, a value's presence in the parallel state cache would not guarantee its presence in the state_cache or diff_buffer_cache respectively.

A variant of the PromiseCache that does not retain finished values would be more suitable, in my opinion (called VariantPromiseCache below for clarity). The flow when getting a state via e.g. get_state might look like this (see the sketch after the list):

  1. Check in the VariantPromiseCache (by state root) whether a load for this state is already in progress. If so, await the oneshot broadcast channel. Otherwise:
  2. Check the adequate specialized caches (state_cache and/or diff_buffer_cache). If a state is found, use it. Otherwise:
  3. Create a oneshot broadcast channel and add it to the VariantPromiseCache.
  4. Load the state.
  5. Broadcast the loaded state and remove the channel from the VariantPromiseCache.
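
A minimal sketch of that flow, using plain std primitives so it's self-contained (`VariantPromiseCache`, `get_or_load` and the two closures are placeholder names rather than an actual Lighthouse API; the real thing would presumably use the oneshot broadcast channels mentioned above instead of a condvar):

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::{Arc, Condvar, Mutex};

/// One in-flight load: waiters block on the condvar until the value is set.
struct Promise<V> {
    value: Mutex<Option<V>>,
    ready: Condvar,
}

/// A promise cache that only tracks in-progress loads and never retains
/// finished values.
struct VariantPromiseCache<K, V> {
    in_progress: Mutex<HashMap<K, Arc<Promise<V>>>>,
}

impl<K: Hash + Eq + Clone, V: Clone> VariantPromiseCache<K, V> {
    /// `check_caches` stands in for the state_cache/diff_buffer_cache
    /// lookup (step 2) and `load` for the actual state load (step 4).
    fn get_or_load(
        &self,
        key: K,
        check_caches: impl FnOnce() -> Option<V>,
        load: impl FnOnce() -> V,
    ) -> V {
        let mut map = self.in_progress.lock().unwrap();

        // Step 1: if a load is already in progress, wait for its result.
        if let Some(promise) = map.get(&key).cloned() {
            drop(map);
            let mut guard = promise.value.lock().unwrap();
            while guard.is_none() {
                guard = promise.ready.wait(guard).unwrap();
            }
            return guard.as_ref().unwrap().clone();
        }

        // Step 2: check the specialized caches. This sketch holds the
        // promise cache lock while doing so (see the open question below).
        if let Some(value) = check_caches() {
            return value;
        }

        // Step 3: register a promise so concurrent callers wait on us.
        let promise = Arc::new(Promise {
            value: Mutex::new(None),
            ready: Condvar::new(),
        });
        map.insert(key.clone(), Arc::clone(&promise));
        drop(map);

        // Step 4: perform the expensive load outside the lock. The caller
        // is responsible for also inserting it into the specialized caches.
        let value = load();

        // Step 5: broadcast the value and remove the promise, so finished
        // values live only in the specialized caches.
        *promise.value.lock().unwrap() = Some(value.clone());
        promise.ready.notify_all();
        self.in_progress.lock().unwrap().remove(&key);
        value
    }
}
```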

I am not sure about the details of that idea yet, e.g. whether a lock on the VariantPromiseCache should be held while the specialized caches are checked.

If this idea makes any sense, I am open to implementing it to check whether it works well, though I would probably need some advice on proper benchmarking.

michaelsproul commented 7 months ago

@dknopik Hey, that's awesome. I came to the same conclusion about needing to remove completed values from the cache. I started working on it on this branch but got distracted by other things: https://github.com/michaelsproul/lighthouse/commits/tree-states-super-cache/

I was thinking we could check the state_cache/diff_buffer_cache prior to checking the promise cache, as that way a hit avoids interacting with the promise cache entirely. We may just need to be careful about lock ordering to avoid deadlocks.

For the freezer, I was thinking that we could iterate back through the slots corresponding to diff layers, checking the diff_buffer_cache/promise_cache for each one. So if we're loading the state at slot 2049 and it could be built using diffs from 2048, 1024 and 0 (for example), we would check for 2049 in the diff cache; if it's there, use it; otherwise check the promise cache and add a promise for it (because we will compute it). Then we'd check for 2048 and do the same, and so on until we get a hit or reach the snapshot layer (slot 0). As we load diff buffers, we add them to the cache and fulfil any promises we made for them (see the sketch below). This seems like it could be a bit complicated to reason about, though.
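
To make that concrete, here's a hypothetical sketch of the backward walk (the uniform `spacing`, the `Plan` enum, and the closures are all illustrative stand-ins for the real diff layer schema and cache types):

```rust
/// Illustrative only: uniform diff spacing with a snapshot at slot 0,
/// standing in for the real diff layer layout.
fn diff_layer_slots(target: u64, spacing: u64) -> Vec<u64> {
    let mut slots = vec![target];
    let mut slot = target / spacing * spacing;
    loop {
        if slot != *slots.last().unwrap() {
            slots.push(slot);
        }
        if slot == 0 {
            break;
        }
        slot -= spacing;
    }
    slots
}

/// What the walk decides to do (all names hypothetical).
enum Plan {
    /// Found a cached diff buffer: build forward from it, fulfilling the
    /// promises we registered for the slots above it.
    BuildFrom { base: u64, promised: Vec<u64> },
    /// Another thread is loading this slot: wait on its promise, then
    /// build forward as above.
    WaitAt { slot: u64, promised: Vec<u64> },
    /// No hits at all: load the snapshot and build the whole chain.
    LoadSnapshot { promised: Vec<u64> },
}

fn plan_diff_buffer_load(
    target: u64,
    spacing: u64,
    cached: impl Fn(u64) -> bool,      // diff_buffer_cache hit?
    in_progress: impl Fn(u64) -> bool, // promise already registered?
    mut promise: impl FnMut(u64),      // register our own promise
) -> Plan {
    let mut promised = Vec::new();
    // E.g. target 2049, spacing 1024 => check 2049, 2048, 1024, 0 in turn.
    for slot in diff_layer_slots(target, spacing) {
        if cached(slot) {
            return Plan::BuildFrom { base: slot, promised };
        }
        if in_progress(slot) {
            return Plan::WaitAt { slot, promised };
        }
        // We'll compute this buffer ourselves, so promise it to others.
        promise(slot);
        promised.push(slot);
    }
    Plan::LoadSnapshot { promised }
}
```

The `WaitAt` case glosses over the trickiest part: after waiting, we still have to build forward and fulfil the promises we registered for the slots above it, which is where most of the "complicated to reason about" lives.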

If you'd like to work on this I'd love to continue collaborating async. Feel free to use or not use my branch that I started as you see fit!

dknopik commented 7 months ago

Awesome, I'll check it out and experiment a bit and get back to you with any findings! :)