Is your feature request related to a problem? Please describe
Currently in tiered/pluggable caching, keys move from the on-heap tier to the disk tier when they are evicted by the heap tier's LRU policy. But unless a key is evicted from the disk tier and then re-added to the heap tier, there's no way for it to move back up.
Consider a key that gets hits moderately often. It can get evicted from a small LRU heap tier, since the heap tier can’t hold many keys. But, it’s not likely to be evicted from the large LRU disk tier: there are so many more keys there that it will rarely be the least-recently used key. So, all future hits on this key will come from the disk tier, which is slower than the heap tier. This problem should worsen the longer the node is running as more and more keys get stuck.
In some cases, key invalidation helps by removing stale keys from both tiers. But not always - for example, on an index of past logs, there won’t be any invalidation.
Ideally, we would keep the few most frequently-used keys in the heap tier and use the larger disk tier for a larger number of less-frequently used keys.
Describe the solution you'd like
We want the TieredSpilloverCache to keep track of how often keys get hits from the disk tier. If a key gets more hits than some threshold, we move it back to the heap tier (“promoting” it). This causes the heap tier to evict its least-recently used key into the disk tier. In practice we can’t do exactly this for performance reasons.
Keeping an in-memory hit counter for every on-disk key is not feasible. Instead, we can use a memory-efficient frequency-approximating data structure like CountMinSketch, which is used by high-performance caching libraries like Caffeine. The data structure is pretty simple, so we could implement it ourselves or use some library's implementation. Because the values in CountMinSketch can decay over time in a configurable way, we don't have to worry about removing keys from the structure, or about the frequency estimates creeping up over time and leading to too many promotions.
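As a rough illustration, here is a minimal sketch of a count-min-style frequency sketch with periodic aging, loosely in the spirit of Caffeine's FrequencySketch. The depth, table width, and aging policy below are placeholder assumptions, not a proposed implementation:

```java
import java.util.Random;

/**
 * Minimal count-min sketch with periodic aging (counter halving).
 * Illustration only: sizes and the aging policy are placeholder choices.
 */
class FrequencySketch<K> {
    private static final int DEPTH = 4;   // number of hash rows
    private final int[][] table;          // DEPTH x width counters
    private final int[] seeds;            // one hash seed per row
    private final int width;
    private final int sampleSize;         // increments between aging passes
    private int increments;

    FrequencySketch(int width, int sampleSize) {
        this.width = width;
        this.sampleSize = sampleSize;
        this.table = new int[DEPTH][width];
        this.seeds = new int[DEPTH];
        Random r = new Random(42);
        for (int i = 0; i < DEPTH; i++) seeds[i] = r.nextInt();
    }

    /** Record one hit for the key; age all counters periodically. */
    void increment(K key) {
        for (int i = 0; i < DEPTH; i++) {
            table[i][index(key, i)]++;
        }
        if (++increments >= sampleSize) {
            age();
            increments = 0;
        }
    }

    /** Estimated hit count: the minimum across rows (count-min never undercounts, only overcounts). */
    int estimate(K key) {
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < DEPTH; i++) {
            min = Math.min(min, table[i][index(key, i)]);
        }
        return min;
    }

    /** Halve every counter so old traffic decays instead of accumulating forever. */
    private void age() {
        for (int[] row : table) {
            for (int j = 0; j < row.length; j++) row[j] >>>= 1;
        }
    }

    private int index(K key, int row) {
        int h = key.hashCode() ^ seeds[row];
        h ^= (h >>> 16);
        return Math.floorMod(h, width);
    }
}
```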
Promotions will have some small performance cost, so we can't use very small threshold values like 1. I don't expect the cost to be very large though: a promotion is just invalidating the key from the disk tier, adding it to the heap tier, and having the heap tier evict some other key. The last two steps already happen on every cache miss, and invalidation already happens regularly for stale keys. So, as long as the number of promotions is much smaller than the total number of cache misses, I don't expect much overhead from the promotions themselves.
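A hypothetical sketch of where this check could sit on the disk-tier hit path. The tier interface, field names, and threshold are placeholders rather than the actual TieredSpilloverCache internals; FrequencySketch is the sketch from the block above:

```java
/** Stand-in for a cache tier API; not the real OpenSearch cache interface. */
interface Tier<K, V> {
    V get(K key);
    void put(K key, V value);
    void invalidate(K key);
}

/** Hypothetical promotion hook for hits served from the disk tier. */
class DiskTierPromoter<K, V> {
    private final Tier<K, V> heapTier;
    private final Tier<K, V> diskTier;
    private final FrequencySketch<K> sketch;
    private final int promotionThreshold;

    DiskTierPromoter(Tier<K, V> heapTier, Tier<K, V> diskTier,
                     FrequencySketch<K> sketch, int promotionThreshold) {
        this.heapTier = heapTier;
        this.diskTier = diskTier;
        this.sketch = sketch;
        this.promotionThreshold = promotionThreshold;
    }

    /** Called when a get() misses the heap tier and falls through to the disk tier. */
    V getFromDiskTier(K key) {
        V value = diskTier.get(key);
        if (value != null) {
            sketch.increment(key);
            if (sketch.estimate(key) >= promotionThreshold) {
                diskTier.invalidate(key);   // drop the disk-tier copy
                heapTier.put(key, value);   // heap tier may spill its LRU key back to disk
            }
        }
        return value;
    }
}
```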
Related component
Storage:Performance
Describe alternatives you've considered
Promoting based on a fixed threshold may cause significantly different numbers of promotions depending on the workload, especially the fraction of queries that can be repeated. We may want to tune the threshold based on the cache hit ratio or a similar signal. This would probably be tested through benchmarks.
As a more distinct alternative, we could use Caffeine as the heap tier instead of the LRU OpenSearch cache. Caffeine has a smarter eviction policy that approximates LFU, so the keys that spill to the disk tier are less likely to get many hits in the future. This reduces the “stuck keys” problem without doing any promotions at all. Some work has already been done on this, but so far we haven't gotten it to outperform the default cache. Caffeine could also be used alongside key promotion rather than instead of it.
Additional context
Some preliminary numbers that give an idea of how big the problem is, and upper bounds on how much we can expect this to improve performance:
How often do keys get stuck in the disk tier?
We can run a simulation of the randomized nyc_taxis workload we've used to measure tiered caching (TC) performance, tracking hit frequencies for all keys in both the heap and disk tiers, and see whether keys actually do get stuck. We measure the fraction of total hits coming from the heap tier, the “heap hit rate fraction” or “HHRF”. Maximizing HHRF should increase performance.
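For concreteness, here's a toy sketch of what such a two-tier simulation could look like, with HHRF computed as heap hits divided by total hits. The access-ordered LinkedHashMaps standing in for LRU tiers and the integer keys are illustrative assumptions, not the actual simulation code:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

/** Toy two-tier LRU simulation for measuring HHRF = heapHits / (heapHits + diskHits). */
class TieredSimulation {
    private final Map<Integer, Boolean> diskTier;
    private final Map<Integer, Boolean> heapTier;
    private long heapHits, diskHits, misses;

    TieredSimulation(int heapSize, int diskSize) {
        this.diskTier = lru(diskSize, null);                         // disk evictions just drop the key
        this.heapTier = lru(heapSize, k -> diskTier.put(k, true));   // heap evictions spill to disk
    }

    /** Process one request; on a full miss the key is "computed" and cached on the heap tier. */
    void access(int key) {
        if (heapTier.get(key) != null) {          // get() refreshes LRU order
            heapHits++;
        } else if (diskTier.get(key) != null) {
            diskHits++;                           // without promotion, the key stays on disk
        } else {
            misses++;
            heapTier.put(key, true);
        }
    }

    double hhrf() {
        return (double) heapHits / (heapHits + diskHits);
    }

    /** Access-ordered LinkedHashMap as a simple LRU with an eviction callback. */
    private static Map<Integer, Boolean> lru(int maxSize, Consumer<Integer> onEvict) {
        return new LinkedHashMap<Integer, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, Boolean> eldest) {
                boolean evict = size() > maxSize;
                if (evict && onEvict != null) onEvict.accept(eldest.getKey());
                return evict;
            }
        };
    }
}
```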
The keys are ordered along the x-axis by their “Zipf index” i. In each benchmark iteration, this determines the probability we pick the query matching key i, with smaller values of i being the most probable: key i has probability proportional to 1 / i^alpha. This follows how randomization works in OpenSearch Benchmark (OSB). Note the log-scale x-axis. The figure is set up so that the area of orange/blue is proportional to the total number of heap/disk hits.
Orange means a hit from the heap tier; blue means a hit from the disk tier. We can see the few most common keys are always served from the heap tier (their whole bar is orange). But, as early as the ~10th most common key, we see hits from the disk tier. For cheap queries this happens by the ~30th most common key; cheap queries happen more often in the test so their keys are less likely to be evicted from the heap tier.
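For reference, a hypothetical sketch of the Zipf-distributed key selection described above, where key i is drawn with probability proportional to 1 / i^alpha. This is an illustration, not the actual OSB randomization code:

```java
import java.util.Arrays;
import java.util.Random;

/** Draws 1-based Zipf indices: P(i) proportional to 1 / i^alpha, smaller i more probable. */
class ZipfSampler {
    private final double[] cumulative;   // cumulative probabilities for keys 1..n
    private final Random random = new Random();

    ZipfSampler(int n, double alpha) {
        double[] weights = new double[n];
        double total = 0;
        for (int i = 0; i < n; i++) {
            weights[i] = 1.0 / Math.pow(i + 1, alpha);
            total += weights[i];
        }
        cumulative = new double[n];
        double running = 0;
        for (int i = 0; i < n; i++) {
            running += weights[i] / total;
            cumulative[i] = running;
        }
    }

    /** Returns a 1-based Zipf index. */
    int next() {
        double u = random.nextDouble();
        int idx = Arrays.binarySearch(cumulative, u);
        int pos = idx >= 0 ? idx : -idx - 1;
        return Math.min(pos, cumulative.length - 1) + 1;   // clamp against floating-point rounding
    }
}
```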
How much does enabling promotion increase HHRF?
In the simulation, we increment a hit counter every time a disk-tier key gets a hit. If this counter reaches some threshold value, we immediately move the key up to the heap tier. This is an idealized version of what we might do in practice.
If we run the above test with various values for this threshold, we see varying levels of improvement:
The most improvement in HHRF is seen for small threshold values, with the minimum value of 1 performing the best by far. But even for large values like 20, we see substantial improvement over the baseline with no promotion at all.
The number of promotions was much higher for small threshold values:
But even for large values with few promotions, we still see a significant HHRF improvement. Of course this could be pretty workload-dependent.