Open VladRodionov opened 1 year ago
This is the problem statement, not an immediate-priority issue we have to invest time in. But two issues with the current algorithm must be fixed: memory miscalculation and possible under-eviction.
FYI - it's easier to follow these threads if you reply to the original message rather than posting a top-level comment.
There are two bugs here which are critical: they are referred to as points 2 and 5 in the original post. Otherwise, yes: the new eviction policy is a future thing.
This is all potentially interesting as an academic exercise, but given the overall mandate to stay laser-focused on getting to production in the first place, I'm not sure we should be spending this much time and energy optimizing our eviction algorithm.
Random eviction works better than LRU when access follows an anti-recency pattern (a recently accessed object is less likely to be accessed in the near future); this is not the case for any workload that is cacheable. Random eviction is on par with LRU when access is totally random, which is again an unrealistic workload. In all other cases LRU wins, and LRU is not even the most advanced policy. This paper has comparisons of different eviction policies, including random: https://arxiv.org/pdf/1512.00727.pdf. If random were so good, nobody would have invested their time in developing LRU, LFU, W-LFU, 2Q, LIRS, ARC, etc. The eviction algorithm from the paper you are referring to is not truly random; only its selection stage is random. It builds a small subset of objects by continuously sampling them at random from the main set, but then evicts the oldest objects from that subset, as far as I remember. It is surely worse than general LRU but better than random: it is an approximation of LRU that works on a small subset of the overall cache data.
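To make the LRU-vs-random comparison concrete, here is a small self-contained simulation. This is a sketch, not ReadySet code: the Zipf-like workload shape, cache size, and toy PRNG are all illustrative assumptions.

```rust
// Toy simulation comparing hit ratios of LRU and random eviction on a
// skewed (Zipf-like) workload. Illustrative only; not ReadySet code.

fn lcg_next(state: &mut u64) -> u64 {
    // Minimal deterministic PRNG so the example has no dependencies.
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    *state >> 33
}

/// Simulate a cache of `capacity` slots over `requests`, returning the hit
/// ratio. `lru == true` evicts the least recently used key; otherwise a
/// randomly chosen one.
fn hit_ratio(requests: &[u32], capacity: usize, lru: bool, seed: u64) -> f64 {
    let mut cache: Vec<u32> = Vec::with_capacity(capacity);
    let mut rng = seed;
    let mut hits = 0usize;
    for &key in requests {
        if let Some(pos) = cache.iter().position(|&k| k == key) {
            hits += 1;
            if lru {
                // Move-to-front keeps the vector ordered by recency.
                let k = cache.remove(pos);
                cache.insert(0, k);
            }
        } else {
            if cache.len() == capacity {
                let victim = if lru {
                    cache.len() - 1 // least recently used sits at the back
                } else {
                    (lcg_next(&mut rng) as usize) % cache.len()
                };
                cache.remove(victim);
            }
            cache.insert(0, key);
        }
    }
    hits as f64 / requests.len() as f64
}

fn main() {
    // Skewed workload: log-uniform draw over 1000 keys, so most accesses
    // concentrate on a small hot set.
    let mut rng = 42u64;
    let mut requests = Vec::new();
    for _ in 0..20_000 {
        let r = (lcg_next(&mut rng) % 1_000_000) as f64 / 1_000_000.0;
        let key = ((1000f64).powf(r) - 1.0) as u32;
        requests.push(key);
    }
    let lru = hit_ratio(&requests, 100, true, 7);
    let rnd = hit_ratio(&requests, 100, false, 7);
    println!("LRU hit ratio: {lru:.3}, random hit ratio: {rnd:.3}");
    assert!(lru > rnd, "LRU should beat random on a skewed workload");
}
```

On a workload with no skew the two policies converge; the gap in favor of LRU grows with the skew of the access distribution, which is the point being made above.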
> (there is a lot of information on the Internet)
This is less helpful than providing the direct sources on which you are basing your assertions.
I have seen random eviction used with some success and was wondering why you consider it the least efficient. LRU bookkeeping can be hard to maintain, especially across disjoint sets of data. https://danluu.com/2choices-eviction/
https://db.in.tum.de/~leis/papers/leanstore.pdf
> Yes. But we need memory allocated by RS only.
Why? Isn't the goal of eviction to keep the process under a certain memory threshold?
LRU eviction is done for readers, but not for internal materializations, because sending read information up the graph for LRU bookkeeping is a Hard Problem.
does this really belong in the "critical bugs for ER" project?
Eviction is performed in `Domain::handle_eviction`; the eviction candidate list is selected based on node sizes (not randomly). `MemoryState::evict_bytes`, which is called later, selects rows randomly (`MemoryState::evict_random`). I found no references to any LRU-related data structure in the eviction code path. Maybe I missed it, so I will double-check today.
On the efficiency of random vs. other policies: algorithm efficiency and memory usage are orthogonal. LRU, for example, requires additional bookkeeping, but that does not affect its efficiency, which is measured by the hit (or miss) ratio for a particular workload with a given cache size.
> the only workload where random eviction can compete with other algorithms is random uniformly distributed access
only if you don't factor in the bookkeeping cost of recording which keys have been most-recently read
the default eviction strategy is LRU, not random: https://github.com/readysettech/readyset/blob/5ec178edabf6871178ce136d1108ae64e67a4efa/readyset-server/src/lib.rs#L571
> How did you come to this conclusion? Why is random eviction the least efficient method?
On random eviction: the only workload where random eviction can compete with other algorithms is uniformly distributed random access. Algorithm designers usually compare their algorithm against state-of-the-art competitors and against LRU, which is the most commonly used policy. LRU beats random by a wide margin on all realistic workloads (there is a lot of information on the Internet).
> Isn't the memory allocated by rocksdb accounted for in the numerator of step 3 of the algorithm above? (Memory allocated as reported by jemalloc) / (Total memory reported by all domains)
Yes. But we need memory allocated by RS only.
> Is this implicitly assuming that the unit of equality/fairness for cache usage is a row rather than an amount of memory? Why is it necessarily better to strive for fairness between cache hit rates per table/row rather than % of memory usage of the cache that is involved in cache hits?
The major source of discrepancy between the memory actually allocated in `MaterializedNodeState` and the amount reported by `MaterializedNodeState::deep_size_of` is data-structure overhead. Total allocated memory = overhead + rows + keys. Overhead is O(N), where N is the number of rows; it does not depend on row and key sizes. So the smaller the rows, the larger the share of data-structure overhead in the total allocation. This means the relative overhead is smaller for wide rows than for narrow rows, and likewise for wide and narrow tables and the domains that host them (this probably does not always hold, but it does most of the time).
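The overhead argument can be illustrated with a toy calculation. The 64-byte per-row overhead below is an assumed figure for illustration only; the real overhead depends on the concrete data structures involved.

```rust
// Back-of-the-envelope illustration: with a fixed per-row overhead, narrow
// rows spend a larger fraction of their total allocation on overhead than
// wide rows do. The 64-byte per-row overhead is an assumption, not a
// measured ReadySet value.

/// Fraction of the total allocation that is data-structure overhead,
/// given total = overhead + rows + keys (all per-row, in bytes).
fn overhead_ratio(row_bytes: usize, key_bytes: usize, per_row_overhead: usize) -> f64 {
    let total = per_row_overhead + row_bytes + key_bytes;
    per_row_overhead as f64 / total as f64
}

fn main() {
    let narrow = overhead_ratio(16, 8, 64); // narrow row: 16 B data, 8 B key
    let wide = overhead_ratio(1024, 8, 64); // wide row: 1 KiB data, 8 B key
    println!(
        "narrow-row overhead share: {:.0}%, wide-row overhead share: {:.0}%",
        narrow * 100.0,
        wide * 100.0
    );
    assert!(narrow > wide);
}
```

The narrow-row case spends most of its allocation on overhead, which is why a size estimate that ignores overhead underestimates narrow tables more than wide ones.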
Random Eviction: The algorithm currently employs random eviction, which is the least efficient method when the working set size (WSS) exceeds the cache size.
How did you come to this conclusion? Why is random eviction the least efficient method?
Memory Miscalculation: The algorithm does not account for memory allocated by RocksDB, which is reported by jemalloc as part of the process memory usage.
Isn't the memory allocated by rocksdb accounted for in the numerator of step 3 of the algorithm above? (Memory allocated as reported by jemalloc) / (Total memory reported by all domains)
Domain Size Adjustment Bias: Domains hosting wide tables and domains hosting narrow tables often require different size adjustments. Currently, wide tables may be overestimated and narrow tables underestimated, because narrow rows typically incur proportionally more data-structure overhead than wide rows.
Is this implicitly assuming that the unit of equality/fairness for cache usage is a row rather than an amount of memory? Why is it necessarily better to strive for fairness between cache hit rates per table/row rather than % of memory usage of the cache that is involved in cache hits?
In the short term, we can focus on addressing issues 2 and 5. With REA-3411, we will gain the ability to estimate memory allocated by RocksDB, thereby enabling us to make more accurate estimates for RS memory allocation. This, in turn, will improve the overall algorithm to ensure correctness, with the key principle being that the evicted data size must always be equal to or greater than the requested size.
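A minimal sketch of what the issue-2 fix could look like once a RocksDB estimate is available: subtract RocksDB's allocation from the jemalloc total before computing the adjustment ratio, so only ReadySet-owned memory drives eviction. The function name and all numbers here are hypothetical, not the actual REA-3411 API.

```rust
// Hypothetical sketch: correct the adjustment ratio by excluding RocksDB's
// share of the jemalloc-reported allocation. Names and figures are
// illustrative assumptions, not ReadySet code.

/// Adjustment ratio computed over ReadySet-owned memory only.
fn adjustment_ratio(jemalloc_bytes: u64, rocksdb_bytes: u64, domains_reported: u64) -> f64 {
    let readyset_bytes = jemalloc_bytes.saturating_sub(rocksdb_bytes);
    readyset_bytes as f64 / domains_reported as f64
}

fn main() {
    let gib = 1u64 << 30;
    // Without the RocksDB correction the ratio is inflated; with an assumed
    // 3 GiB RocksDB estimate, only the remaining memory is attributed to RS.
    let naive = adjustment_ratio(8 * gib, 0, 4 * gib);
    let corrected = adjustment_ratio(8 * gib, 3 * gib, 4 * gib);
    println!("naive ratio: {naive}, corrected ratio: {corrected}");
    assert!(corrected < naive);
}
```

The inflated naive ratio is what causes domain sizes to be overestimated and more data to be evicted than necessary.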
Eviction is driven by the `Worker::do_eviction` function:

1. `Worker` selects the three largest domains as candidates for eviction.
2. It then applies a filter, discarding domains that are less than 10% of the size of the largest one. This filter can further shrink the list of candidate domains.
3. `Worker` employs an adjustment ratio calculated as follows: ratio = (memory allocated as reported by jemalloc) / (total memory reported by all domains).
4. `Worker` then calculates the excess memory above the limit, denoted as S, computed as (memory allocation reported by jemalloc) - limit.
5. `Worker` further calculates how many bytes each domain from the candidate list must evict, proportionally to their sizes. In other words, it calculates S1, S2, and S3 such that S1 + S2 + S3 = S (assuming three domains).
6. `Worker` sends each candidate domain a request to evict the computed number of bytes.
7. `Domain`s execute these eviction requests by removing data from their own eviction candidate lists (`Node`s). These lists are constructed in a similar manner to the `Worker`'s list.

The above algorithm exhibits several shortcomings, some of which can be immediately addressed, while others may require additional work. These issues are listed below in order of importance:
Overly Aggressive Eviction: The current algorithm tends to evict more data than necessary due to overestimation of large domain sizes. Additionally, it does not take downstream evictions into account.
Addressing these shortcomings is essential to enhance the efficiency, reliability, and accuracy of the eviction algorithm.
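The proportional split described in the algorithm above (S1 + S2 + S3 = S) can be sketched as follows. Rounding each share up also satisfies the key principle that the evicted data size must be at least the requested size; this is an illustration, not the actual `Worker` code.

```rust
// Sketch of the proportional eviction split: given excess S and the
// candidate domains' sizes, compute per-domain targets S1..Sn with
// S1 + ... + Sn >= S. Illustrative only; not ReadySet code.

/// Split `excess` bytes across domains proportionally to `sizes`,
/// rounding each share up so the total never falls short of `excess`.
fn proportional_targets(sizes: &[u64], excess: u64) -> Vec<u64> {
    let total: u64 = sizes.iter().sum();
    assert!(total > 0, "need at least one non-empty domain");
    sizes
        .iter()
        .map(|&s| {
            // Ceiling division of excess * s / total, in u128 to avoid overflow.
            let num = (excess as u128) * (s as u128);
            ((num + total as u128 - 1) / total as u128) as u64
        })
        .collect()
}

fn main() {
    let sizes = [600, 300, 100]; // three candidate domains, bytes
    let targets = proportional_targets(&sizes, 251);
    let evicted: u64 = targets.iter().sum();
    println!("targets: {targets:?}, total requested eviction: {evicted}");
    // Ceiling rounding may overshoot slightly, but never undershoots.
    assert!(evicted >= 251);
}
```

Rounding down instead (plain integer division) is one way the sum of per-domain targets can fall short of S, i.e. the under-eviction case called out above.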
Caching is a fundamental technique used in computing to optimize data access and retrieval. It involves storing frequently accessed data in a high-speed memory, typically faster but smaller than the main data storage, for quick and efficient retrieval. While caching greatly enhances performance, it introduces the challenge of managing limited cache space effectively. Data eviction, the process of deciding which data to remove from the cache to make room for new data, plays a pivotal role in ensuring the cache functions optimally.
In conclusion, data eviction is a critical component of caching systems that directly impacts system performance, resource utilization, and user experience. Well-designed eviction policies are essential for ensuring that caches remain efficient, responsive, and adaptive to changing data access patterns, ultimately contributing to the overall effectiveness of a system.
Ideally, a good eviction algorithm should meet all seven of these requirements. However, in practice, there is often a trade-off between functionality, performance, and complexity.
To be continued …
There is some description in http://docs/dataflow/memory_management.html that may be useful as a starting point.
Description
This ticket is to describe the current implementation of data eviction and to propose improvements to the eviction algorithm.
Change in user-visible behavior
N/A
Requires documentation change
Yes. The data eviction algorithm should be described in the documentation.
From SyncLinear.com | REA-3431