readysettech / readyset

Readyset is a MySQL and Postgres wire-compatible caching layer that sits in front of existing databases to speed up queries and horizontally scale read throughput. Under the hood, ReadySet caches the results of cached select statements and incrementally updates these results over time as the underlying data changes.
https://readyset.io
Other
4.3k stars 120 forks source link

Data eviction in Readyset Server #509

Open VladRodionov opened 1 year ago

VladRodionov commented 1 year ago

Description

This ticket is to describe the current implementation of data eviction and to propose improvements to the eviction algorithm.

Change in user-visible behavior

N/A

Requires documentation change

Yes. The data eviction algorithm should be described in the documentation.

From SyncLinear.com | REA-3431

VladRodionov commented 1 year ago

This is the problem statement, not an immediate priority issue we have to invest time in. But two issues with the current algorithm must be fixed: memory miscalculation and possible under-eviction.

VladRodionov commented 1 year ago

fyi - it's easier to follow these threads if you reply to the original message rather than posting a top-level comment

VladRodionov commented 1 year ago

There are two bugs here, which are critical: they are referred to as point 2 and point 5. in the original post. Otherwise, yes - the new eviction is a future thing.

VladRodionov commented 1 year ago

This is all potentially interesting as an academic exercise, but I'm not sure given the overall mandate to stay laser-focused on getting to production in the first place that we should really be spending this much time and energy looking into optimizing our eviction algorithhm

VladRodionov commented 1 year ago

Random eviction works better than LRU when access follows anti-recency pattern (recently accessed object is less likely be accessed in a near future) - this is not the case for any workload which is cacheable. Random eviction is on par with LRU when access is totally random, again this is unrealistic workload. In all other cases LRU wins. And LRU is not the most advanced policy. This paper has some comparisons for different eviction policies, including random. https://arxiv.org/pdf/1512.00727.pdf. If Random would be so good nobody would invest their time to develop LRU, LFU, W-LFU, Q2, LIRS, ARC etc etc. The eviction algorithm from a paper you were referring to is not a truly Random. It has random selection stage only. It creates small subset of objects by randomly selecting them (continuously) from a main set, but then it evicts the oldest objects from this subset, as far as I remember. It surely worse than general LRU but better than Random. Its approximation of LRU which works on a small subset of overall cache data.

VladRodionov commented 1 year ago

(there are a lot information on the Internet)

This is less helpful than providing some direct sources off of which you are basing your assertions.

I have seen random eviction be used with some success and was wondering why you consider it the least efficient. LRU can be hard to maintain a record of especially across disjoint sets of data.https://danluu.com/2choices-eviction/

https://db.in.tum.de/\~leis/papers/leanstore.pdf

> Yes. But we need memory allocated by RS only.

Why? Isn't the goal of eviction to keep the process under a certain memory threshold?

VladRodionov commented 1 year ago

LRU eviction is done for readers, but not for internal materializations, because sending information about reads to compute LRU bookkeeping up the graph is a Hard Problem

VladRodionov commented 1 year ago

does this really belong in the "critical bugs for ER" project?

VladRodionov commented 1 year ago

Eviction is performed in Domain::handle_eviction, eviction candidate list is selected based on node sizes (not random). MemoryState::evict_bytes which is called later selects rows randomly (MemoryState::evict_random). I found no references to any LRU- related data structure in the eviction code path. May be I missed it, so I will double check today.

On efficiency of Random vs. others. Algorithm efficiency and memory usage are orthogonal. Additional bookkeeping is required for LRU, for example, but it does not affect its efficiency, which is measured by hit or miss ratio for a particular workload with a given cache size.

VladRodionov commented 1 year ago

the only workload where random eviction can compete with other algorithms is random uniformly distributed access

only if you don't factor in the bookkeeping cost of recording which keys have been most-recently read

VladRodionov commented 1 year ago

the default eviction strategy is LRU, not random: https://github.com/readysettech/readyset/blob/5ec178edabf6871178ce136d1108ae64e67a4efa/readyset-server/src/lib.rs#L571

VladRodionov commented 1 year ago

How did you come to this conclusion? Why is random eviction the least efficient method?

On random eviction. The only workload where random eviction can compete with other algorithms is random uniformly distributed access. Every algorithm designer usually compares his algorithm against state-of-the art competitors and against LRU which is most commonly used. LRU beats Random by a wide margin on all realistic workloads (there are a lot information on the Internet)

Isn't the memory allocated by rocksdb accounted for in the numerator of step 3 of the algorithm above? (Memory allocated as reported by jemalloc) / (Total memory reported by all domains)

Yes. But we need memory allocated by RS only.

Is this implicitly assuming that the unit of equality/fairness for cache usage is a row rather than an amount of memory? Why is it necessarily better to strive for fairness between cache hit rates per table/row rather than % of memory usage of the cache that is involved in cache hits?

The major source of discrepancy between allocated memory in MaterializedNodeState and reported by MaterializedNodeState::deep_size_ofis data structures overhead. Total allocated memory = overhead + rows + keys. Overhead is O(N), where N is number of rows it does not depend on row and key sizes. So, the smaller rows the larger ratio of data structures overhead in a total memory allocation. It means that for wider rows relative overhead is smaller than for narrow rows, ditto for wide and narrow tables and domains which host them (probably does not hold always, but most of the time).

VladRodionov commented 1 year ago

Random Eviction: The algorithm currently employs random eviction, which is the least efficient method when the working set size (WSS) exceeds the cache size.

How did you come to this conclusion? Why is random eviction the least efficient method?

Memory Miscalculation: The algorithm does not account for memory allocated by RocksDB, which is reported by jemalloc as part of the process memory usage.

Isn't the memory allocated by rocksdb accounted for in the numerator of step 3 of the algorithm above? (Memory allocated as reported by jemalloc) / (Total memory reported by all domains)

Domain Size Adjustment Bias: Domains that host wide tables and narrow tables often require different size adjustments. Currently, there may be overestimation for wide tables and underestimation for narrow tables. Narrow tables, with their narrow rows, typically incur additional memory overhead compared to wide rows.

Is this implicitly assuming that the unit of equality/fairness for cache usage is a row rather than an amount of memory? Why is it necessarily better to strive for fairness between cache hit rates per table/row rather than % of memory usage of the cache that is involved in cache hits?

VladRodionov commented 1 year ago

In the short term, we can focus on addressing issues 2 and 5. With REA-3411, we will gain the ability to estimate memory allocated by RocksDB, thereby enabling us to make more accurate estimates for RS memory allocation. This, in turn, will improve the overall algorithm to ensure correctness, with the key principle being that the evicted data size must always be equal to or greater than the requested size

VladRodionov commented 1 year ago

Data eviction in RS - current implementation

  1. Eviction in RS occurs periodically at predefined intervals and is implemented within the Worker::do_eviction function.
  2. The Worker selects the three largest domains as candidates for eviction. Subsequently, it applies a filter, discarding domains that are less than 10% of the size of the largest one. This filter has the potential to further reduce the list of domain candidates.
  3. To adjust the sizes of these domains from the eviction candidate list, the Worker employs an adjustment ratio calculated as follows: ratio = (Memory allocated as reported by jemalloc) / (Total memory reported by all domains).
  4. The Worker then calculates the excess memory above the limit, denoted as 'S,' which is computed as (memory allocation reported by jemalloc) - limit
  5. The Worker further calculates how many bytes each domain from the candidate list must evict proportionally based on their sizes. In other words, it calculates S1, S2, and S3 such that S1 + S2 + S3 = S (assuming three domains).
  6. For each domain within the candidate list, the Worker sends a request to evict a specified number of bytes. Domains execute these eviction requests by removing data from their own eviction candidate lists (Nodes). These lists are constructed in a similar manner to the Worker's list.

The above algorithm exhibits several shortcomings, some of which can be immediately addressed, while others may require additional work. These issues are listed below in order of importance:

  1. Random Eviction: The algorithm currently employs random eviction, which is the least efficient method when the working set size (WSS) exceeds the cache size.
  2. Lack of Guarantee for Correct Work: The current algorithm does not ensure that a required amount of memory will be released. For example, if the total size of all domains in the candidate list is less than the requested amount of memory to release, it can potentially lead to Out-of-Memory (OOM) errors or memory swapping.
  3. Biased Candidate Selection: There's a possibility of repeatedly selecting the same domains and nodes if they are frequently restored in memory between consecutive eviction cycles.
  4. Cache Trashing: As mentioned in point 3, the biased candidate selection can result in cache trashing when data is continuously evicted and read back.
  5. Memory Miscalculation: The algorithm does not account for memory allocated by RocksDB, which is reported by jemalloc as part of the process memory usage.
  6. Domain Size Adjustment Bias: Domains that host wide tables and narrow tables often require different size adjustments. Currently, there may be overestimation for wide tables and underestimation for narrow tables. Narrow tables, with their narrow rows, typically incur additional memory overhead compared to wide rows.
  7. Hardcoded Heuristics: The algorithm relies on hardcoded values, such as the candidate list size and size-based filters (e.g., 10%).
  8. Overly Aggressive Eviction: The current algorithm tends to evict more data than necessary due to overestimation of large domain sizes. Additionally, it does not take into account downstream evictions.

    Addressing these shortcomings is essential to enhance the efficiency, reliability, and accuracy of the eviction algorithm.

VladRodionov commented 1 year ago

Data Eviction in a Cache

Caching is a fundamental technique used in computing to optimize data access and retrieval. It involves storing frequently accessed data in a high-speed memory, typically faster but smaller than the main data storage, for quick and efficient retrieval. While caching greatly enhances performance, it introduces the challenge of managing limited cache space effectively. Data eviction, the process of deciding which data to remove from the cache to make room for new data, plays a pivotal role in ensuring the cache functions optimally.

  1. Resource Efficiency: Cache memory is finite and expensive, making efficient use of this resource crucial. Without proper data eviction strategies, caches can quickly become cluttered with obsolete or less frequently used data, wasting valuable memory space.
  2. Performance Optimization: The primary purpose of caching is to accelerate data access. Evicting less relevant or stale data ensures that the cache remains filled with the most frequently accessed and relevant data, leading to faster retrieval times and improved system performance.
  3. Content Freshness: In many applications, data in the cache must be up-to-date. Eviction mechanisms help maintain data freshness by removing outdated information, ensuring that users receive the most current data when requested.
  4. Preventing Cache Pollution: If cache eviction is neglected, it can lead to cache pollution, where rarely accessed data fills the cache and displaces more important, frequently used data. This can significantly degrade system performance.
  5. Adaptation to Changing Access Patterns: Access patterns to data can change over time. Effective eviction strategies can adapt to these changes, ensuring that the cache continues to store the most relevant data as access patterns evolve.
  6. Mitigating Cache Thrashing: Cache thrashing occurs when data is repeatedly evicted and then reloaded into the cache, causing unnecessary overhead. Proper eviction policies help mitigate thrashing by retaining data with more stable access patterns.
  7. Balancing Resources: Data eviction is a trade-off between optimizing cache hit rates and minimizing cache misses. Striking the right balance is critical to maximizing the benefits of caching while minimizing resource consumption.

In conclusion, data eviction is a critical component of caching systems that directly impacts system performance, resource utilization, and user experience. Well-designed eviction policies are essential for ensuring that caches remain efficient, responsive, and adaptive to changing data access patterns, ultimately contributing to the overall effectiveness of a system.

Ideally, a good eviction algorithm should meet all seven of these requirements. However, in practice, there is often a trade-off between functionality, performance, and complexity.

To be continued …

VladRodionov commented 1 year ago

There is some description in http://docs/dataflow/memory_management.html if it is useful to use as a starting point.