spacejam / sled

the champagne of beta embedded databases

Explore Leanstore's design to improve pagecache #1346

Open jon-chuang opened 3 years ago

jon-chuang commented 3 years ago

Proposed Change:

The major changes are as follows:

Thus the major outstanding point (apart from optimistic CC) is probably lean eviction. It accounts for a 4-5x increase in multi-threaded throughput in the LeanStore paper.

Just think... doesn't that mean, possibly, 5B ops in under a minute?

Notes: for ~90% of cache hits the pointer is already swizzled: no indirection, essentially no cost. The remaining ~10% are almost as cheap as a swizzled pointer, with the additional cost of moving the page back into the main buffer, possibly causing another page to be unswizzled (rough sketch below).

LeanStore
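
As a rough illustration of why the swizzled case is essentially free, here is a minimal sketch of the tagged child-reference ("swip") idea from the LeanStore paper. None of these names are sled's; `Swip` and `Page` are made up for illustration:

```rust
// Illustrative only: a 64-bit child reference ("swip") that is either a real
// in-memory pointer (hot, ~90% case: just dereference it, no table lookup) or a
// tagged on-disk page id (cold, ~10% case: resolved via the cold buffer / I/O path).
#[repr(align(2))]
struct Page; // placeholder; a real page struct would live here

#[derive(Clone, Copy)]
struct Swip(u64);

impl Swip {
    // Page allocations are at least 2-byte aligned, so the low bit is free as a tag.
    fn swizzled(p: *mut Page) -> Self { Swip(p as u64) }              // low bit 0 => pointer
    fn unswizzled(page_id: u64) -> Self { Swip((page_id << 1) | 1) }  // low bit 1 => page id
    fn is_swizzled(self) -> bool { self.0 & 1 == 0 }
    fn as_ptr(self) -> *mut Page { self.0 as *mut Page }
    fn page_id(self) -> u64 { self.0 >> 1 }
}
```

Following a swizzled swip is a single pointer chase; only the unswizzled case pays for a hashmap lookup and possible I/O.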

spacejam commented 3 years ago

Hey @jon-chuang, thank you for opening the issue! Indeed, I'm currently working on pointer swizzling, which has caused a lot of architectural changes that I've needed to step back from a few times and model more carefully. It's moving along, though, and I hope to spend a bit of effort stabilizing these changes through intensive testing before cutting a new release that includes them. Pointer swizzling opens up so many cool opportunities all over the place.

In general, caching is a major area I'd like to focus on over the next year. I'm curious to evaluate some trial versions of the LeanStore FIFO, and to see if there are aspects that combine nicely with techniques from w-TinyLFU or in some cases clock (since generating metadata snapshots walks the page table anyway, we may benefit from piggy-backing some work onto this iteration).
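
Since the metadata-snapshot walk already visits every page-table entry, a clock-style second-chance pass could in principle ride along on it. A hypothetical sketch (these types are illustrative, not sled's):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

struct PageTableEntry {
    referenced: AtomicBool, // set to true on every access to the page
    // ... page state, snapshot metadata, etc.
}

/// During the walk that already visits every entry for the metadata snapshot,
/// clear each reference bit and collect entries whose bit was already clear:
/// classic second-chance (clock) eviction candidates, at near-zero extra cost.
fn sweep_for_candidates(table: &[PageTableEntry]) -> Vec<usize> {
    let mut candidates = Vec::new();
    for (idx, entry) in table.iter().enumerate() {
        // swap(false) reads and clears the bit in one atomic step
        if !entry.referenced.swap(false, Ordering::Relaxed) {
            candidates.push(idx);
        }
    }
    candidates
}
```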

io_uring is currently only enabled with a feature flag, but over time I think I'll get rid of the rio dependency and make sled's io_uring usage fully internal, so I can cut out a lot of the unnecessary code and tune things more for sled's specific use case. After that happens, it will also be used for scatter-gathering read fragments.

jon-chuang commented 3 years ago

Hey @spacejam, thanks for the insightful replies. Some thoughts:

Here are some ambiguities I have about the design:

  1. Resizeable cache and memory reclamation: in the same talk linked above, Leis also mentions that simply by maintaining page-update timestamps to avoid read-after-free, one can avoid epoch-based garbage collection, by having each worker thread maintain a thread-local “deleted” hashmap. However, that only works if you never relinquish the buffer memory to the OS.
  2. Unfortunately, this doesn’t play well with my idea of a dynamic cache size. To proceed in this direction, perhaps page buffers should be organised into partitions, and a partition has to be fully reclaimed before its memory is released back to the OS. This is a fairly expensive operation. Unfortunately, since pointers are not doubly linked, it is not possible to unswizzle all pointers pointing into a particular partition... To handle this gracefully, one ought to have a hierarchy of partitions, such that all the pointers into a partition are contained in at most one (or a small number) of other partitions. Then epoch-based reclamation ought to still be pretty cheap and efficient at freeing partitions back to the OS. This is perhaps fine for a naive B+Tree; however, Bw-Tree pointers generally don't follow such a nice locality property: a single page may contain delta update entries from across the entire tree.
  3. Maintain a hashmap of in-flight I/O? The only functionality it adds over plain queues is deduplicating I/O requests, which might not be worth it. (However, the LeanStore paper mentions that dedup must be done for correctness; see the in-flight I/O sketch after this list.)
  4. Should the FIFO be implemented as a linked hashmap? To further reduce lock contention, could we use a sharded hashmap with a linked list in each shard (similar to the current Lru, just with a different policy on access: reswizzling rather than requeueing; see the sharded FIFO sketch after this list)? Thinking about this, it does make sense that one could get a 4-5x speedup: if 10% of the time is spent in the linked hashmap FIFO and 90% of the time is just chasing swizzled pointers, the numbers work out if the latter is 10% of the cost of the former. It also suggests that, given a 10x improvement from pointer swizzling, even if one were to reduce the cold-buffer traffic for hot pages to 1%, the max speedup would be 10x.
  5. What would be the ideal partition size for the buffer?
  6. Operation-aware cache mechanism: scans should place read pages into the cold FIFO rather than the hot buffer, to prevent cache pollution.
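
On (3), a hedged sketch of what deduplicating in-flight reads might look like: the first requester of a cold page becomes the leader and performs the read; later requesters wait on a shared slot instead of issuing a duplicate I/O for the same page id. All names here are illustrative, not sled's:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};

type PageId = u64;
type PageBytes = Vec<u8>;

#[derive(Default)]
struct InFlight {
    done: Mutex<Option<PageBytes>>, // filled in by the leader when the read completes
    cv: Condvar,
}

struct IoDedup {
    pending: Mutex<HashMap<PageId, Arc<InFlight>>>,
}

impl IoDedup {
    fn read(&self, pid: PageId, do_io: impl FnOnce() -> PageBytes) -> PageBytes {
        // register interest; the first requester for this page id becomes the leader
        let (slot, leader) = {
            let mut pending = self.pending.lock().unwrap();
            match pending.get(&pid) {
                Some(slot) => (slot.clone(), false),
                None => {
                    let slot = Arc::new(InFlight::default());
                    pending.insert(pid, slot.clone());
                    (slot, true)
                }
            }
        };
        if leader {
            // the actual SSD read, via the pageid -> (fileid, offset) translation layer
            let bytes = do_io();
            *slot.done.lock().unwrap() = Some(bytes.clone());
            slot.cv.notify_all();
            self.pending.lock().unwrap().remove(&pid);
            bytes
        } else {
            // follower: block until the leader's read completes, then reuse its result
            let mut done = slot.done.lock().unwrap();
            while done.is_none() {
                done = slot.cv.wait(done).unwrap();
            }
            done.as_ref().unwrap().clone()
        }
    }
}
```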
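
And on (4), a minimal sketch of a sharded cold-buffer FIFO (hashmap plus queue per shard). Unlike the current Lru, a hit does not requeue the entry: it removes the page entirely so the caller can reswizzle it into the hot set. Again, purely illustrative names:

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::Mutex;

type PageId = u64;
struct Frame; // placeholder for a cooling page's in-memory state

const SHARDS: usize = 16;

struct Shard {
    queue: VecDeque<PageId>,       // FIFO order: front = oldest = next to evict
    pages: HashMap<PageId, Frame>, // O(1) lookup by page id
}

struct ColdBuffer {
    shards: Vec<Mutex<Shard>>, // one lock per shard to reduce contention
}

impl ColdBuffer {
    fn shard(&self, pid: PageId) -> &Mutex<Shard> {
        &self.shards[pid as usize % SHARDS]
    }

    /// Hit on an unswizzled pointer: pull the page out of the cold buffer so the
    /// caller can reswizzle it (no requeueing, unlike an LRU).
    fn take(&self, pid: PageId) -> Option<Frame> {
        let mut s = self.shard(pid).lock().unwrap();
        let frame = s.pages.remove(&pid)?;
        s.queue.retain(|&p| p != pid); // O(shard len); a real impl would unlink in O(1)
        Some(frame)
    }

    /// A page unswizzled out of the hot set enters the cooling FIFO.
    fn insert(&self, pid: PageId, frame: Frame) {
        let mut s = self.shard(pid).lock().unwrap();
        s.queue.push_back(pid);
        s.pages.insert(pid, frame);
    }

    /// Evict the oldest cooling page in one shard (to be written back / dropped).
    fn evict_oldest(&self, shard_idx: usize) -> Option<(PageId, Frame)> {
        let mut s = self.shards[shard_idx].lock().unwrap();
        let pid = s.queue.pop_front()?;
        s.pages.remove(&pid).map(|frame| (pid, frame))
    }
}
```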

So to clarify, the cost of an eviction to the cold buffer (incurred for ~10% of hot pages, less if leveraging probabilistic LFU methods like TinyLFU; let’s aim for… 1%?):

  1. When encountering an unswizzled pointer: look up the sharded hashmap. If None, the thread/async task serializes the I/O operation into a dedup hashmap, which goes through an SSD translation layer mapping pageid -> (fileid, offset) and then into the I/O queue. If Some(Page), remove the page from the FIFO and the cold buffer hashmap, and reswizzle the pointer. (Rough glue sketch after this list.)
  2. Swizzling new pointers: when requesting new pages or swizzling a page in from disk, first check whether the cold buffer is below its threshold. If it isn't, check the local deletion cache and reclaim freed pages if possible; otherwise, simultaneously unswizzle a page via the probabilistic method vaguely described above (LeanStore chooses not to run this unswizzling asynchronously because it is hard to tune).
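
For concreteness, hypothetical glue for the path in (1), reusing the `ColdBuffer` and `IoDedup` sketches above (all names illustrative; `read_from_ssd` stands in for the pageid -> (fileid, offset) translation plus the actual read):

```rust
enum Resolved {
    Cooling(Frame),    // page was still in the cold buffer: caller reswizzles it
    Loaded(PageBytes), // page had to be read (once, deduplicated) from disk
}

fn resolve_unswizzled(
    pid: PageId,
    cold: &ColdBuffer,
    io: &IoDedup,
    read_from_ssd: impl FnOnce() -> PageBytes,
) -> Resolved {
    if let Some(frame) = cold.take(pid) {
        // cooling hit: removed from the FIFO + hashmap; caller reswizzles the pointer
        Resolved::Cooling(frame)
    } else {
        // miss: issue (or join) a deduplicated read for this page id
        Resolved::Loaded(io.read(pid, read_from_ssd))
    }
}
```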

ben-manes commented 3 years ago

@jon-chuang: before going too deep into implementing an eviction policy, I'd recommend writing an implementation using Caffeine's simulator. This way you can use existing traces or replay your own to compare hit rates across a variety of workloads. As the code is only the raw algorithm (e.g. no concurrency control logic), it will be faster to iterate on until you are satisfied with the tradeoffs. A thorough analysis should help show if your design ideas are robust against scans, pollution, etc before @spacejam dives into the more complex implementation that handles all the other needs of this project.