spacejam / sled

the champagne of beta embedded databases

Explore Leanstore's design to improve pagecache #1346

Open jon-chuang opened 3 years ago

jon-chuang commented 3 years ago

Proposed Change:

The major changes are as follows:

Thus the major outstanding point (apart from optimistic CC) is probably lean eviction. It accounts for a 4-5x increase in multi-threaded throughput in the LeanStore paper.

Just think... doesn't that mean, possibly, 5B ops in under a minute?

Notes: for ~90% of cache hits the pointer is already swizzled: no indirection, essentially no cost. The remaining ~10% are almost as cheap as a swizzled pointer, with the additional cost of moving the page back into the main buffer, possibly causing another page to be unswizzled (rough sketch below).

LeanStore
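
As a rough illustration of why the swizzled case is essentially free, here is a minimal sketch of the tagged child-reference ("swip") idea from the LeanStore paper. None of these names are sled's; `Swip` and `Page` are made up for illustration:

```rust
// Illustrative only: a 64-bit child reference ("swip") that is either a real
// in-memory pointer (hot, ~90% case: just dereference it, no table lookup) or a
// tagged on-disk page id (cold, ~10% case: resolved via the cold buffer / I/O path).
#[repr(align(2))]
struct Page; // placeholder; a real page struct would live here

#[derive(Clone, Copy)]
struct Swip(u64);

impl Swip {
    // Page allocations are at least 2-byte aligned, so the low bit is free as a tag.
    fn swizzled(p: *mut Page) -> Self { Swip(p as u64) }              // low bit 0 => pointer
    fn unswizzled(page_id: u64) -> Self { Swip((page_id << 1) | 1) }  // low bit 1 => page id
    fn is_swizzled(self) -> bool { self.0 & 1 == 0 }
    fn as_ptr(self) -> *mut Page { self.0 as *mut Page }
    fn page_id(self) -> u64 { self.0 >> 1 }
}
```

Following a swizzled swip is a single pointer chase; only the unswizzled case pays for a hashmap lookup and possible I/O.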

spacejam commented 3 years ago

Hey @jon-chuang, thank you for opening the issue! Indeed, I'm currently working on pointer swizzling, which has caused a lot of architectural changes that I've needed to step back from a few times and model more carefully. It's moving along, though, and I hope to spend a bit of effort stabilizing these changes through intensive testing before cutting a new release that includes them. Pointer swizzling opens up so many cool opportunities all over the place.

In general, caching is a major area I'd like to focus on over the next year. I'm curious to evaluate some trial versions of the LeanStore FIFO, and to see if there are aspects that combine nicely with techniques from w-TinyLFU or in some cases clock (since generating metadata snapshots walks the page table anyway, we may benefit from piggy-backing some work onto this iteration).
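
Since the metadata-snapshot walk already visits every page-table entry, a clock-style second-chance pass could in principle ride along on it. A hypothetical sketch (these types are illustrative, not sled's):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

struct PageTableEntry {
    referenced: AtomicBool, // set to true on every access to the page
    // ... page state, snapshot metadata, etc.
}

/// During the walk that already visits every entry for the metadata snapshot,
/// clear each reference bit and collect entries whose bit was already clear:
/// classic second-chance (clock) eviction candidates, at near-zero extra cost.
fn sweep_for_candidates(table: &[PageTableEntry]) -> Vec<usize> {
    let mut candidates = Vec::new();
    for (idx, entry) in table.iter().enumerate() {
        // swap(false) reads and clears the bit in one atomic step
        if !entry.referenced.swap(false, Ordering::Relaxed) {
            candidates.push(idx);
        }
    }
    candidates
}
```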

io_uring is currently only enabled with a feature flag, but over time I think I'll get rid of the rio dependency and make sled's io_uring usage fully internal, so I can cut out a lot of the unnecessary code and tune things more for sled's specific use case. After that happens, it will also be used for scatter-gathering read fragments.

jon-chuang commented 3 years ago

Hey @spacejam, thanks for the insightful replies. Some thoughts:

Here are some ambiguities I have about the design:

  1. Resizeable cache and memory reclamation: in the same talk linked above, Leis also mentions that simply by maintaining page-update timestamps to avoid read-after-free, one can avoid epoch-based garbage collection, by having each worker thread maintain a thread-local “deleted” hashmap. However, that only works if you never relinquish the buffer memory to the OS.
  2. Unfortunately, this doesn’t play well with my idea of a dynamic cache size. To proceed in this direction, perhaps page buffers should be organised into partitions, and a partition has to be fully reclaimed before its memory is released back to the OS. This is a fairly expensive operation. Unfortunately, since pointers are not doubly linked, it is not possible to unswizzle all pointers pointing into a particular partition... To handle this gracefully, one ought to have a hierarchy of partitions, such that all the pointers into a partition are contained in at most one (or a small number) of other partitions. Then epoch-based reclamation ought to still be pretty cheap and efficient at freeing partitions back to the OS. This is perhaps fine for a naive B+Tree; however, Bw-Tree pointers generally don't follow such a nice locality property: a single page may contain delta update entries from across the entire tree.
  3. Maintain a hashmap of in-flight I/O? The only functionality it adds over plain queues is deduplicating I/O requests, which might not be worth it. (However, the LeanStore paper mentions that dedup must be done for correctness; see the in-flight I/O sketch after this list.)
  4. Should the FIFO be implemented as a linked hashmap? To further reduce lock contention, could we use a sharded hashmap with a linked list in each shard (similar to the current Lru, just with a different policy on access: reswizzling rather than requeueing; see the sharded FIFO sketch after this list)? Thinking about this, it does make sense that one could get a 4-5x speedup: if 10% of the time is spent in the linked hashmap FIFO and 90% of the time is just chasing swizzled pointers, the numbers work out if the latter is 10% of the cost of the former. It also suggests that, given a 10x improvement from pointer swizzling, even if one were to reduce the cold-buffer traffic for hot pages to 1%, the max speedup would be 10x.
  5. What would be the ideal partition size for the buffer?
  6. Operation-aware cache mechanism: scans should place read pages into the cold FIFO rather than the hot buffer, to prevent cache pollution.
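
On (3), a hedged sketch of what deduplicating in-flight reads might look like: the first requester of a cold page becomes the leader and performs the read; later requesters wait on a shared slot instead of issuing a duplicate I/O for the same page id. All names here are illustrative, not sled's:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};

type PageId = u64;
type PageBytes = Vec<u8>;

#[derive(Default)]
struct InFlight {
    done: Mutex<Option<PageBytes>>, // filled in by the leader when the read completes
    cv: Condvar,
}

struct IoDedup {
    pending: Mutex<HashMap<PageId, Arc<InFlight>>>,
}

impl IoDedup {
    fn read(&self, pid: PageId, do_io: impl FnOnce() -> PageBytes) -> PageBytes {
        // register interest; the first requester for this page id becomes the leader
        let (slot, leader) = {
            let mut pending = self.pending.lock().unwrap();
            match pending.get(&pid) {
                Some(slot) => (slot.clone(), false),
                None => {
                    let slot = Arc::new(InFlight::default());
                    pending.insert(pid, slot.clone());
                    (slot, true)
                }
            }
        };
        if leader {
            // the actual SSD read, via the pageid -> (fileid, offset) translation layer
            let bytes = do_io();
            *slot.done.lock().unwrap() = Some(bytes.clone());
            slot.cv.notify_all();
            self.pending.lock().unwrap().remove(&pid);
            bytes
        } else {
            // follower: block until the leader's read completes, then reuse its result
            let mut done = slot.done.lock().unwrap();
            while done.is_none() {
                done = slot.cv.wait(done).unwrap();
            }
            done.as_ref().unwrap().clone()
        }
    }
}
```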
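
And on (4), a minimal sketch of a sharded cold-buffer FIFO (hashmap plus queue per shard). Unlike the current Lru, a hit does not requeue the entry: it removes the page entirely so the caller can reswizzle it into the hot set. Again, purely illustrative names:

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::Mutex;

type PageId = u64;
struct Frame; // placeholder for a cooling page's in-memory state

const SHARDS: usize = 16;

struct Shard {
    queue: VecDeque<PageId>,       // FIFO order: front = oldest = next to evict
    pages: HashMap<PageId, Frame>, // O(1) lookup by page id
}

struct ColdBuffer {
    shards: Vec<Mutex<Shard>>, // one lock per shard to reduce contention
}

impl ColdBuffer {
    fn shard(&self, pid: PageId) -> &Mutex<Shard> {
        &self.shards[pid as usize % SHARDS]
    }

    /// Hit on an unswizzled pointer: pull the page out of the cold buffer so the
    /// caller can reswizzle it (no requeueing, unlike an LRU).
    fn take(&self, pid: PageId) -> Option<Frame> {
        let mut s = self.shard(pid).lock().unwrap();
        let frame = s.pages.remove(&pid)?;
        s.queue.retain(|&p| p != pid); // O(shard len); a real impl would unlink in O(1)
        Some(frame)
    }

    /// A page unswizzled out of the hot set enters the cooling FIFO.
    fn insert(&self, pid: PageId, frame: Frame) {
        let mut s = self.shard(pid).lock().unwrap();
        s.queue.push_back(pid);
        s.pages.insert(pid, frame);
    }

    /// Evict the oldest cooling page in one shard (to be written back / dropped).
    fn evict_oldest(&self, shard_idx: usize) -> Option<(PageId, Frame)> {
        let mut s = self.shards[shard_idx].lock().unwrap();
        let pid = s.queue.pop_front()?;
        s.pages.remove(&pid).map(|frame| (pid, frame))
    }
}
```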

So to clarify, the cost of an eviction to the cold buffer (incurred for ~10% of hot pages, less if leveraging probabilistic LFU methods like TinyLFU; let’s aim for… 1%?):

  1. When encountering an unswizzled pointer: look up the sharded hashmap. If None, the thread/async task serializes the I/O operation into a dedup hashmap, which goes through an SSD translation layer mapping pageid -> (fileid, offset) and then into the I/O queue. If Some(Page), remove the page from the FIFO and the cold buffer hashmap, and reswizzle the pointer. (Rough glue sketch after this list.)
  2. Swizzling new pointers: when requesting new pages or swizzling a page in from disk, first check whether the cold buffer is below its threshold. If it isn't, check the local deletion cache and reclaim freed pages if possible; otherwise, simultaneously unswizzle a page via the probabilistic method vaguely described above (LeanStore chooses not to run this unswizzling asynchronously because it is hard to tune).
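
For concreteness, hypothetical glue for the path in (1), reusing the `ColdBuffer` and `IoDedup` sketches above (all names illustrative; `read_from_ssd` stands in for the pageid -> (fileid, offset) translation plus the actual read):

```rust
enum Resolved {
    Cooling(Frame),    // page was still in the cold buffer: caller reswizzles it
    Loaded(PageBytes), // page had to be read (once, deduplicated) from disk
}

fn resolve_unswizzled(
    pid: PageId,
    cold: &ColdBuffer,
    io: &IoDedup,
    read_from_ssd: impl FnOnce() -> PageBytes,
) -> Resolved {
    if let Some(frame) = cold.take(pid) {
        // cooling hit: removed from the FIFO + hashmap; caller reswizzles the pointer
        Resolved::Cooling(frame)
    } else {
        // miss: issue (or join) a deduplicated read for this page id
        Resolved::Loaded(io.read(pid, read_from_ssd))
    }
}
```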

ben-manes commented 3 years ago

@jon-chuang: before going too deep into implementing an eviction policy, I'd recommend writing an implementation using Caffeine's simulator. This way you can use existing traces or replay your own to compare hit rates across a variety of workloads. As the code is only the raw algorithm (e.g. no concurrency control logic), it will be faster to iterate on until you are satisfied with the tradeoffs. A thorough analysis should help show if your design ideas are robust against scans, pollution, etc before @spacejam dives into the more complex implementation that handles all the other needs of this project.