spacejam / sled

the champagne of beta embedded databases

in-memory LockFreeBtree<K, V> #62

Closed: spacejam closed this issue 5 years ago

ghost commented 7 years ago

I'd be very interested in including a lock-free BTreeMap in Crossbeam at some point in the future. I've given this topic a lot of thought and have been intending to start implementing a Bw-Tree for a long time, so let's share notes. If possible, we should probably avoid duplicating our efforts. :)

I think it's possible to abstract a Bw-Tree into a separate crate, and then expose a completely in-memory API for use as a concurrent BTreeMap (and provide it as the crossbeam-bwtree crate), and finally plug it into the rest of rsdb as the index engine. I understand that your project is already very ambitious and you have limited time resources, but would you be interested in working towards this goal?

Here are some of my thoughts on a completely in-memory Bw-Tree (let's call it BwTreeMap<K, V>). Please consider everything I'm writing as just sketches and ideas I've been keeping in my mind rather than a solid well-thought-out plan.

1. Types

In the standard library we have BTreeMap<K: Ord, V>, and ideally we'd like a data structure that behaves very similarly and provides a similar API, but is lock-free and allows concurrent mutation. Since the Bw-Tree is a B+-tree, we have an additional bound: K: Clone (inner nodes store copies of keys as separators).
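To make the target concrete, here's a minimal API sketch of what such a map could look like. This is purely hypothetical (not an existing crate); the important points are that all methods take &self and that K: Clone appears in the bounds:

```rust
use std::marker::PhantomData;

// Hypothetical sketch only: the intended API surface, mirroring
// std::collections::BTreeMap but allowing mutation through &self.
// `K: Clone` is needed because inner B+-tree nodes store copies of
// keys as separators.
pub struct BwTreeMap<K: Ord + Clone, V> {
    _marker: PhantomData<(K, V)>,
}

impl<K: Ord + Clone, V: Clone> BwTreeMap<K, V> {
    /// Insert a key-value pair, returning the previous value if any.
    pub fn insert(&self, _key: K, _value: V) -> Option<V> {
        unimplemented!("sketch only")
    }

    /// Look a key up. A clone is returned here because a reference
    /// into a lock-free structure would need an epoch guard to stay valid.
    pub fn get(&self, _key: &K) -> Option<V> {
        unimplemented!("sketch only")
    }

    /// Remove a key, returning its value if it was present.
    pub fn remove(&self, _key: &K) -> Option<V> {
        unimplemented!("sketch only")
    }
}
```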

I noticed that in rsdb keys are of type Vec<u8> (perhaps they could even be abstracted as K: AsRef<[u8]>) - essentially, just byte arrays. Is there a reason why you're limiting the API to byte arrays only? I'm aware that there are numerous benefits to representing keys as byte arrays in databases, but would it hurt to generalize the key type?

2. Page Table

In rsdb you implemented a radix tree as the page table storage. The page table is one of the crucial tricks behind the Bw-Tree, and it took me a long time to realize that having the table as an array, hash table, or some kind of tree is completely unnecessary.

In BwTreeMap we don't really need a complicated, memory-hungry (and slow!) page mapping data structure to manage pages. A page table is very useful when there is a need to move pages between storage layers (memory and disk), but otherwise it's just a level of indirection that enables some lock-free operations.

In BwTreeMap, instead of inserting a new ID into the page mapping table, we'd simply allocate an atomic pointer on the heap (akin to Box<AtomicPtr<Node>>), which would act as a node in the tree pointing to another node. Similarly, instead of removing a slot from the table, we'd deallocate this node from the heap.

In other words, IDs are just pointers to these special nodes, and the mapping from an ID to another node lives inside the heap-allocated node itself.

I think the Bw-Tree should allow configuring the page table implementation (the API might be: BwTree<K: Key, V: Value, PT: PageTable>). Then, BwTreeMap would use a simple "page table" that just allocates and deallocates nodes on the heap, and rsdb would use something more complicated (a radix tree?).
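As a rough illustration of that degenerate "page table", here is a minimal sketch; Node and the slot type are invented names, and memory reclamation is ignored:

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node; // placeholder for a real tree node

// For the purely in-memory tree, a "page ID" is just a pointer to a
// heap-allocated atomic slot, and "mapping" an ID to a node is a plain
// atomic load or CAS on that slot.
struct InMemorySlot {
    ptr: AtomicPtr<Node>,
}

impl InMemorySlot {
    // "Insert a new ID into the page table": allocate a slot on the heap.
    fn allocate(node: *mut Node) -> Box<InMemorySlot> {
        Box::new(InMemorySlot { ptr: AtomicPtr::new(node) })
    }

    // "Look an ID up": follow the indirection with an atomic load.
    fn load(&self) -> *mut Node {
        self.ptr.load(Ordering::Acquire)
    }

    // Lock-free replacement of the node behind the ID, as a Bw-Tree
    // does when consolidating or prepending deltas.
    fn cas(&self, current: *mut Node, new: *mut Node) -> bool {
        self.ptr
            .compare_exchange(current, new, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }
}
```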

3. Iterators

Unlike the typically quick operations on a Bw-Tree (e.g. insert/search/remove), iterators may be long-lived. This is a problem in the context of memory reclamation: holding a thread pinned for the entire duration of iteration might block garbage collection for an unacceptably long time. I wonder how you're going to solve this problem.

My idea was to keep a reference count in each node of the tree. That would allow us to hold a reference to a node even after unpinning the current thread. The key idea is that we increment the reference count of a node just before unpinning, and then use the node outside the pinned region of code. All nodes start with a reference count of one. When removing a node from the tree, we just decrement the count. Whoever decrements the count to zero has to add the node as a piece of garbage to the epoch GC.
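A minimal sketch of that counting scheme, assuming an epoch GC (like crossbeam's) underneath; all names are hypothetical:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Sketch of the lifecycle described above: every node starts with a
// count of one (the tree's own reference). Iterators bump the count
// just before unpinning; removal from the tree decrements it; whoever
// hits zero must defer-destroy the node through the epoch GC.
struct Node {
    refs: AtomicUsize,
    // ... keys, values, child links ...
}

impl Node {
    fn new() -> Node {
        Node { refs: AtomicUsize::new(1) } // the tree's own reference
    }

    fn acquire(&self) {
        self.refs.fetch_add(1, Ordering::Relaxed);
    }

    /// Returns true when the last reference was dropped; the caller
    /// is then responsible for handing the node to the epoch GC.
    #[must_use]
    fn release(&self) -> bool {
        self.refs.fetch_sub(1, Ordering::Release) == 1
    }
}
```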

4. Thoughts on ART

Finally, I wonder what you think about adaptive radix trees. In my benchmarks, an ART is typically several times faster than a B-tree for insert/search/delete, and several times slower for iteration. If your keys are byte arrays, ART can be a fantastic data structure. Having an ART in Crossbeam is another long-term goal of mine.

Servo uses a concurrent string cache for interning (it's just a cache of strings that maps every unique string to a unique integer ID), which is based on a sharded, mutex-protected hash table. I think some performance gains might be achieved by switching to an ART.

ART is probably not a good fit for rsdb as the index data structure, though, since it's more difficult to seamlessly move nodes between memory and disk, keys have to be byte arrays, iteration is slightly worse, and it's not strictly speaking lock-free. But I'm still curious if you have any thoughts on this. :)

Is your page mapping table perhaps going to be an ART?

cc @aturon - In case you're interested, this is the cool database project I mentioned to you last week, and the Bw-Tree is the concurrent B+-tree behind it.

spacejam commented 7 years ago

Background: I began this project as part of my larger (currently idle) rasputin distributed database. The primary goal is a super fast persistent KV store that I am familiar enough with to bend to the whims of the problems I face in stateful distributed systems, and, finally, a persistent db for the rust community with a strong focus on reliability.

I'm super interested in collaborating on a lock-free tree for crossbeam!

  1. RE: Vec<u8>: this is purely a consequence of the project starting out as a simple persistent key-value store, although it would be possible to alter the types to K: Ord + Clone + Deserialize + Serialize and V: Deserialize + Serialize. If I had started with the goal of a non-persistent tree, Tree<K: Ord + Clone, V> would be much more appropriate!

  2. The pagetable is only necessary in the context of a pagecache. jemalloc already gives us the necessary lock-free radix tree for purely in-memory workloads :) All references to PageID can basically be replaced with a Ptr<Stack<tree::Frag>>, and you can rip out all of the Radix, CacheEntry, and log-related stuff! Calls to PageCache::merge (being renamed from prepend) can be replaced with Stack::cap, and calls to PageCache::set (being renamed from replace) can be replaced with Stack::cas. I think it should be a really straightforward gutting. One important thing to note: I've skipped node merging in the tree implementation so far, so that is a small-to-medium piece of work that remains after the fast gutting.

For this to be done in a generic way that plays nicely with the underlying pagecache, we just need to be able to choose either the in-memory straight-to-Ptr<Stack<tree::Frag>> path or the persistent PageCache::... call, and have the reference type be generic (Ptr or PageID); there's a rough trait sketch after this list. The PageCache (with the same BLinkMaterializer) is usable as long as the K and V types are also Serialize + Deserialize.

  3. I've been punting on the issue of long-running scans. I've just spent an hour or so thinking about the interactions between iteration, RC, and merges/splits, jumping from skepticism to over-optimization to agreement:

I like this for supporting long iterations directly on the tree! A single RC on a wide node is pretty low overhead. This sounds great to me after thinking about it!

  4. To be honest, I'm not very familiar with ARTs! Thanks for the info about them! They sound really interesting; I'm going to do some research and see if I could apply them to this somewhere.
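Returning to point 2, here is a rough sketch of what that pluggable indirection could look like. The names and signatures are illustrative guesses, not sled's actual API:

```rust
// The tree would be written against this trait; the in-memory backend
// would implement it with raw heap pointers (Ptr<Stack<Frag>>), and the
// persistent backend with PageIDs resolved through the PageCache.
trait PageTable {
    type PageRef: Copy; // Ptr in memory, PageID on disk
    type Frag;          // a delta fragment or a consolidated page

    /// Install a brand-new page, returning a reference to it.
    fn allocate(&self, initial: Self::Frag) -> Self::PageRef;

    /// Try to prepend a delta (Stack::cap in memory, PageCache::merge
    /// on the persistent side); false means the CAS lost a race.
    fn prepend(&self, page: Self::PageRef, frag: Self::Frag) -> bool;

    /// Try to replace the whole page (Stack::cas in memory,
    /// PageCache::set on the persistent side).
    fn replace(&self, page: Self::PageRef, new: Self::Frag) -> bool;
}
```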

So, I'm definitely up for combining forces for a crossbeam-bwtree project with a pluggable pagecache! How would you like to move forward?

ghost commented 7 years ago

How would you like to move forward?

I've been thinking about this for a long time, but I'm still not really sure how to move forward, at least not right now. :) I'd just suggest keeping in mind that we eventually want to expose the tree as a purely in-memory data structure.

My plan is to experiment a little and try coding up a simple prototype of a Bw-Tree, focusing more on the low-level implementation details than on the high-level tree design (consolidations, supporting all operations, etc.). More specifically, I'm trying to answer questions like:

spacejam commented 7 years ago

@stjepang awesome, I'm really looking forward to seeing what you come up with!

RE: dynamic nodes with few indirections, it may be hard to beat a Vec<u8> where the first N bytes represent a header referencing various interesting subfields. We don't care about page alignment (at the log level we do when we use O_DIRECT, but we can more or less ignore it up in the tree for the medium term). This gets a little uglier with non-Vec keys and values, but it's still feasible. I think I'll implement this soonish in this crate.
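A sketch of that flat layout; the header fields, widths, and offsets here are invented purely for illustration:

```rust
use std::convert::TryInto;

// One contiguous buffer per node: a small fixed-size header of
// little-endian u32 offsets, followed by the raw key/value bytes.
struct FlatNode {
    buf: Vec<u8>,
}

// hypothetical header layout, offsets relative to the buffer start:
//   [0..4)   entry count
//   [4..8)   byte offset of the key-offset array
//   [8..12)  byte offset of the value-offset array
//   [12..16) byte offset of the raw key/value data
impl FlatNode {
    fn read_u32(&self, at: usize) -> u32 {
        u32::from_le_bytes(self.buf[at..at + 4].try_into().unwrap())
    }

    fn entry_count(&self) -> u32 {
        self.read_u32(0)
    }

    fn key_offsets_start(&self) -> usize {
        self.read_u32(4) as usize
    }
}
```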

For non-clone/copy keys, this may not be the way forward, as you'll need a reference to a single instance. Also keep in mind that the inserted keys that were the source of node min/max bounds may have been deleted, so you may need a ref count on the underlying key. There's also a choice of how you handle re-inserting an equivalent key: do you keep the indices and node bounds referring to different but equivalent key objects, or do you try to coalesce these somehow?

I haven't thought of a better solution to the iterator problem than your RC idea above. I'll definitely let you know if something else pops into my head, but I think that approach will give you some good mileage.

Broken Ord sounds like a nightmare haha... Very much looking forward to seeing what you come up with :)

ghost commented 7 years ago

Another paper you might find interesting: A Comparison of Adaptive Radix Trees and Hash Tables.

It compares the performance of adaptive radix trees, hash tables, Judy arrays, and B+-trees.

spacejam commented 7 years ago

@stjepang ahh cool, thanks for the paper! The current naive radix implementation is getting more and more significant on the flamegraphs, and pretty soon I think I'll give an ART implementation a go! This paper will be really useful for that. The current one has something like a 1:64 pointer:data ratio, and even though the traversal/insertion code is fairly minimal and probably nice to the L0 code cache, I wonder how much that node bloat messes up the overall performance.

jeehoonkang commented 6 years ago

To support long-lived pointers such as iterators in Crossbeam, I have implemented a very preliminary version of a hybrid of EBR (epoch-based reclamation) and HP (hazard pointers). The high-level idea of this hybrid is to use HP for long-lived pointers and EBR for short-lived pointers, so that long-lived ones do not hinder epoch advancement, while short-lived ones enjoy low access latency. This hybrid is very much inspired by Snowflake.

Though I'm not sure HP is actually helpful for iteration workloads. First, maybe it's not costly to count the references to a node; I'd like to produce some benchmark results. Second, HP protects only a handful of pointers, while a user may want to iterate over a large set of data. In reference counting, you're going to mark all the nodes first and then iterate over them later, right?

ghost commented 6 years ago

Second, HP protects only a handful of pointers, while a user may want to iterate over a large set of data. In reference counting, you're going to mark all the nodes first and then iterate over them later, right?

Not really. The idea is to increment and decrement reference counts as you go from one node to another.

The same applies to hazard pointers. For example, if you want to iterate over a long chain of 1000 nodes, you only need two hazard pointers to do that.
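Concretely, the traversal protects at most two nodes at a time, hand-over-hand. Here is a sketch using the reference counts discussed above, with a plain owned list standing in for the tree's real atomic links:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// A chain of nodes; `next` is a plain owned link here, standing in for
// the atomic right-links of the real tree.
struct Node {
    refs: AtomicUsize,
    value: u64,
    next: Option<Box<Node>>,
}

impl Node {
    fn acquire(&self) {
        self.refs.fetch_add(1, Ordering::Relaxed);
    }
    fn release(&self) {
        // In the real structure, dropping to zero would hand the node
        // to the epoch GC; this sketch only tracks the count.
        self.refs.fetch_sub(1, Ordering::Release);
    }
}

// Iterate a chain of any length while protecting at most two nodes:
// grab the successor before letting go of the current node.
fn iterate(head: &Node) {
    let mut current = head;
    current.acquire(); // protect the first node before unpinning
    loop {
        println!("visiting {}", current.value);
        match current.next.as_deref() {
            Some(next) => {
                next.acquire();    // protect the successor first,
                current.release(); // then drop the predecessor
                current = next;
            }
            None => {
                current.release();
                break;
            }
        }
    }
}
```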

spacejam commented 6 years ago

Hey @stjepang, have you played around with a lock-free ART in rust yet? I'm starting to think more about using one to replace the naive radix tree here in sled.

ghost commented 6 years ago

I haven't, but my friend implemented this, which should be a good starting point for a concurrent ART.

spacejam commented 6 years ago

@stjepang ahh cool, I pinged Andy Pavlo a couple of days ago, and he was kind enough to provide some recommendations. He also mentioned the ART, and I've kicked off a few mental threads around it.

So, there is a pattern in use in sled, inspired by the LLAMA paper, for keeping lock-free data structures synchronized with a log-structured persistent store in terms of linearizing updates: we first reserve a certain amount of space in the log, then do a CAS on the in-memory structure, and then, based on the success of the CAS, write either a "failed flush" marker or the actual update data into the reserved log buffer. I have been wondering about the performance of using this to create a persistent ART index for point queries.
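Schematically, the ordering looks like this; the Log and Reservation types below are stand-ins for sled's real machinery, and all names are hypothetical:

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node;
struct Log;
struct Reservation;

impl Log {
    // Reserve `len` bytes of log space; the slot must be filled either way.
    fn reserve(&self, _len: usize) -> Reservation {
        Reservation
    }
}

impl Reservation {
    fn complete(self, _bytes: &[u8]) { /* write the real update */ }
    fn abort(self) { /* write a "failed flush" marker instead */ }
}

fn update(log: &Log, slot: &AtomicPtr<Node>, old: *mut Node, new: *mut Node, bytes: &[u8]) {
    // 1. Reserve log space *before* the in-memory operation.
    let reservation = log.reserve(bytes.len());

    // 2. Linearize the update with a CAS on the in-memory structure.
    let won = slot
        .compare_exchange(old, new, Ordering::AcqRel, Ordering::Acquire)
        .is_ok();

    // 3. Fill the reservation according to the CAS outcome, so that log
    //    order agrees with the order in which CASes actually succeeded.
    if won {
        reservation.complete(bytes);
    } else {
        reservation.abort();
    }
}
```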

jeehoonkang commented 6 years ago

@spacejam In fact, one of my mid-term goals is pagecache + persistent memory :) I really enjoyed Andy's recent work, and glad that you already contacted him!

I'm also thinking of an in-memory lock-free ART. It is very... intriguing, but I hope we will find a way. I haven't thought about a persistent ART, but it'll be very interesting if the page index for a log-structured storage (LSS) is also persisted in the LSS itself! I'm curious how to break this seeming circularity, though.

spacejam commented 6 years ago

@jeehoonkang that sounds awesome! If you're going for optimization, these two tunables are probably the biggest ones to play with for your workloads:

For hosting on the LSS, that's what sled is doing now. Even if the checkpoint gets deleted, it will scan the storage, record which page fragments are stored at which log offsets, and use the recover functionality in the BLinkMaterializer to track root page changes. Then, when requests come in, it just starts at the root page and uses the recorded log offsets to materialize pages while working down the tree. I think the same approach could be used directly for a persistent ART. I'm skeptical about the persistent version being able to beat the Bw-tree in terms of space efficiency or read latency when not loaded into memory completely, but if we don't evict nodes from cache it might be good. In any case, it would probably not be the hardest thing to build a prototype of, since pagecache is now split out :)

ghost commented 6 years ago

An interesting new paper was published:

Building a Bw-Tree Takes More Than Just Buzz Words

spacejam commented 5 years ago

While this seems interesting and possibly something I'd like to do in the future, I'm going to close this because it's not relevant to the sled project itself. FWIW, it's possible to create this by basically just chopping out the pagetable completely in sled, and this might even be a nice exercise to see how small the tree implementation can become when it doesn't have a pagecache underneath to broker atomic operations.

spacejam commented 11 months ago

implemented as the concurrent-map crate :)

spacejam commented 11 months ago

and a non-concurrent, fixed-length ART: the art crate