Background: I began this project as part of my larger (currently idle) rasputin distributed database. The primary goal is to provide a super fast persistent KV store that I am familiar enough with to bend to the whims of the problems I face in stateful distributed systems, and to finally give the rust community a persistent db with a strong focus on reliability.
I'm super interested in collaborating on a lock-free tree for crossbeam!
RE: `Vec<u8>`. This is purely a consequence of the project starting out with the focus of being a simple persistent key-value store; it would be possible to alter the types to `K: Ord + Clone + Deserialize + Serialize` and `V: Deserialize + Serialize`. If I had started with the goal of a non-persistent tree, `Tree<K: Ord + Clone, V>` would be much more appropriate!
The pagetable is only necessary in the context of a pagecache. jemalloc already gives us the necessary lock-free radix tree for purely in-memory workloads :) All references to `PageID` can basically be replaced with a `Ptr<Stack<tree::Frag>>`, and you can rip out all of the `Radix`, `CacheEntry`, and log-related stuff! Calls to `PageCache::merge` (being renamed from `prepend`) can be replaced with `Stack::cap`, and calls to `PageCache::set` (being renamed from `replace`) can be replaced with `Stack::cas`. I think it should be a really straightforward gutting. One important thing to note is that I've skipped node merging in the tree implementation so far, so that is a small-to-medium piece of work that remains after the fast gutting.
For this to be done in a generic way that plays with the underlying pagecache, we just need to be able to choose either the in-memory straight-to-`Ptr<Stack<tree::Frag>>` or the persistent `PageCache::...` call, and have the reference type be generic (`Ptr` or `PageID`). And the PageCache (with the same BLinkMaterializer) is usable as long as the K and V types are also `Serialize + Deserialize`.
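Roughly, the shape I have in mind is something like this (just a sketch with invented names, not anything that exists in sled or crossbeam today):

```rust
// Hypothetical sketch only: a trait abstracting "where a page chain lives",
// so the same tree logic can target either an in-memory Ptr<Stack<Frag>>
// or a persistent PageID managed by the pagecache. All names are invented.
trait Indirection {
    /// Either a pointer-like handle (in-memory) or a PageID (persistent).
    type Handle: Copy;

    /// Allocate a fresh, empty page chain and return its handle.
    fn allocate(&self) -> Self::Handle;

    /// Append a delta frag to the chain behind `handle`
    /// (would map to Stack::cap in memory, PageCache::merge on disk).
    fn append(&self, handle: Self::Handle, frag: Vec<u8>) -> Result<(), ()>;

    /// Swap the entire chain behind `handle` for a consolidated page
    /// (would map to Stack::cas in memory, PageCache::set on disk).
    fn replace(&self, handle: Self::Handle, page: Vec<u8>) -> Result<(), ()>;
}
```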
`defer_drop` by the thread that completes phase 3 of the merge? No, then the iterators may have references to dropped memory. I like this for supporting long iterations directly on the tree! A single RC on a wide node is pretty low overhead. This sounds great to me after thinking about it!
So, I'm definitely up for combining forces for a `crossbeam-bwtree` project with a pluggable pagecache! How would you like to move forward?
> How would you like to move forward?
I was thinking about this for a long time, but I'm still not really sure how to move forward, at least not right now. :) I'd just suggest keeping in mind that we eventually want to expose the tree as a purely in-memory data structure.
My plan is to experiment a little bit and try coding up a simple prototype of a Bw-Tree, focusing more on the low-level implementation details than the high level tree design (consolidations, supporting all operations, etc.). More specifically, I'm trying to answer questions like:
- `Clone` or `Copy` bounds on keys/values? Can we avoid them altogether?
- Is `K: Ord` safe even with a broken `Ord` implementation?

@stjepang awesome, I'm really looking forward to seeing what you come up with!
RE: dynamic nodes with few indirections, it may be hard to beat a `Vec<u8>` where the first N bytes represent a header referencing various interesting subfields. We don't care about page alignment (at the log level we do when we use O_DIRECT, but we can more or less ignore it up in the tree for the medium term). This gets a little uglier with non-`Vec<u8>` keys.
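As a rough illustration (the specific header fields here are invented for the example, not sled's actual node format):

```rust
// Sketch: a node packed into a single Vec<u8>, with a small fixed-size header
// of little-endian u32 offsets pointing at subfields later in the same buffer.
fn encode_node(lo_key: &[u8], hi_key: &[u8], items: &[u8]) -> Vec<u8> {
    let lo_off = 12u32; // header is 3 * 4 bytes
    let hi_off = lo_off + lo_key.len() as u32;
    let items_off = hi_off + hi_key.len() as u32;

    let mut buf = Vec::with_capacity(items_off as usize + items.len());
    buf.extend_from_slice(&lo_off.to_le_bytes());
    buf.extend_from_slice(&hi_off.to_le_bytes());
    buf.extend_from_slice(&items_off.to_le_bytes());
    buf.extend_from_slice(lo_key);
    buf.extend_from_slice(hi_key);
    buf.extend_from_slice(items);
    buf
}

fn read_u32(buf: &[u8], at: usize) -> u32 {
    u32::from_le_bytes([buf[at], buf[at + 1], buf[at + 2], buf[at + 3]])
}

// Accessing a subfield is just slicing the same buffer: one allocation,
// no pointer chasing.
fn lo_key(node: &[u8]) -> &[u8] {
    let lo = read_u32(node, 0) as usize;
    let hi = read_u32(node, 4) as usize;
    &node[lo..hi]
}
```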
For non-clone/copy keys, this may not be the way forward, as you'll need a reference to a single instance. Also keep in mind that the inserted keys that were the source of node min/max bounds may have been deleted, so you may need a ref count on the underlying key. There's also a choice of how you handle re-inserting an equivalent key: do you keep the indices and node bounds referring to different, equivalent key objects, or do you try to coalesce these somehow?
I haven't thought of a better solution to the iterator problem than your RC idea above. I'll definitely let you know if something else pops into my head, but I think that approach will give you some good mileage.
Broken `Ord` sounds like a nightmare haha... Very much looking forward to seeing what you come up with :)
Another paper you might find interesting: *A Comparison of Adaptive Radix Trees and Hash Tables*. It compares the performance of adaptive radix trees, hash tables, Judy arrays, and B+ trees.
@stjepang ahh cool, thanks for the paper! The current naive radix implementation is getting more and more significant on the flamegraphs, and pretty soon I think I'll give an ART implementation a go! This paper will be really useful for that. The current one has like a 1:64 pointer:data ratio and even though the traversal / insertion code is fairly minimal and probably nice to the L0 code cache, I wonder how much that node bloat messes up the overall performance.
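For reference, the adaptive node shapes from the ART paper look roughly like this (types only, just a sketch, not an implementation); growing a node only when it fills is what keeps the pointer:data ratio down compared to a fixed 256-way radix node:

```rust
// Sketch of ART's adaptive inner node sizes (following the ART paper).
// A node is only grown to the next size when it fills up, so sparse nodes
// don't pay for 256 child slots the way a fixed-fanout radix node does.
enum ArtNode<V> {
    /// Up to 4 children: parallel arrays of key bytes and child pointers.
    Node4 { keys: [u8; 4], children: [Option<Box<ArtNode<V>>>; 4] },
    /// Up to 16 children: same layout, searched linearly or with SIMD.
    Node16 { keys: [u8; 16], children: [Option<Box<ArtNode<V>>>; 16] },
    /// Up to 48 children: a 256-entry index byte array into a dense child array.
    Node48 { index: [u8; 256], children: [Option<Box<ArtNode<V>>>; 48] },
    /// Fully dense: direct 256-way fanout.
    Node256 { children: [Option<Box<ArtNode<V>>>; 256] },
    /// Leaves store the full key so prefix compression stays correct.
    Leaf { key: Vec<u8>, value: V },
}
```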
For supporting long-lived pointers such as iterators in Crossbeam, I have implemented a very preliminary version of a hybrid of EBR (epoch-based reclamation) and HP (hazard pointers). The high-level idea of this hybrid is using HP for long-lived pointers and EBR for short-lived pointers so that long-lived ones do not hinder the epoch advancement, while short-lived ones enjoy short access latency. This hybrid is very much inspired by Snowflake.
Though I'm not sure HP is actually helpful in the iteration workloads. First, maybe it's not costly to count the references to a node; I'd like to produce some benchmark results. Second, HP protects only a handful of pointers, while a user may want to iterate over a large set of data. In reference-counting, you're going to mark all the nodes first and then iterate over them later, right?
> Second, HP protects only a handful of pointers, while a user may want to iterate over a large set of data. In reference-counting, you're going to mark all the nodes first and then iterate over them later, right?
Not really. The idea is to increment and decrement reference counts as you go from one node to another.
The same applies to hazard pointers. For example, if you want to iterate over a long chain of 1000 nodes, you only need two hazard pointers to do that.
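A tiny single-threaded illustration of the hand-over-hand idea, using plain `Arc` counts as a stand-in for hazard pointers or per-node counts; at any point the traversal holds at most two nodes:

```rust
use std::sync::Arc;

// Toy chain of nodes linked by Arc, standing in for a chain of tree nodes.
struct Node {
    value: u64,
    next: Option<Arc<Node>>,
}

// Hand-over-hand traversal: grab a reference to the next node before
// releasing the current one, so at most two nodes are "protected" at a time.
fn sum_chain(head: Arc<Node>) -> u64 {
    let mut total = 0;
    let mut cur = head;
    loop {
        total += cur.value;
        match cur.next.clone() {
            Some(next) => cur = next, // the old `cur`'s count drops here
            None => return total,
        }
    }
}

fn main() {
    // Build the chain 3 -> 2 -> 1 and walk it.
    let chain = Arc::new(Node {
        value: 3,
        next: Some(Arc::new(Node {
            value: 2,
            next: Some(Arc::new(Node { value: 1, next: None })),
        })),
    });
    assert_eq!(sum_chain(chain), 6);
}
```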
Hey @stjepang, have you played around with a lock-free ART in rust yet? I'm starting to think more about using one to replace the naive radix tree here in sled.
I haven't, but my friend implemented this, which should be a good starting point for a concurrent ART.
@stjepang ahh cool, I pinged Andy Pavlo a couple of days ago, and he was kind enough to provide some recommendations. He also mentioned the ART, and I've kicked off a few mental threads around it.
So, there is a pattern in use in sled, inspired by the LLAMA paper, for keeping lock-free data structures synchronized with a log-structured persistent store (in terms of linearizing updates): we first reserve a certain amount of space in the log, then do a CAS on the in-memory structure, and then, based on the success of the CAS, write either a "failed flush" marker or the actual data for the update into the reserved log space. I have been wondering about the performance of doing this to create a persistent ART index for point queries.
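A deliberately simplified, single-threaded sketch of that ordering (not sled's actual code; the log is just a `Vec` of slots and the in-memory structure a single `AtomicPtr`):

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// Toy stand-ins for illustration: a log "reservation" is just a slot in a Vec,
// and the in-memory structure is a single atomic pointer to a page.
enum LogEntry {
    FailedFlush,
    Update(Vec<u8>),
}

fn update_page(
    log: &mut Vec<Option<LogEntry>>,
    page: &AtomicPtr<Vec<u8>>,
    expected: *mut Vec<u8>,
    new_page: Box<Vec<u8>>,
    payload: Vec<u8>,
) -> bool {
    // 1. Reserve space in the log before touching the in-memory structure,
    //    fixing this update's position in the serialized order.
    log.push(None);
    let slot = log.len() - 1;

    // 2. CAS the in-memory structure.
    let new_ptr = Box::into_raw(new_page);
    let won = page
        .compare_exchange(expected, new_ptr, Ordering::AcqRel, Ordering::Acquire)
        .is_ok();

    // 3. Fill the reserved slot based on the CAS outcome, so the log and the
    //    in-memory state agree about whether this update happened.
    if won {
        log[slot] = Some(LogEntry::Update(payload));
    } else {
        log[slot] = Some(LogEntry::FailedFlush);
        // CAS lost: reclaim the page we allocated but never published.
        unsafe { drop(Box::from_raw(new_ptr)) };
    }
    won
}
```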
@spacejam In fact, one of my mid-term goals is pagecache + persistent memory :) I really enjoyed Andy's recent work, and I'm glad that you already contacted him!
I'm also thinking about an in-memory lock-free ART. It is very... intriguing, but I hope we will find a way. I haven't thought about a persistent ART, but it would be very interesting if the page index for a log-structured storage (LSS) were also persisted in the LSS itself! I'm curious how to break this seeming circularity, though.
@jeehoonkang that sounds awesome! If you're going for optimization, these two tunables are probably the biggest ones to play with for your workloads:
- `segment_cleanup_threshold`: when to start copying fragments to a new segment, in the range 0-1, where 0.5 means a segment's page fragments start being copied to new segments once 50% of its data has been relocated elsewhere
- `page_consolidation_threshold`: how long a page's chain of deltas needs to get before the compacted page is rewritten somewhere

For hosting on the LSS, that's what sled is doing now. Even if the checkpoint gets deleted, it will scan the storage, record which page fragments are stored at which log offsets, and use the recover functionality in the BLinkMaterializer to track root page changes. Then, when requests come in, it just starts at the root page and uses the recorded log offsets to materialize pages while working down the tree. I think the same approach could be used directly for a persistent ART. I'm skeptical about the persistent version being able to beat the Bw-tree in terms of space efficiency or read latency when it isn't loaded into memory completely, but if we don't evict nodes from cache it might be good. In any case, it would probably not be the hardest thing to build a prototype of, since pagecache is now split out :)
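Very roughly, the recovery side looks like this (placeholder types, not sled's API): scan the log once, build a page-id to log-offset index, and materialize pages lazily from it.

```rust
use std::collections::HashMap;

type PageId = u64;
type LogOffset = u64;

// Placeholder for what a log scan would yield for each page fragment found.
struct FragmentRecord {
    page_id: PageId,
    offset: LogOffset,
}

// Build the in-memory index of page id -> offsets of that page's fragments.
// Pages are then materialized lazily (root first, then down the tree) by
// reading and folding the fragments recorded here.
fn recover_page_index(scanned: &[FragmentRecord]) -> HashMap<PageId, Vec<LogOffset>> {
    let mut index: HashMap<PageId, Vec<LogOffset>> = HashMap::new();
    for rec in scanned {
        index.entry(rec.page_id).or_default().push(rec.offset);
    }
    index
}
```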
An interesting new paper was published:
While this seems interesting and possibly something I'd like to do in the future, I'm going to close this because it's not relevant to the sled project itself. FWIW, it's possible to create this by basically just chopping out the pagetable completely in sled, and this might even be a nice exercise to see how small the tree implementation can become when it doesn't have a pagecache underneath to broker atomic operations.
implemented as the `concurrent-map` crate :)
I'd be very interested in including a lock-free `BTreeMap` in Crossbeam at some point in the future. I gave this topic a lot of thought, and have been intending to start implementing a Bw-Tree for a long time, so let's share notes. If possible, we should probably avoid duplicating our efforts. :)

I think it's possible to abstract a Bw-Tree into a separate crate, expose a completely in-memory API for use as a concurrent `BTreeMap` (and provide it as the `crossbeam-bwtree` crate), and finally plug it into the rest of rsdb as the index engine. I understand that your project is already very ambitious and you have limited time resources, but would you be interested in working towards this goal?

Here are some of my thoughts on a completely in-memory Bw-Tree (let's call it `BwTreeMap<K, V>`). Please consider everything I'm writing as sketches and ideas I've been keeping in mind rather than a solid, well-thought-out plan.

1. Types
In the standard library we have `BTreeMap<K: Ord, V>`, and ideally we'd like a data structure that behaves very similarly and provides a similar API, but is lock-free and allows concurrent mutation. Since the Bw-Tree is a B+-tree, we have an additional bound: `K: Clone`.

I noticed that in rsdb keys are of type `Vec<u8>` (perhaps they could even be abstracted as `K: AsRef<[u8]>`) - essentially, just byte arrays. Is there a reason why you're limiting the API to byte arrays only? I'm aware that there are numerous benefits to representing keys as byte arrays in databases, but would it hurt to generalize the key type?

2. Page Table
In rsdb you implemented a radix tree as the page table storage. The page table is one of the crucial tricks behind the Bw-Tree, and it took me a long time to realize that having the table as an array, hash table, or some kind of tree is completely unnecessary.

In `BwTreeMap` we don't really need a complicated, memory-hungry (and slow!) page mapping data structure to manage pages. A page table is very useful when there is a need for moving pages between storage layers (memory and disk), but otherwise it's just a level of indirection that enables some lock-free operations.

In `BwTreeMap`, instead of inserting a new ID into the page mapping table, we'd simply allocate an atomic pointer on the heap (akin to `Box<AtomicPtr<Node>>`), which would act as a node in the tree pointing to another node. Similarly, instead of removing a slot from the table, we'd deallocate this node from the heap. In other words, IDs are just pointers to those special nodes, and the mapping from an ID to another node is written inside the heap-allocated node itself.

I think the Bw-Tree should allow configuring the page table implementation (the API might be: `BwTree<K: Key, V: Value, PT: PageTable>`). Then `BwTreeMap` would use a simple "page table" that just allocates and deallocates nodes on the heap, and rsdb would use something more complicated (a radix tree?).
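A rough sketch of what I mean, with invented names (not an actual crossbeam or rsdb API): the page table becomes a trait, and the purely in-memory implementation hands out heap-allocated atomic pointers instead of integer IDs.

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// Invented names for illustration. A "page id" is whatever handle the page
// table hands out; nodes are erased to raw byte pointers to keep it short.
trait PageTable {
    type Id: Copy;
    fn allocate(&self) -> Self::Id;
    fn load(&self, id: Self::Id) -> *mut u8;
    fn cas(&self, id: Self::Id, old: *mut u8, new: *mut u8) -> bool;
}

// Purely in-memory "page table": an id is just a pointer to a heap-allocated
// AtomicPtr, so a lookup is a single dereference with no mapping structure.
// (Deallocation is omitted from this sketch.)
struct HeapIndirection;

impl PageTable for HeapIndirection {
    type Id = *const AtomicPtr<u8>;

    fn allocate(&self) -> Self::Id {
        Box::into_raw(Box::new(AtomicPtr::new(std::ptr::null_mut())))
    }

    fn load(&self, id: Self::Id) -> *mut u8 {
        unsafe { (*id).load(Ordering::Acquire) }
    }

    fn cas(&self, id: Self::Id, old: *mut u8, new: *mut u8) -> bool {
        unsafe {
            (*id)
                .compare_exchange(old, new, Ordering::AcqRel, Ordering::Acquire)
                .is_ok()
        }
    }
}
```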
3. Iterators

Unlike the typical quick operations on a Bw-Tree (e.g. insert/search/remove), iterators may be long-lived. This is a problem in the context of memory reclamation: holding a thread pinned for the entire duration of iteration might block garbage collection for an unacceptably long time. I wonder how you're going to solve this problem.

My idea was to keep a reference count in each node of the tree. That would allow us to hold a reference to a node even after unpinning the current thread. The key idea is that we increment the reference count of a node just before unpinning, and then use the node outside the pinned region of code. All nodes start with a reference count of one. When removing a node from the tree we just decrement the count. Whoever decrements the count to zero has to add the node as a piece of garbage to the epoch GC.
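A small sketch of that counting scheme with std atomics; the global `GARBAGE` list is just a stand-in for handing the node to the epoch GC, and node addresses are stored as `usize` only to keep the sketch self-contained.

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};
use std::sync::Mutex;

struct Node {
    refs: AtomicUsize, // starts at 1: the tree's own reference
    // ... keys, values, child links ...
}

// Stand-in for "add the node as a piece of garbage to the epoch GC".
static GARBAGE: Mutex<Vec<usize>> = Mutex::new(Vec::new());

// Called just before unpinning, while the node is still protected by the pin:
// the iterator takes its own reference so the node outlives the pinned region.
fn acquire(node: &Node) {
    node.refs.fetch_add(1, Ordering::Relaxed);
}

// Called when the iterator moves past the node, or when a remover unlinks it
// from the tree. Whoever drops the count to zero retires the node.
fn release(node: *mut Node) {
    let prev = unsafe { (*node).refs.fetch_sub(1, Ordering::Release) };
    if prev == 1 {
        fence(Ordering::Acquire);
        GARBAGE.lock().unwrap().push(node as usize);
    }
}
```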
4. Thoughts on ART
Finally, I wonder what you think about adaptive radix trees. In my benchmarks, an ART is typically several times faster than a B-tree for insert/search/delete, and several times slower for iteration. If your keys are byte arrays, ART can be a fantastic data structure. Having an ART in Crossbeam is another long-term goal of mine.
Servo uses a concurrent string cache for interning (it's just a cache of strings that maps every unique string to a unique integer ID), which is based on a sharded, mutex-protected hash table. I think some performance gains might be achieved by switching to an ART.
ART is probably not a good fit for rsdb as the index data structure, though, since it's more difficult to seamlessly move nodes between memory and disk, keys have to be byte arrays, iteration is slightly worse, and it's not strictly speaking lock-free. But I'm still curious if you have any thoughts on this. :)
Is your page mapping table perhaps going to be an ART?
cc @aturon - In case you are interested, this is the cool database project I mentioned to you last week, and the Bw-Tree is the concurrent B+-tree behind it.