Open JohannesLichtenberger opened 4 years ago
Thanks for your question. I hadn't investigated it before, but I have now. Here's my understanding:
The reason is subtly stated in another paper.
> ART outperforms the other indexes for all workloads and key types except Scan/Insert, where its iteration requires more memory access than the OpenBw-Tree.
One way to think intuitively about why ART requires more memory accesses is that user keys are stored only in the leaf nodes, whereas in an RB tree every node stores a key. This means ART has additional supporting links/nodes on the path to those leaf nodes, so any traversal that visits many successive user keys has to pass through those supporting links/nodes.
I ran some experiments with a dataset of 100,000 random 64-bit integers. An RB tree will have 2n links, i.e. 200,000, since it has a node for every user key and every node has a link to its left and right child.
For ART it depends on the dataset. In this case the node counts and links are:
| Node | #Nodes | #Links | #Nodes per level |
|---|---|---|---|
| Node4 | 20785 | 83140 | L3=20193, L4=592 |
| Node16 | 6252 | 100032 | L3=6252 |
| Node48 | 0 | 0 | |
| Node256 | 129 | 33024 | L1=1, L2=128 |
| LeafNode | 100000 | 0 | L3=4786, L4=94029, L5=1185 |
In total 216,196 links. Hence ART has about 16k more links than the RB tree. More links mean more load instructions and likely more cache misses. The height of ART is 5, whereas the RB tree's height is 20. Point queries only require traversing the height, hence ART is faster.
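The link totals above can be cross-checked with simple arithmetic — a quick sketch using the node counts from the table (a `NodeX` holds X child pointers):

```java
public class LinkCount {
    public static void main(String[] args) {
        // Inner-node links in ART: node count times pointer capacity.
        long artLinks = 20785L * 4    // Node4
                      + 6252L * 16    // Node16
                      + 0L * 48       // Node48
                      + 129L * 256;   // Node256
        // RB tree: one node per key, two child pointers per node.
        long n = 100_000;
        long rbLinks = 2 * n;
        System.out.println(artLinks);           // 216196
        System.out.println(artLinks - rbLinks); // 16196
    }
}
```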
A range scan is a `ceilingEntry` call plus a number of `successor` calls, depending on the scan length. The `ceilingEntry` involves a depth traversal just as in `get`, and in some cases, if you reach the bottom, a `successor` call. A scan of length 0 only requires the `ceilingEntry` call, which is largely the depth traversal, so ART should be faster because of its smaller depth. As the scan length increases, more of those supporting links need to be traversed. For example, if the current user key came from the 2nd child of a `Node4`, the next user key is simply the next child in the same `Node4`, and similarly for the child after that. After that we have to go back up to the `Node4`'s parent and pass through the supporting links. This seems to be the case for this dataset, as most of the leaf nodes are at L4 and have a `Node4` as parent on L3. This means a scan length of 4 should be the turning point, and from the plots it does look like it is.
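For reference, this is what the scan looks like on the `TreeMap` side — a `ceilingEntry` to find the start key, followed by `higherEntry` (the successor) calls. A minimal sketch, not the actual YCSB harness:

```java
import java.util.Map;
import java.util.TreeMap;

public class RangeScan {
    // Scan up to `length` keys starting at the smallest key >= start.
    static long scan(TreeMap<Long, String> map, long start, int length) {
        long visited = 0;
        Map.Entry<Long, String> e = map.ceilingEntry(start); // depth traversal
        while (e != null && visited < length) {
            visited++;                       // consume e.getKey()/e.getValue()
            e = map.higherEntry(e.getKey()); // successor: may walk back up the tree
        }
        return visited;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> map = new TreeMap<>();
        for (long k = 0; k < 100; k += 2) map.put(k, "v" + k);
        System.out.println(scan(map, 5, 10)); // scans keys 6, 8, ..., 24 -> 10
    }
}
```

Each `higherEntry` is exactly the `successor` step discussed above: either a child hop or a walk up through parent links until a right ancestor is found.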
Plots of perf events.
For a better understanding of the traversal, here's a raw dump of the moves taken to get to the next key for a single YCSB scan request of length 50, in both ART and TreeMap.
It'd be nice to somehow derive a theoretical worst-case bound for ART on the number of links visited, depending on `n`, `span`, `length`, and maybe a distribution factor. In the case of an RB tree it is straightforward, as the tree structure depends only on `n`.
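For the RB tree, a bound follows directly from the textbook height guarantee, height ≤ 2·log2(n+1). A quick sketch (the observed height of 20 for n = 100,000 is well within this bound):

```java
public class RbHeightBound {
    // Worst-case height of a red-black tree with n internal nodes:
    // height <= 2 * log2(n + 1).
    static int heightBound(long n) {
        return (int) Math.floor(2 * (Math.log(n + 1) / Math.log(2)));
    }

    public static void main(String[] args) {
        System.out.println(heightBound(100_000)); // 33
    }
}
```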
Hope that explains why ART is slower for range scans as compared to RBTree. Let me know if you have further questions.
2n links for binary trees? Isn't it n + 1 (with null links)?
Hm, I think you're right. However, in binary trees you sometimes also have to follow even more links upward during range scans.
But I think that, compared to a binary tree, a cache-oblivious B+ tree might be even better because of locality and CPU caches. That said, I don't know how a cache-oblivious implicit binary tree compares (with just one array and without direct pointers), built using the van Emde Boas layout.
BTW: Any plans to kind of "advance" your ART to a HOT?
https://dbis.uibk.ac.at/sites/default/files/2018-04/hot-height-optimized-author-version.pdf
On another note, do you have any recommendations regarding JVM implementations and hardware support (for instance SIMD instructions)? It seems in-memory performance is now all about using cache lines efficiently, but I don't know what's possible with JVM-based languages nowadays. Also, I'm not sure if it makes sense to use Azul Zing or GraalVM (the latter makes sense to me, for instance because I use lambdas a lot, and small intermediate objects seem to be not that costly there).
Hi,
do you have any idea why range scans for 64-bit integers are faster in red-black trees (TreeMap)?
BTW: Regarding the height, they improved their trie with HOT, which is specifically meant to reduce the height. I think they basically now use tries within the trie.
Kind regards Johannes