Open JohannesLichtenberger opened 4 years ago
Thanks for your question. I hadn't investigated it before, but I have now. Here's my understanding:
The reason is subtly stated in another paper.
> ART outperforms the other indexes for all workloads and key types except Scan/Insert, where its iteration requires more memory access than the OpenBw-Tree.
One way to think intuitively about why ART requires more memory accesses is that user keys are stored only in the leaf nodes, whereas in an RB tree every node stores a key. This means ART has additional supporting links/nodes on the path to those leaf nodes, so any traversal that visits many successive user keys has to pass through those supporting links/nodes.
I ran some experiments with a dataset of 100,000 random 64-bit integers. An RB tree will have 2n links, i.e. 200,000, since it has a node for every user key and every node has a link to its left and right child.
For ART it depends on the dataset. In this case the node counts and links are:
| Node | #Nodes | #Links | #Nodes per level |
|---|---|---|---|
| Node4 | 20785 | 83140 | L3=20193, L4=592 |
| Node16 | 6252 | 100032 | L3=6252 |
| Node48 | 0 | 0 | |
| Node256 | 129 | 33024 | L1=1, L2=128 |
| LeafNode | 100000 | 0 | L3=4786, L4=94029, L5=1185 |
In total 216,196 links. Hence ART has about 16k more links than the RB tree. More links mean more load instructions and likely more cache misses. The height of ART is 5, whereas the RB tree's height is 20. Point queries only require traversing the height, hence ART is faster.
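The link totals above can be cross-checked with simple arithmetic — a quick sketch using the node counts from the table (a `NodeX` holds X child pointers):

```java
public class LinkCount {
    public static void main(String[] args) {
        // Inner-node links in ART: node count times pointer capacity.
        long artLinks = 20785L * 4    // Node4
                      + 6252L * 16    // Node16
                      + 0L * 48       // Node48
                      + 129L * 256;   // Node256
        // RB tree: one node per key, two child pointers per node.
        long n = 100_000;
        long rbLinks = 2 * n;
        System.out.println(artLinks);           // 216196
        System.out.println(artLinks - rbLinks); // 16196
    }
}
```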
A range scan is a `ceilingEntry` call plus a number of `successor` calls, depending on the scan length. The `ceilingEntry` involves a depth traversal just as in `get`, and in some cases, if you reach the bottom, a `successor` call. A scan of length 0 only requires the `ceilingEntry` call, which is largely the depth traversal, so ART should be faster because of its smaller depth. As the scan length increases, more of those supporting links need to be traversed. For example, if the current user key came from the 2nd child of a `Node4`, the next user key is simply the next child in the same `Node4`, and similarly for the child after that. After that we have to go back up to the `Node4`'s parent and pass through the supporting links. This seems to be the case for this dataset, as most of the leaf nodes are at L4 and have a `Node4` as parent on L3. This means a scan length of 4 should be the turning point, and from the plots it does look like it is.
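For reference, this is what the scan looks like on the `TreeMap` side — a `ceilingEntry` to find the start key, followed by `higherEntry` (the successor) calls. A minimal sketch, not the actual YCSB harness:

```java
import java.util.Map;
import java.util.TreeMap;

public class RangeScan {
    // Scan up to `length` keys starting at the smallest key >= start.
    static long scan(TreeMap<Long, String> map, long start, int length) {
        long visited = 0;
        Map.Entry<Long, String> e = map.ceilingEntry(start); // depth traversal
        while (e != null && visited < length) {
            visited++;                       // consume e.getKey()/e.getValue()
            e = map.higherEntry(e.getKey()); // successor: may walk back up the tree
        }
        return visited;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> map = new TreeMap<>();
        for (long k = 0; k < 100; k += 2) map.put(k, "v" + k);
        System.out.println(scan(map, 5, 10)); // scans keys 6, 8, ..., 24 -> 10
    }
}
```

Each `higherEntry` is exactly the `successor` step discussed above: either a child hop or a walk up through parent links until a right ancestor is found.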
Plots of perf events.
For a better understanding of the traversal, here's a raw dump of the moves taken to get to the next key for a single YCSB scan request of length 50, in both ART and TreeMap.
It'd be nice to somehow derive a theoretical worst-case bound for ART on the number of links visited, depending on `n`, `span`, `length`, and maybe a distribution factor. In the case of an RB tree it is straightforward, as the tree structure depends only on `n`.
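For the RB tree, a bound follows directly from the textbook height guarantee, height ≤ 2·log2(n+1). A quick sketch (the observed height of 20 for n = 100,000 is well within this bound):

```java
public class RbHeightBound {
    // Worst-case height of a red-black tree with n internal nodes:
    // height <= 2 * log2(n + 1).
    static int heightBound(long n) {
        return (int) Math.floor(2 * (Math.log(n + 1) / Math.log(2)));
    }

    public static void main(String[] args) {
        System.out.println(heightBound(100_000)); // 33
    }
}
```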
Hope that explains why ART is slower for range scans as compared to RBTree. Let me know if you have further questions.
2n links for binary trees? Isn't it n + 1 (with null links)?
Hm, I think you're right. However, in binary trees you sometimes also have to follow even more links upward during range scans.
But I think that, compared to a binary tree, a cache-oblivious B+ tree might be even better because of locality and CPU caches. That said, I don't know how a cache-oblivious implicit binary tree compares (with just one array and without direct pointers), built using the van Emde Boas layout.
BTW: Any plans to kind of "advance" your ART to a HOT?
https://dbis.uibk.ac.at/sites/default/files/2018-04/hot-height-optimized-author-version.pdf
On another note, do you have any recommendations regarding JVM implementations and hardware support (for instance SIMD instructions)? It seems in-memory performance is now all about using cache lines efficiently, but I don't know what's possible with JVM-based languages nowadays. Also, I'm not sure if it makes sense to use Azul Zing or GraalVM (the latter makes sense to me, for instance because I use lambdas a lot, and small intermediate objects seem to be not that costly there).
Hi,
do you have any idea why range scans for 64-bit integers are faster in red-black trees (TreeMap)?
BTW: Regarding the height, they improved their trie with HOT, which is specifically meant to reduce the height. I think they basically now use tries within the trie.
Kind regards Johannes