jneem / imbl

Blazing fast immutable collection datatypes for Rust.
Mozilla Public License 2.0
84 stars 13 forks source link

Fix vector merge algorithm to prevent the tree depth from growing #35

Open jneem opened 3 years ago

jneem commented 3 years ago

Prompted by #33, I dug into Vector some more this weekend and I'm not super encouraged by what I found. I made some tweaks to support smaller branching factors (because when the branching factor is large then it's really hard to hit corner cases with proptesting and fuzzing), and I immediately ran into some failed tests. In particular, test_inserts caused a panic, and when debugging the panic I found that it was producing very large skinny trees. Now some issues may have been caused by my amateurish attempts to support smaller branching factors, but the skinny tree thing seems to be an issue in unmodified im also: if I modify test_inserts to make larger vectors, I see that vectors of length 40000ish already have tree height 7. But with a branching factor of 64, height 3 should already get us to 200k!

Part of my difficulty is that I don't really understand the RRB merge algorithm. The description in the original paper is not really complete, in my opinion, and neither is the one in L'orange's Masters thesis. It appears I'm not the first person with this complaint, see this, and also this and this for more background. @nomad010 you said you had an independent re-implementation of the RRB algorithm -- do you have any thoughts on this?

nomad010 commented 3 years ago

I have been unhappy with the RRB documentation since I started implementing it nearly 2 years ago. I only really understood the intention of the concatenation balancing algorithm relatively recently. test_inserts was written by me in my own project and was ported over to im/imbl when I saw it produced bugs and weird behavior. One thing that might clear up some things for you: there are two different RRB concatenation algorithms. The one this project implements is from Nicolas Stucki's masters thesis Turning Relaxed Radix Balanced Vector from Theory into Practice for Scala Collections.

The concatenation algorithm presented there is much more approachable than the one from L'orange. Unfortunately, I suspect that this algorithm bounds the height to O(log(N*C)) where C is the number of concatenations that have occurred. In the test_inserts C=N and so the height is about twice what we expect.

Currently, my implementation also produces these problematic trees, however if you implement test_inserts in L'oranges librrb, you get trees with the expected small height. I am in the process of rewriting my librrb implementation from the ground up to work with L'oranges concatenation algorithm. (That or replacing it with a custom B-Tree implementation).

I'm not sure what this means for this means for imbl's vector, it would be possible to implement L'oranges librrb concatenation algorithm with a lot of effort, His actual implementation is not pleasant to look at it, but at least it works.

jneem commented 3 years ago

Thanks, that's super helpful! I hadn't read Stucki's thesis before. The algorithm there is clear, at least, although I'm not sure what properties it's supposed to have. I guess the inputs to mergeRebalance consist of a "left-packed tree" on the left (for whatever precise definition of "left-packed" is being used), a tree in the middle that I guess should also be left-packed, and a tree on the right that used to be left-packed until you removed some node on its left-most path the left-most child of the root (which still leaves it left-packed, I think?). And then in order for the merging algorithm to be correct, mergeRebalance should return a left-packed tree. I guess that's what Stucki's algorithm isn't quite doing, and that's why the trees get large.

nomad010 commented 3 years ago

Sorry, I've been pretty busy recently. I meant to add that what after your comment is that what L'orange (and the original paper) calls left-packed is a regular M-1 - M B-Tree. I think its this invariant that is broken, either by a bug in im or a bug in the algorithm. I haven't re-analysed Stucki's work given what I now know so I could be wrong about the problem being there.

I'm happy to help fix the implementation if you want? The first thing to do there is ensure those mergeRebalance calls properly return M-1 - M trees.

jneem commented 3 years ago

Sure, I'd be happy for any help you have time for!

I am a bit confused about the M-1 -- M trees thing, though, because let's say M=32 and you have a tree that's almost completely dense except that it's right-most node has only 1 leaf, and you append a tree that's completely dense. If you want to end up with a 31 -- 32 tree in the end, then you either end up shifting/copying all the leaves or you have to distribute the "missing" 31 entries over 31 different nodes. None of the implementations (or papers) that I saw said anything about distributing the gaps in this way, so for that reason I figured they were going for the other invariant mentioned in the paper: the one where you just have an upper bound on the number of linear search steps you need to take. AFAICT that invariant allows you to have leaves that are way underfull, as long as there aren't too many of them.

nomad010 commented 3 years ago

I believe the copying and shifting of is handled by createConcatenationPlan and executeConcatenationPlan so I was looking mainly at the those functions for insight there.

I figured they were using a combination of the two invariants. One invariant to guarantee the tree does not become too lop-sided and another to faciltate radix-searching. You could be right, though, and they are just using the linear search step invariant only. I'll do some reading to try and clarify that.

nomad010 commented 3 years ago

I've had another read through of imbl's concatenation and the algorithm in Nicolas Stucki's thesis. I'm fairly certain the two correspond pretty much exactly and there no major differences. The only thing I can think of is that Stucki's algorithm might just be a shifting of nodes instead of a rebalancing.

The RHS of the concatenation steps will almost be have M^2 subtrees and the LHS be sparse. This will lead to the LHS getting M^2 subtrees and the RHS becoming sparse. This just causes the sparse part to move to another part of the tree and can still cause problems. This is just an educated guess, I don't have too much evidence other than what I've seen.

Also, you were right. They do not specifically guarantee M-1 -- M trees, but they do have a process where they compact trees with fewer than M - e/2 subtrees. The default value for e seems is 2 in librrb. In a test case like test_inserts, it seems that lots of 31 or 32 trees are created.

I think the next step might be to implement L'oranges concatenation algorithm in imbl. We can grab implementation details from L'orange's librrb, but it is complicated a bit by the library design itself.

jneem commented 3 years ago

Sure, that sounds like the logical thing to do. I'll also push a branch I have lying around that has a bunch of debugging printfs and reduces the node size to 4. It makes stepping through example much easier...

By the way, one thing I'd really like to produce is some document (maybe a blog post or something) really explaining the algorithm. I think that's something sorely missing right now...

jneem commented 3 years ago

Ok, I've read create_concatenation_plan and I see what you mean about not guaranteeing 31 -- 32 trees: it looks to me like they basically try to (in that when they encounter a node with 30 or fewer children, they try to pack in subsequent children to make everything 32), except that they're willing to bail out early if the total number of nodes is small enough. I didn't understand where you got the factor 2 in M - e/2, though? Here it looks to me like M - e, where e is RRB_EXTRAS, which is 2 by default)

nomad010 commented 3 years ago

Its kind of silly but there is another parameter RRB_INVARIANT which is 1. I think it should actually just actually just be RRB_EXTRAS/2 (e/2).

I think it would be great to write some sort of blog post on this when we get it working since existing documentation on the RRB tree is just so lacking.

jneem commented 3 years ago

Right, so the way I was thinking about it was that RRB_EXTRAS is related to the number of extra linear search steps allowed, while RRB_INVARIANT is the "1" in M/M-1 trees. I mean, I don't know that they're exactly that, but at least RRB_INVARIANT is used to somehow control how many children each non-rightmost node has, while RRB_EXTRAS is used to ensure that the number of children is not too large compared to the number of grandchildren (which isn't exactly the same as the number of extra search steps, though, because that amounts to ensuring that the number of children is not too large compared to the number of leaves).

nomad010 commented 3 years ago

Okay, so I went through some of the literature again (I found that L'orange's website also links to a youtube video by Phil Bagwell on RRB's trees). They don't seem to actually have a height guarantee (if you look at L'orange's thesis 3.4.3, they have a lgN guarantee in some situations).

I think I could implement that concatenation algorithm now and understand what I would need do for it, I'm more familiar with my own RRB implementation (which I am also rewriting anyay), but I think I could implement it in im/imbl with some exploration.

jneem commented 3 years ago

That would be awesome! So my understanding is that they do guarantee logarithmic height if all you do is concatenation, but as soon as you start mixing that with other operations then the guarantees go away?

nomad010 commented 3 years ago

I believe what you said makes sense. In particular I think the only operation you need to worry about with concatenations is splits.

Sorry work picked up and I've had more or less no time to look at this over the last few weeks. I will be able to get a fresh look at it closer to the end of the year.

Maybe I should go over my plan with you. I have a private repo that I'm implementing my own RRB tree external to imbl. (It's also external to my other RRB implementation that I'm using in other projects). I will invite you to the repo when I'm able. It follows a different structure and has different invariants and properties to imbls one, but I find it much clearer to implement it this way. Also the invariants aren't so dissimilar as to make it not an RRB tree.

After I've done testing on my implementation I will report back to how that goes and hopefully implement it in imbl.

jneem commented 3 years ago

Great! Coincidentally, I have also just managed to find time to do some work here. I'm trying to get some better tests, fuzzing, and benchmarking going, in the hope that it will make it easier to change the algorithm with confidence...

jneem commented 3 years ago

gonna go with a less inflammatory title...