rust-lang / rust


BTreeSet and BTreeMap append optimization #34666

Open xosmig opened 8 years ago

xosmig commented 8 years ago

When the largest key of one tree is less than the smallest key of the other, there is a way to merge the trees using only O(log n) operations. Maybe logarithmic merge should be a separate method. I'm ready to work on it.

UPD: It may also sometimes be better to push the elements of one tree into the other, costing O(min_length · log(max_length)) operations, instead of rebuilding in O(n + m) operations (as in #32526).

But the problem is that it's hard to determine which approach is better in a particular case.
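To make the trade-off concrete, here is a minimal sketch of a size-based heuristic. `merge_into` and the `16` threshold are hypothetical, not a std API or a tuned constant; the point is only that element-wise insertion costs O(m log n) while `append` rebuilds in O(n + m):

```rust
use std::collections::BTreeSet;

// Hypothetical helper: merge `src` into `dst`, picking a strategy by size.
fn merge_into(dst: &mut BTreeSet<u32>, src: &mut BTreeSet<u32>) {
    // Heuristic threshold (illustrative, not tuned): if `src` is much
    // smaller, insert element by element in O(m log n); otherwise fall
    // back to `append`, which rebuilds in O(n + m).
    if src.len() * 16 < dst.len() {
        // Draining `src` preserves `append`'s "other is left empty" contract.
        while let Some(x) = src.pop_first() {
            dst.insert(x);
        }
    } else {
        dst.append(src);
    }
}

fn main() {
    let mut big: BTreeSet<u32> = (0..1_000).collect();
    let mut small: BTreeSet<u32> = (2_000..2_010).collect();
    merge_into(&mut big, &mut small);
    assert_eq!(big.len(), 1_010);
    assert!(small.is_empty());
}
```

Even this crude heuristic illustrates the difficulty: the right threshold depends on allocation costs and key comparison costs, which is exactly why picking a strategy automatically is hard.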

xosmig commented 8 years ago

CC @gereeter

alfiedotwtf commented 7 years ago

Am I missing something? I don't see append() on BTreeSet

xosmig commented 7 years ago

@alfiedotwtf what's wrong with this one? https://doc.rust-lang.org/std/collections/struct.BTreeSet.html#method.append

alfiedotwtf commented 7 years ago

Gah! I don't know how I managed to land on old documentation:

https://doc.rust-lang.org/1.6.0/std/collections/struct.BTreeSet.html

Thanks @xosmig :)

kirillkh commented 7 years ago

I would love to see this fixed! I have recently developed a data structure [1] which is supposed to be faster than BTreeSet for a specific usage pattern, but I can't write a fair benchmark until BTreeSet supports an O(k + log n) delete-range operation.

[1] https://users.rust-lang.org/t/announcing-teardowntree/8259
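For reference, the closest std offers today is composing `split_off` and `append`; a sketch of deleting the half-open range [lo, hi) that way (`remove_range` is a hypothetical helper, and with the current implementations it is O(n), not the O(k + log n) asked for above):

```rust
use std::collections::BTreeSet;

// Hypothetical helper: remove all keys in [lo, hi) from `set`.
fn remove_range(set: &mut BTreeSet<u32>, lo: u32, hi: u32) {
    let mut tail = set.split_off(&lo);  // tail holds keys >= lo
    let mut keep = tail.split_off(&hi); // keep holds keys >= hi
    set.append(&mut keep);              // keys in [lo, hi) are dropped with `tail`
}

fn main() {
    let mut set: BTreeSet<u32> = (0..100).collect();
    remove_range(&mut set, 10, 20);
    assert_eq!(set.len(), 90);
    assert!(!set.contains(&15));
    assert!(set.contains(&9) && set.contains(&20));
}
```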

Andlon commented 4 years ago

I think the current append implementation is somewhat problematic. Arguably, one would expect a "batch" operation like set.append(other_set) to generally be the better choice when inserting one set into another, because presumably it can apply optimizations that cannot be made when inserting one item at a time. However, the O(n + m) complexity means that if, say, n is large and m is small, the cost of calling append is enormous.

I ran into this problem today: in a relatively large computational problem, I called append roughly 500k times with n ~ 100k and m ~ 10, and my program had not finished after half an hour. After changing to insert in a loop, it finished in 2 seconds.
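A minimal sketch of that workaround, with small illustrative sizes standing in for the n ~ 100k, m ~ 10 case described above:

```rust
use std::collections::BTreeSet;

fn main() {
    let mut big: BTreeSet<u64> = (0..100_000).collect();
    let mut small: BTreeSet<u64> = (200_000..200_010).collect();

    // Instead of `big.append(&mut small)`, which rebuilds `big` in
    // O(n + m), move elements one at a time: O(m log n), cheap when m << n.
    while let Some(x) = small.pop_first() {
        big.insert(x);
    }

    assert_eq!(big.len(), 100_010);
    assert!(small.is_empty());
}
```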

There are two problems here, as I see it:

Of course, ideally append would work well in all cases, but deciding when to pick each strategy is surely not a simple problem...

memoryruins commented 4 years ago

cc @ssomers (due to their many improvements to BTreeMap/BTreeSet in the last couple of years)

ssomers commented 4 years ago

Rest assured, I was subscribed... It's not hard to improve this in many cases, as Andlon proved, but it is hard to prove it doesn't become worse in other situations. There's a lot of benchmarking to do.

mdacach commented 10 months ago

I was bitten by this in the context of purposefully trying to merge a smaller BTreeSet into a larger one[^1]. Neither the function name nor the documentation helped to diagnose the issue. I think that improving the documentation for append would already help significantly.

[^1]: "Small to Large" (😅) optimization: https://soi.ch/wiki/smaller-to-larger/