narimiran / sorta

SortedTables in Nim, based on B-Trees
MIT License
17 stars 3 forks source link

Bulk Loading Algorithm(s) #2

Open c-blake opened 4 years ago

c-blake commented 4 years ago

B-Trees have efficient algorithm(s) to populate a tree for later modification from a set of entries already in-order (or in-reverse order). The algorithm is fast both in constant factors and asymptotically and creates a dense tree. https://en.wikipedia.org/wiki/B-tree has details.

c-blake commented 4 years ago

Well, I have an impl over in https://github.com/c-blake/adix of this. The usual B-Tree invariants are violated in the course of this initial phase and require a final touch up at the end. So, for example in Unix shell script-ese (I use Zsh , but it would probably work in Bash, too) you can do:

cd adix/tests
nim c ppss
nim c btshell
(for i in {1..100}; do echo A1 $i 0; done; echo D1 0; echo c; echo p)| ./ppss | ./btshell | less -R  # -R for color structure highlighting

The A1 $i 0s do the initial phase bulk add (with zero extra slots) while the D1 0 does the finalization/B-Tree invariant restoring part. You can also do A1 $i 1 and D1 1 to do only 18, not 19. (The default is a 10..20 link tree.) That could help avoid some splits if you think you might add a few elements here & there after initial construction but still get you 18/19 = 95% utilization. I suspect, but do not know for sure, that this algorithm can be generalized to allow any valid amount of spare space (well, less than half to satisfy the B-tree invariants anyway).

You'd need >~ 19*19=361 objects to start to see a third level which actually isnt' so bad if you pipe to a pager like less, but you could also just compile btshell with -d:btTall to get more structure sooner. Tall is often more useful for debugging than benchmarking/actual use.