stevendesu / jsindex

1 stars 0 forks source link

B-Tree-style indexes for Number fields #19

Open stevendesu opened 5 years ago

stevendesu commented 5 years ago

Currently there's no convenient way to search for ranges:

const collection = [
    {name: "Alice", age: 23, salary: 55000},
    {name: "Bob", age: 28, salary: 38000},
    {name: "Charlie", age: 42, salary: 80000},
    {name: "Dana", age: 23, salary: 40000},
    {name: "Eve", age: 18, salary: 15000}
];

const results =collection.search({
    salary: [30000, 45000]
});

// Expect:
results == [
    {name: "Bob", age: 28, salary: 38000},
    {name: "Dana", age: 23, salary: 40000}
];

This is a useful feature that I'd like to add... I just have to think about how to do it.

The index itself isn't that complicated, it will just take some work:

The harder part, I think, is actually building the index. We could take a binary tree approach where each element is "more" or "less than" the previous one, but in the worst case scenario this will lead to trees heavily weighted to one side (particularly for sorted collections). By finding the minimum and maximum values we could use a heuristic approach (assume values are evenly divided between the minimum and maximum), but this requires a second pass through the collection to build the indexes.

I'll have to do some research into algorithms that could help us here.

stevendesu commented 5 years ago

So here's the problem with using B-trees, specifically: B-trees have very expensive insertions in exchange for extremely fast reads. These expensive insertions almost always involve sorting the elements in a leaf node, and occasionally involve re-indexing the entire tree when you need to add a new level (although since we know the length of the array going in, we can calculate the proper level to avoid this on the first indexing)

This constant re-sorting of elements in the leaf node will cause the initial index process to be far too slow. We need a self-balancing tree algorithm that doesn't depend on so much sorting (if possible)

Technically we could pre-sort the array of possible index values and then build the index from this pre-sorted array (which only involves one Quicksort operation on an array of length n rather than potentially thousands of Quicksort operations on several arrays of length 2-4), but I'd prefer to avoid running a sort algorithm at all if it can be done. Even Quicksort is a O(n*log(n)) operation, and I want to keep the indexing to O(n) where possible.

I may be over-estimating the cost of insertions with a B-tree. Because a limited number of elements needs to be sorted (2-4) and this number doesn't grow with the size of the array, it's arguably a constant-time operation. It's possible that there's no ordered index which can be built in O(n) time, but a B-tree may be the closest I can get with O(log(n)) insertions over n elements for a net O(n*log(n)) (no worse than pre-sorting the values)