Question: Iteration order + sorting #78

Open andrewosh opened 6 years ago

andrewosh commented 6 years ago

Hey all,

I have a quick question that might betray my ignorance about HyperDB internals. I'm starting to build out a CouchDB-style indexing system on top of HyperDB, and I'm trying to figure out if there's a way to use the underlying HAMT without having to build a secondary index (a B-tree) on top of it.

Pretty sure that question boils down to one of iteration order. I notice that in the current master, results are sorted based on the path hash. Does the path hashing that's used to construct the prefix trie make it impossible to do an efficient ranged iteration over keys in pre-hash lexicographic order (a la LevelDB)?
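To make that concrete, here's a toy sketch of what I mean (sha256 standing in for the real path hash, purely for illustration):

// sha256 stands in for HyperDB's path hash here; the point is only
// that *any* uniform hash scatters lexicographically adjacent keys
const crypto = require('crypto')
const hash = (key) => crypto.createHash('sha256').update(key).digest()

const keys = ['users/a', 'users/b', 'users/c']
const byHash = [...keys].sort((x, y) => Buffer.compare(hash(x), hash(y)))

console.log(keys)   // lexicographic order: users/a, users/b, users/c
console.log(byHash) // hash order: effectively shuffled, so a ranged scan
                    // over users/* can't come back in key order for free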

If so, does this represent a fundamental limitation on building LevelDB-style applications on HyperDB without adding an additional level of indexing? If that's the case, would designating certain path ranges as non-hashable (i.e. building the prefix trie out of the actual path values) completely break the data structure, or would it only make it less scalable, since the trie would no longer be uniformly distributed?

Hope that's not too much! I'm trying to gauge my options before pushing forward with Plan B-Tree.

mafintosh commented 6 years ago

Hi @andrewosh :wave:

The iteration is based on the hash sort order, yea. I'd love to get it working with lexicographic ordering in a distributed fashion, but I think that part is tricky.

Here is the thing though: I'd argue you don't need lexicographic sorting at all in 90% of the use cases that involve building secondary indexes on top. The key thing here is that the iterator is deterministically sorted across nodes as it is. This is hugely important, as it means you can do distributed merge-sort-like queries efficiently.

For example, if you wanted to JOIN two tables (a table in hyperdb being represented by a bunch of data stored under a folder), you can easily do it using the iterator api.

Make two iterators, a and b. These are still sorted, so simply read a value from each and then keep reading from the iterator with the "lowest" hash sort order until it catches up with the other. Similarly you can use this to efficiently intersect two iterators. All of this is enough to build powerful applications like distributed search, distributed indexing and much more. Take a look at the diff api for an implementation of this (we need to expose the seek api).
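Here's roughly what that merge looks like, as a minimal sketch. It assumes two async iterators that both yield nodes in the same deterministic hash sort order, plus a sortKey function returning the buffer each iterator is sorted by; the names here (join, sortKey) are illustrative, not hyperdb's actual api.

// merge-join over two iterators that share a deterministic sort order
async function* join (a, b, sortKey) {
  let [x, y] = await Promise.all([a.next(), b.next()])
  while (!x.done && !y.done) {
    const cmp = Buffer.compare(sortKey(x.value), sortKey(y.value))
    if (cmp === 0) {
      // same position in both tables -> a match
      yield [x.value, y.value]
      ;[x, y] = await Promise.all([a.next(), b.next()])
    } else if (cmp < 0) {
      // advance whichever side is "lowest" until it catches up
      x = await a.next()
    } else {
      y = await b.next()
    }
  }
}

Intersection is just the cmp === 0 branch; a union would instead yield from whichever side is lowest before advancing it.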

I've been thinking about ways you can easily improve the sort order as well. E.g. if you used the above scheme to implement a distributed search engine, you might still wanna "weight" the results if a lot of results are produced. An easy way to do this would be to support a "weighted" hash:

// make a hash that is sorted by the first char and then the hash
function hash (key) {
  // note: Buffer.concat takes an array of buffers
  return Buffer.concat([Buffer.from(key[0]), siphash(key)])
}

Something like this would allow you to weight your results without filling up the trie too much, as your hash is still close to uniform.
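A quick toy check of that ordering (with sha256 standing in for siphash, so purely illustrative):

const crypto = require('crypto')

function weightedHash (key) {
  // sorted by the first char, then by the (near-uniform) hash
  const h = crypto.createHash('sha256').update(key).digest()
  return Buffer.concat([Buffer.from(key[0]), h])
}

const keys = ['beet', 'axe', 'cat', 'apple', 'banana']
keys.sort((x, y) => Buffer.compare(weightedHash(x), weightedHash(y)))
console.log(keys) // all a* keys first, then b*, then c*; order within each
                  // group is still hash order, so the trie stays near-uniform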

Does this make any sense? :)

mafintosh commented 6 years ago

I forgot to mention that my scheme above relies heavily on the fact that you can efficiently seek to any folder (which you prob know).

andrewosh commented 6 years ago

@mafintosh Thanks for the detailed response! Just for anyone following along who might not be in IRC: some preliminary results suggest that searching the trie by <key> + <hash> might work with a fixed or logarithmic overhead in node requests, and without messing with the scaling properties of the DB (lookups will still be evenly distributed). In many cases this could be a fine trade-off!

More investigation/experimentation in progress...

scriptjs commented 6 years ago

@andrewosh @mafintosh Very cool! @andrewosh Let me know if you have something to see or play with that demonstrates this.