ssbc / jitdb

A database on top of a log with automatic index generation and maintenance

Radix idea: index that can look up any mentioned msg key #40

Closed staltz closed 3 years ago

staltz commented 3 years ago

Some more crazy ideas to use bitvectors: I know it's not feasible to have jitdb indexes for votes, because there are too many vote roots (and thus one index per vote root), so currently in ssb-db2 we use leveldb for that. But what if we did something similar to radix sort?

Let's say we take the msgid at msg.value.content.vote.link, convert it to a buffer, and take the 1st bit; then all messages where the 1st bit of msg.value.content.vote.link equals 1 are indexed in a bitvector. You could do the same for the 2nd and 3rd bits too, the goal being just to limit the search space, and then once it's small enough, scan the remaining results in memory. This could be combined with #8.
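As a rough sketch of the indexing side (in-memory arrays and illustrative names only, not jitdb's actual index format), assuming classic `%<base64>.sha256` msg ids:

```js
// Sketch: build B per-bit bitvectors over the log (illustrative only).
const B = 3 // number of leading bits of the link that we index

// Decode '%<base64>.sha256' into a Buffer so we can look at its raw bits.
function keyToBuffer(msgKey) {
  return Buffer.from(msgKey.slice(1, -'.sha256'.length), 'base64')
}

// Bit i of the key, 0-based, most significant bit first.
function bitOfKey(buf, i) {
  return (buf[i >> 3] >> (7 - (i % 8))) & 1
}

// bitvectors[i][seq] is true when the link of msg number `seq` has bit i set.
// A real implementation would pack this into typed arrays (see further below).
const bitvectors = Array.from({ length: B }, () => [])

function indexMsg(seq, msg) {
  const link = msg.value.content?.vote?.link
  if (!link) return
  const buf = keyToBuffer(link)
  for (let i = 0; i < B; i++) {
    bitvectors[i][seq] = bitOfKey(buf, i) === 1
  }
}
```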

i.e. one such bitvector per indexed bit position.

Then we can create a query such as votesFor(msgid): pick the first 3 bits of msgid and AND the corresponding bitvectors from the list above. Maybe the optimal number of these indexes is about 16?
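The query side could then look roughly like this, reusing the helpers from the sketch above; `getMsgBySeq` is a hypothetical accessor from sequence number to message:

```js
// Sketch: votesFor(msgId) by ANDing the B bitvectors, then scanning.
function votesFor(msgId, getMsgBySeq) {
  const buf = keyToBuffer(msgId)
  const candidates = []
  for (let seq = 0; seq < bitvectors[0].length; seq++) {
    let match = true
    for (let i = 0; i < B; i++) {
      // Keep seq only if every indexed bit of the stored link agrees with
      // the corresponding bit of msgId (the AND of the chosen bitvectors).
      if (bitvectors[i][seq] !== (bitOfKey(buf, i) === 1)) {
        match = false
        break
      }
    }
    if (match) candidates.push(seq)
  }
  // The candidates may still contain false positives (keys that share the
  // same B-bit prefix), so load them and compare the full link.
  return candidates
    .map(getMsgBySeq)
    .filter((msg) => msg.value.content?.vote?.link === msgId)
}
```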

If we have 16 bitvectors for each shape of mentioned key (msg_key, msg_value_content_vote_link, and so on), then we start to have a lot of indexes.

In that case, maybe we could save all 16 vectors into one file, and then we'd have bitmatrix files, Nx16 as opposed to Nx1 (e.g. msg_key_bitmatrix.index, msg_value_content_vote_link_bitmatrix.index).

Open questions

How much does 16 bits reduce the search space?

Let's assume the msg IDs have their 256 bits randomly distributed. If we have only one index, for the 1st bit, say B=1, then the search space would be cut in half. If we had two indexes, for the 1st and 2nd bits, say B=2, then the search space would be reduced to one quarter of the original N messages in the log.

In the general case, the search space is reduced to N / (2**B).
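For concreteness, plugging numbers into that formula (still assuming uniformly distributed key bits):

```js
// Expected size of the candidate set for N msgs and B indexed bits.
const searchSpace = (N, B) => N / 2 ** B
searchSpace(1e6, 10) // ~977
searchSpace(1e6, 16) // ~15
searchSpace(1e6, 19) // ~2
```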

Assuming N is one million, you need approx B=19 to get the search space reduced exactly to your target msg key. See Wolfram Alpha: https://www.wolframalpha.com/input/?i=%281+million%29+%2F+%282**b%29+%3D+1+solve+for+b

Now, maybe we can do with less than 19: we don't need the search space to be reduced exactly to our desired result, and there could be some false positives in there if we assume that scanning the search space is fast enough. We just have to experiment and see, but given that 19 is close to 16, maybe we can assume that 16 is a realistic value for B.

Are there bad performance issues with this idea?

Hard to know exactly how this will perform, but there are these new costs involved:

Could the GPU be used here?

Since we have a matrix of zeros and ones, this looks like a good use case for the GPU and for parallelization, since we can just give parts of the bitmatrix to different "GPU threads" and they would be mostly independent of each other. This is something we would only do if the linear scan of the bitmatrix is slow enough to justify the overhead of the GPU bus interface.

staltz commented 3 years ago

How much does 16 bits reduce the search space?

Note: the math I ran for this question was for the msg_key_bitmatrix.index, where each message in the log has its own unique msg id. Under those same assumptions, with B=10 we would reduce the search space to ~1000 msgs, which is probably fast enough to stream and run a pull.filter.
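A sketch of what that final scan could look like with pull-stream, where `candidateSource` is a hypothetical stream of the msgs that survived the bit-prefix AND:

```js
const pull = require('pull-stream')

// Sketch: filter the ~1000 candidates down to exact matches.
function votesForExact(candidateSource, msgId, cb) {
  pull(
    candidateSource,
    // Drop false positives: keys sharing the B-bit prefix but differing later.
    pull.filter((msg) => msg.value.content?.vote?.link === msgId),
    pull.collect(cb)
  )
}
```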

However, for other indexes like msg_value_content_vote_link_bitmatrix.index, B=10 would probably reduce the search space even more than the one above, so maybe we'd get a dozen msgs, thus quite fast. Maybe the optimal number is not 16 but 10? That's even better, and the file size would be less than approx 1.5MB.

mixmix commented 3 years ago

Idea - instead of using the msgkey everywhere, use a representation for each key which is "unique enough" for our log.

Similar to idea where most git commits can be uniquely referenced by first 6 chars. Only add 7th when you get collisions...

Another idea is to have a table which maps each key to an int. Then use those ints everywhere.
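A minimal sketch of that second idea, with an in-memory table (a real implementation would persist it alongside the other indexes):

```js
// Sketch: intern msg keys as small ints (illustrative only).
const keyToInt = new Map()
const intToKey = []

function internKey(msgKey) {
  let id = keyToInt.get(msgKey)
  if (id === undefined) {
    id = intToKey.length
    intToKey.push(msgKey)
    keyToInt.set(msgKey, id)
  }
  return id
}

// Indexes can then store 32-bit ints instead of ~50-byte key strings,
// and intToKey[id] recovers the original key when needed.
```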

staltz commented 3 years ago

Really good point about the comparison with git commit ids.

Note to self: if 1 million msgs roughly corresponds to 5 years worth of hermies data, then with partial replication, 1 year worth of data might be 200k messages, and then maybe we can use a smaller value of B such as 8.
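(Under the same uniform-bits assumption, 200k msgs with B=8 would leave roughly 200k / 2**8 ≈ 780 candidates to scan, in the same ballpark as the B=10 case above.)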

arj03 commented 3 years ago

The thing with these indexes is that the result is rather small: for every message you might have fewer than 100 other messages linking to it. We do already have indexes that are offset based and store ints or floats. So it might actually work to compress these result sets into something that can be stored in a 32-bit one, using an RLE scheme.
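As an illustration of the general RLE idea (only a sketch, not necessarily the exact scheme tried here):

```js
// Sketch: run-length encode a sorted list of sequence numbers as
// [start, runLength] pairs.
function rleEncode(seqs) {
  const runs = []
  for (const seq of seqs) {
    const last = runs[runs.length - 1]
    if (last && seq === last[0] + last[1]) last[1]++
    else runs.push([seq, 1])
  }
  return runs
}

rleEncode([3, 4, 5, 9, 10, 42]) // => [[3, 3], [9, 2], [42, 1]]
```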

staltz commented 3 years ago

@arj03 Can you clarify? I read your comment several times and I'm still unsure what you're referencing, are you quoting mix's idea or my first post?

The thing with these indexes is that the result is rather small.

By "thing" do you mean problem or advantage? Is "these indexes" the offsets indexes or the radix idea indexes? And what do you mean with "results"?

I'm sorry for all these questions, I just often get lost if I'm not sure what you are referring to.

arj03 commented 3 years ago

I was mostly thinking of what mix mentioned about the integer offset, and trying to compress our current level implementation into something very small. I'm not sure it is possible: I ran a few tests and it doesn't compress enough. Sorry to derail the original thread; this was another solution to the same problem.

staltz commented 3 years ago

Got it, thanks for clarifying! (PS I don't mind the off topic, in my experience all good conversations go off topic)

staltz commented 3 years ago

Idea: constrain B to multiples of 8, and constrain the underlying implementation to use either Uint8Array, Uint16Array, or Uint32Array for the bitmatrices.
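A sketch of what that could look like with B=16, where each message's 16 indexed bits become one element of a Uint16Array (hypothetical names and layout, not necessarily how jitdb ended up storing it):

```js
// Sketch: B = 16 bits per message, one Uint16Array element per msg.
const B = 16
const logSize = 1e6 // hypothetical log length
const prefixes = new Uint16Array(logSize)

// First 16 bits of the decoded '%<base64>.sha256' key, big-endian.
function prefix16(msgKey) {
  const buf = Buffer.from(msgKey.slice(1, -'.sha256'.length), 'base64')
  return buf.readUInt16BE(0)
}

function indexMsg(seq, link) {
  // Note: a real implementation would also need to mark entries with no
  // link, since the default 0 collides with a genuine 0x0000 prefix.
  prefixes[seq] = prefix16(link)
}

// Query: linear scan comparing 16-bit prefixes; callers still verify
// the full key on the surviving seqs.
function candidateSeqs(msgId) {
  const p = prefix16(msgId)
  const result = []
  for (let seq = 0; seq < prefixes.length; seq++) {
    if (prefixes[seq] === p) result.push(seq)
  }
  return result
}
```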

staltz commented 3 years ago

This is now done; it's called prefix indexes.