Closed meling closed 1 year ago
On May 17, 2023 3:16:58 AM GMT+02:00, Hein Meling @.***> wrote:
false positive). In my experience with BBHash, false positives are rare.
Can you provide some example? Given the number of bits used by these maps, what you claim is simply impossible. There are not enough maps for that to happen.
It might be possible that versions of BBHash using a lot of space (like, twice RecSplit) can sometimes detect if a key wasn't in the original dataset, as they use, indeed, more space. But when you use 3 bits per key you cannot detect false positives unless you're very lucky (you can easily do the math).
Sorry, I'm of course wrong and I knew it... a big lapse of mind to claim it was rare. Below are some statistics from my reimplementation of BBHash. Looks like the number of false positives grows with the size of the map (0.82, 0.91, 0.96). However, the way we use BBHash, there is still some value in returning 0 (key is definitely not in the set).
I'm guessing that with RecSplit and using even fewer bits per key (e.g. 1.6), what you are saying is that these probabilities approaches 1??
=== RUN TestManyKeys/name=Sequential_/gamma=2.0/keys=1000
bbhash_test.go:151: BBHash(gamma=2.0, entries=1000, levels=6, bits=3392, size=424 B, bits per key=3.4)
bbhash_test.go:152: found 82373084 false positive keys: 0.823731
=== RUN TestManyKeys/name=Sequential_/gamma=2.0/keys=10000
bbhash_test.go:151: BBHash(gamma=2.0, entries=10000, levels=9, bits=33408, size=4.1 KB, bits per key=3.3)
bbhash_test.go:152: found 91337297 false positive keys: 0.913373
=== RUN TestManyKeys/name=Sequential_/gamma=2.0/keys=100000
bbhash_test.go:151: BBHash(gamma=2.0, entries=100000, levels=11, bits=330496, size=40.3 KB, bits per key=3.3)
bbhash_test.go:152: found 96737320 false positive keys: 0.967373
Yes, it's a matter of space. The more space you use, the more bits you have to test for false positives. 1.44 bits per key are just necessary to return a correct output on the key set. Anything more than that counts as bits you can consider as signatures. So if you have 1 bit more (2.44 overall) you might catch 50% false positives (but that doesn't really happen, it's an idealized model).
If you wanna obtain that result with RecSplit, just add the bits. Use a bit vector to store, say, 1 bit per key, which is the parity of the hash of the key. When you pass a key through RecSplit, check that the parity of its hash is equal to the stored parity. If not, return -1 or something. Easy! It will still use much less space than BBHash at that compression level, and catch 50% false positives.
BTW, Peter Sanders has a parallel implementation that builds quickly maps with 1.6 bits/key.
Thanks for the idea!
Yeah, I think I've seen the parallel implementation, but I'm testing on an Apple M2 Max, and I had some problems compiling it on this machine, but maybe that was the SIMD/GPU versions. Will check it again to see if there is a non-x86 variant.
Let me know if it is a different repo than this one: https://github.com/ByteHamster/GpuRecSplit
Closing as resolved; nothing to fix. Thanks for the help.
After playing around with RecSplit a bit, I came to realize that the lookup semantics (the operator()) is different from that of BBHash. That is, in RecSplit a key that is not in the map may return any value. In BBHash, a key that is not in the map will return a zero value, but may also return any value (a false positive). In my experience with BBHash, false positives are rare.
I'm wondering if RecSplit could be adjusted to a similar semantics; that is to return
0
if it is known that the key is not in the map.1, 2, ..., n
corresponding to then
keys.To be clear, I'm not asking that you change your implementation in this repository, only whether or not it is an inherent limitation with RecSplit's underlying data structures. I've looked at the code and the paper, but on first glance it was not trivial to understand.