jbapple / libfilter

High-speed Bloom filters and taffy filters for C, C++, and Java
Apache License 2.0
32 stars 6 forks source link

split block bloom filters implementation #22

Closed patelprateek closed 1 year ago

patelprateek commented 2 years ago

While i am going through some amazing work you are doing , wanted to get your thoughts on some implementation details or caveats

IIUC SBBF is same as in https://github.com/apache/kudu/blob/master/src/kudu/util/block_bloom_filter.h ? Are there any implementation difference or caveats in your particular approach (any significant perf difference) ? Bit confused around terminology here between block and split block so wanted to understand better if your implementation is something additional on top of classical blocked bloom filter.

I found your repo through : https://arxiv.org/pdf/2101.01719.pdf , it says for the split block bloom fpp are somewhat fixed due to pre-determined number of hash functions used . In your repo example is see double fpp = 0.0065; , is this because we are using a fixed number of hashes and bits per value (pre-configured at the given fpp level) . Am i able to tune this for lower fpp by using more hash functions and may be more bits per value ? What is the current default bits per value set in current SBBF implementation ? which parameter is that ?

In the apache kudu project which uses block bloom filter , here : https://github.com/apache/kudu/blob/master/src/kudu/util/block_bloom_filter.h#L66 it says we are able to get 0.1% fpp with 15 bits per value , is it possible to achieve this in your split block BF implementation as well , or are we stuck to 0.4% - 19% range only ,

thanks for bearing with my naive questions as i am new into this area and still trying to figure out best filter for my case where i have strict latency requirement and need to insert ~10-100 million ids , these ids are not uuid but sequential range [0-100M] , so was trying to use roaring bitmaps before but now looking into this probabilistic filters since constructing filters is a potential bottleneck for me .

jbapple commented 2 years ago

There is no difference between the SBBF in this repo and Kudu's; I'm the author of both.

You can tune the fpp by setting any space you like; there is no default space bits per value, though there is a fixed hash function parameter of 8 per value, which cannot be tuned. You can use libfilter_block_bytes_needed or MinSpaceNeeded or BytesNeeded if you want to find out how much space is needed for a certain number of values and certain fpp. If you know how much space you need, you can construct a filter with CreateWithBytes or libfilter_block_init.