Closed patelprateek closed 1 year ago
There is no difference between the SBBF in this repo and Kudu's; I'm the author of both.
You can tune the fpp by setting any space you like; there is no default space bits per value, though there is a fixed hash function parameter of 8 per value, which cannot be tuned. You can use libfilter_block_bytes_needed
or MinSpaceNeeded
or BytesNeeded
if you want to find out how much space is needed for a certain number of values and certain fpp. If you know how much space you need, you can construct a filter with CreateWithBytes
or libfilter_block_init
.
While i am going through some amazing work you are doing , wanted to get your thoughts on some implementation details or caveats
IIUC SBBF is same as in https://github.com/apache/kudu/blob/master/src/kudu/util/block_bloom_filter.h ? Are there any implementation difference or caveats in your particular approach (any significant perf difference) ? Bit confused around terminology here between
block
andsplit block
so wanted to understand better if your implementation is something additional on top of classical blocked bloom filter.I found your repo through : https://arxiv.org/pdf/2101.01719.pdf , it says for the split block bloom fpp are somewhat fixed due to pre-determined number of hash functions used . In your repo example is see
double fpp = 0.0065;
, is this because we are using a fixed number of hashes and bits per value (pre-configured at the given fpp level) . Am i able to tune this for lower fpp by using more hash functions and may be more bits per value ? What is the current default bits per value set in current SBBF implementation ? which parameter is that ?In the apache kudu project which uses block bloom filter , here : https://github.com/apache/kudu/blob/master/src/kudu/util/block_bloom_filter.h#L66 it says we are able to get 0.1% fpp with 15 bits per value , is it possible to achieve this in your split block BF implementation as well , or are we stuck to 0.4% - 19% range only ,
thanks for bearing with my naive questions as i am new into this area and still trying to figure out best filter for my case where i have strict latency requirement and need to insert ~10-100 million ids , these ids are not uuid but sequential range [0-100M] , so was trying to use roaring bitmaps before but now looking into this probabilistic filters since constructing filters is a potential bottleneck for me .