jedisct1 / rust-bloom-filter

A fast Bloom filter implementation in Rust
BSD 2-Clause "Simplified" License
235 stars 51 forks source link

Number of hash functions (k) should be floored, not ceiled #45

Open ashtul opened 3 months ago

ashtul commented 3 months ago

Historically, when calculating the number of hash functions (k) for Bloom filters, developers have commonly opted to use ceil() on the resulting float value. However, this approach is suboptimal. Instead, calling floor() on the result offers clear advantages over ceiling.

Flooring the number of hash functions does not meaningfully degrade the false-positive rate, as reducing by just one function still maintains a similar error probability. The risk of degradation only becomes a concern when additional hash functions are removed. However, by using floor() instead of ceil(), benchmarks demonstrate a notable 10% improvement in processing time. This improvement stems from the reduced number of hash functions, which decreases the filter's fill rate while keeping the same total bit count. As a result, the computational efficiency is enhanced without significantly compromising accuracy.

Moreover, having fewer hash functions means that, for elements that are 'probably in the set,' fewer bits need to be checked. For elements that are 'definitely not in the set,' each hash function has a higher chance of failing earlier due to the lower fill rate, further boosting overall efficiency.

With ceil()
test bench_get_100     ... bench:          62.71 ns/iter (+/- 0.60)
test bench_get_1000    ... bench:          62.35 ns/iter (+/- 0.63)
test bench_get_m_1     ... bench:          79.28 ns/iter (+/- 2.05)
test bench_get_m_10    ... bench:         101.78 ns/iter (+/- 10.46)
test bench_insert_100  ... bench:         119.64 ns/iter (+/- 3.92)
test bench_insert_1000 ... bench:         119.94 ns/iter (+/- 3.78)
test bench_insert_m_1  ... bench:         127.84 ns/iter (+/- 2.51)
test bench_insert_m_10 ... bench:         183.98 ns/iter (+/- 33.69)

to

With floor()
test bench_get_100     ... bench:          55.19 ns/iter (+/- 0.90)
test bench_get_1000    ... bench:          57.16 ns/iter (+/- 0.49)
test bench_get_m_1     ... bench:          71.36 ns/iter (+/- 0.70)
test bench_get_m_10    ... bench:          94.81 ns/iter (+/- 17.49)
test bench_insert_100  ... bench:         105.73 ns/iter (+/- 7.65)
test bench_insert_1000 ... bench:         105.82 ns/iter (+/- 3.88)
test bench_insert_m_1  ... bench:         115.46 ns/iter (+/- 3.02)
test bench_insert_m_10 ... bench:         162.82 ns/iter (+/- 18.40)
jedisct1 commented 3 weeks ago

Shouldn't the number of hash functions be rounded instead?