Closed RoyBellingan closed 4 years ago
Hello, @mavam, I have the same problem.
Here's the problematic part in the basic bloom filter: `libbf/src/bloom_filter/basic.cpp`
```cpp
basic_bloom_filter::basic_bloom_filter(double fp, size_t capacity, size_t seed,
                                       bool double_hashing, bool partition)
  : partition_(partition) {
  auto required_cells = m(fp, capacity);
  auto optimal_k = k(required_cells, capacity);
  bits_.resize(required_cells);
  hasher_ = make_hasher(optimal_k, seed, double_hashing);
}

void basic_bloom_filter::add(object const& o) {
  auto digests = hasher_(o);
  assert(bits_.size() % digests.size() == 0);
  if (partition_) {
    auto parts = bits_.size() / digests.size();
    for (size_t i = 0; i < digests.size(); ++i)
      bits_.set(i * parts + (digests[i] % parts));
  } else {
    for (auto d : digests)
      bits_.set(d % bits_.size());
  }
}
```
`partition` is true by default, but the methods that compute `required_cells` and `optimal_k` do not guarantee that `required_cells % optimal_k == 0`. And even if `partition` is false, the assert will still fail in a debug build.
The proposed solution is to align the size of the bit vector (round it up to a multiple of `optimal_k`) and to use the assert only when `partition` is true.
I've made a pull request for this: https://github.com/mavam/libbf/pull/49
This issue is resolved now.
In my test using a very large number of records (around 100M), it happens that, say, `required_cells = 500000` and `optimal_k = 7`.
So I just bump `required_cells` to 500003 (`required_cells += optimal_k - required_cells % optimal_k` when the remainder is nonzero).
I have done some testing and it seems the properties of the filter are kept; from what I understood of the underlying math there will be no problem at all, just a "waste" of 3 bits...
I do not have the code with me; if you think it is a safe move I'll send it. Bye