mavam / libbf

:dart: Bloom filters for C++11
http://mavam.github.io/libbf
BSD 3-Clause "New" or "Revised" License

Avoid assert failure on bits_.size() % digests.size() #18

Closed RoyBellingan closed 4 years ago

RoyBellingan commented 8 years ago

In my tests using a very large number of records (around 100M), it can happen that, say, required_cells = 500,000 and optimal_k = 7.

So I just change required_cells to 500,003... (required_cells += optimal_k - required_cells % optimal_k, i.e. rounding it up to the next multiple of optimal_k)

I have done some testing and it seems the properties of the filter are kept. Also, from what I understood of the underlying math there will be no problem at all, just a "waste" of 3 bits...

I do not have the code with me; if you think it is a safe move I'll send it. Bye
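
For illustration, a minimal standalone sketch of the round-up described above, using the numbers from this comment (this is not the reporter's actual code, which was never posted):

#include <cstddef>
#include <iostream>

int main() {
  std::size_t required_cells = 500000; // from the example above
  std::size_t optimal_k = 7;
  // Round required_cells up to the next multiple of optimal_k.
  std::size_t remainder = required_cells % optimal_k; // 4 here
  if (remainder != 0)
    required_cells += optimal_k - remainder; // 500,000 -> 500,003
  std::cout << required_cells << "\n";       // prints 500003
}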

stmotw commented 4 years ago

Hello, @mavam, I have the same problem.

Here's the problematic part in the basic bloom filter, libbf/src/bloom_filter/basic.cpp:

basic_bloom_filter::basic_bloom_filter(double fp, size_t capacity, size_t seed,
                                       bool double_hashing, bool partition)
    : partition_(partition) {
  auto required_cells = m(fp, capacity);
  auto optimal_k = k(required_cells, capacity);
  bits_.resize(required_cells);
  hasher_ = make_hasher(optimal_k, seed, double_hashing);
}

void basic_bloom_filter::add(object const& o) {
  auto digests = hasher_(o);
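  // Fails whenever bits_.size() (== required_cells) is not a multiple of
  // digests.size() (== optimal_k).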
  assert(bits_.size() % digests.size() == 0);
  if (partition_) {
    auto parts = bits_.size() / digests.size();
    for (size_t i = 0; i < digests.size(); ++i)
      bits_.set(i * parts + (digests[i] % parts));
  } else {
    for (auto d : digests)
      bits_.set(d % bits_.size());
  }
}

partition is true by default, but the methods that compute required_cells and optimal_k do not guarantee that required_cells % optimal_k == 0. And even if partition is false, the assert will still fail in a debug build.
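
To illustrate, here is a small standalone example using the textbook formulas m = ceil(-n·ln(p)/(ln 2)^2) and k = round((m/n)·ln 2); libbf's m() and k() may differ in details, but the point is the same: m is generally not a multiple of k.

#include <cmath>
#include <cstddef>
#include <iostream>

int main() {
  std::size_t n = 1000; // capacity
  double p = 0.01;      // desired false-positive probability
  double ln2 = std::log(2.0);
  // Optimal number of cells and hash functions for n and p.
  auto m = static_cast<std::size_t>(std::ceil(-(n * std::log(p)) / (ln2 * ln2)));
  auto k = static_cast<std::size_t>(std::round(m / static_cast<double>(n) * ln2));
  // Prints m = 9586, k = 7, m % k = 3.
  std::cout << "m = " << m << ", k = " << k << ", m % k = " << m % k << "\n";
}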

The proposed solution is to align the size of the bit vector to a multiple of the number of hash functions, and to use the assert only if partition is true.
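
For concreteness, here is one way the aligned constructor and add() could look, reusing the declarations from the basic.cpp snippet above (only an illustrative sketch; the actual pull request may differ):

basic_bloom_filter::basic_bloom_filter(double fp, size_t capacity, size_t seed,
                                       bool double_hashing, bool partition)
    : partition_(partition) {
  auto required_cells = m(fp, capacity);
  auto optimal_k = k(required_cells, capacity);
  // Pad the bit vector so its size is a multiple of the number of hash
  // functions; partitioning relies on this invariant.
  if (auto rem = required_cells % optimal_k)
    required_cells += optimal_k - rem;
  bits_.resize(required_cells);
  hasher_ = make_hasher(optimal_k, seed, double_hashing);
}

void basic_bloom_filter::add(object const& o) {
  auto digests = hasher_(o);
  if (partition_) {
    // Only partitioning requires bits_.size() to be divisible by digests.size().
    assert(bits_.size() % digests.size() == 0);
    auto parts = bits_.size() / digests.size();
    for (size_t i = 0; i < digests.size(); ++i)
      bits_.set(i * parts + (digests[i] % parts));
  } else {
    for (auto d : digests)
      bits_.set(d % bits_.size());
  }
}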

I've made a pull request for this: https://github.com/mavam/libbf/pull/49

stmotw commented 4 years ago

This issue is resolved now.