mavam / libbf

:dart: Bloom filters for C++11
http://mavam.github.io/libbf
BSD 3-Clause "New" or "Revised" License

Possible performance issue when performing look-ups for non-existent entries #21

Open ArashPartow opened 7 years ago

ArashPartow commented 7 years ago

In the method `basic_bloom_filter::lookup`, digests are computed first and only then quantized and looked up in the filter.

https://github.com/mavam/libbf/blob/master/src/bf/bloom_filter/basic.cc#L60

For efficiency, implementations typically have the look-up exit early on the first digest that encounters a miss, since computing a hash may cost more than the bit lookup itself.

This will probably only be a problem when look-ups of non-existent entries are expected to outnumber look-ups of present ones.
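
For illustration, here is a minimal sketch of such an early-exit look-up. This is not libbf's code; the `sketch_filter` type and the seeded hash are invented, with the seeded hash standing in for k independent (and potentially expensive) hash functions.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Illustrative sketch only, not libbf's implementation.
struct sketch_filter {
  std::vector<bool> bits_; // bit vector of the filter
  size_t k_;               // number of hash functions

  // Hypothetical i-th digest of o; a stand-in for an expensive hash.
  size_t digest(std::string const& o, size_t i) const {
    return std::hash<std::string>{}(o) ^ (i * 0x9e3779b97f4a7c15ULL);
  }

  bool lookup(std::string const& o) const {
    for (size_t i = 0; i < k_; ++i)
      if (!bits_[digest(o, i) % bits_.size()])
        return false; // first unset bit proves absence: skip remaining hashes
    return true; // all k bits set: present, or a false positive
  }
};
```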

mavam commented 7 years ago

Thanks for taking a look at the internals.

My first question: Have You Measured It? If the profiler confirms your intuition, then we're onto something. But micro-optimization usually only makes sense once you have empirical evidence that the critical path needs tuning. For example, I could equally see an issue where we're stressing the allocator by collecting the hashes in a vector in the first place. But without data confirming this hypothesis, I wouldn't spend any cycles proactively fixing low-impact code paths.

ArashPartow commented 7 years ago

> Have You Measured It?

Yes, I have. The use case is large objects (e.g., strings of 32+ bytes), which make the hash computation comparatively expensive. When an object does not exist in the Bloom filter, the expectation (mean) is that its absence is detected by roughly the k/2-th hash/lookup. In the current implementation, however, all k hashes are computed before a single bit is even looked up, so it performs about twice the expected amount of hashing, and hashing presumably accounts for a much larger share of the total operation time than testing the state of a bit in an array.

Here is an example: https://github.com/lynxoid/bloom_compare/blob/master/src/vallentine.cpp#L46


> that we're stressing the allocator

Given the design, both approaches will be exposed to that particular inefficiency, but that leads to another problem with regard to how the hasher operates. When dealing with a small number of hashes, or with data small enough to be hashed inline (e.g., 2 or 8 bytes), the current strategy makes sense, as the loop can be trivially unrolled. Unfortunately, the Bloom filter would not necessarily know about the rules or policies the hasher has incorporated.

A possible approach would be a hasher that takes bits_ and the object from the Bloom filter and returns true/false/void depending on the operation, letting the hasher make the decisions. That would somewhat resolve the allocation issue, because the digest buffer could simply be a member local to the hasher instance, but then the lookup method is no longer thread-safe.
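
Here is a hypothetical sketch of that design; the names are invented and this is not libbf's API:

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Invented interface: the hasher probes the bit vector itself, exits early,
// and owns the digest buffer, trading per-lookup allocation for reentrancy.
class probing_hasher {
public:
  explicit probing_hasher(size_t k) : k_(k) {}

  bool lookup(std::vector<bool> const& bits, std::string const& o) {
    digests_.clear(); // reuses the member buffer's capacity across calls
    for (size_t i = 0; i < k_; ++i) {
      auto d = std::hash<std::string>{}(o) ^ (i * 0x9e3779b97f4a7c15ULL);
      digests_.push_back(d);
      if (!bits[d % bits.size()])
        return false; // early exit: the remaining digests are never computed
    }
    return true;
  }

private:
  size_t k_;
  std::vector<size_t> digests_; // mutable state: a shared instance is not
                                // safe to call from multiple threads
};
```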

mavam commented 7 years ago

> Yes, I have.

Great. Thanks for sharing the results!

An easy fix would involve double hashing: we compute exactly two hash values and derive any further digests from a linear combination of the two. In your current code, you use the default hasher, which doesn't have that optimization. The make_hasher function has a third argument to enable double hashing. (The API isn't that pretty, but I hope it works for now.)
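
The scheme looks roughly like the following sketch (illustrative only; the second hash here is a made-up stand-in, not libbf's):

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Double hashing (Kirsch & Mitzenmacher): two real hash computations, and
// the remaining digests are linear combinations g_i(x) = h1(x) + i * h2(x).
std::vector<size_t> double_hash_digests(std::string const& x, size_t k) {
  size_t h1 = std::hash<std::string>{}(x);
  size_t h2 = (h1 * 0xff51afd7ed558ccdULL) | 1; // stand-in; kept odd/nonzero
  std::vector<size_t> digests(k);
  for (size_t i = 0; i < k; ++i)
    digests[i] = h1 + i * h2;
  return digests;
}
```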

Another solution would involve lazy computation of hash values. Once I finally get some time to work more on this library, I'll try to add this to the N3980 branch, which will contain the new hashing API.
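
This is a speculative illustration of what lazy digests could look like, not the actual N3980 branch:

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical lazy digest view: digest i is only computed when requested,
// so an early-exit look-up never pays for hashes it does not consume. The
// seeded hash is a stand-in for an independently expensive hash function.
class lazy_digests {
public:
  lazy_digests(std::string const& x, size_t k) : x_(x), k_(k) {}

  size_t operator[](size_t i) const {
    return std::hash<std::string>{}(x_) ^ (i * 0x9e3779b97f4a7c15ULL);
  }

  size_t size() const { return k_; }

private:
  std::string const& x_; // caller must keep x alive while this view exists
  size_t k_;
};
```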

> Unfortunately, the Bloom filter would not necessarily know about the rules or policies the hasher has incorporated.

Exactly. In particular, the number of hash functions is a runtime parameter, hence the vector representation of hash digests.