Closed by glangford 11 years ago
It's probably better to modify _make_hashfuncs to do this instead of adding the extra helper methods. It would also help to add test cases that exercise these specific branches.
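Something along these lines could exercise both branches (a hedged sketch: the test names are hypothetical, and the capacity/error_rate values are guesses at parameters that land in the single-digest and multi-digest paths, since the exact thresholds depend on which digest size the factory picks):

```python
import unittest
from pybloom import BloomFilter

class HashBranchTests(unittest.TestCase):
    # Hypothetical parameters: a small filter should need only one
    # digest per key; a very low error rate should force several salts.
    def test_single_digest_branch(self):
        f = BloomFilter(capacity=1000, error_rate=0.001)
        for i in range(100):
            f.add(str(i))
        self.assertTrue(all(str(i) in f for i in range(100)))

    def test_multi_digest_branch(self):
        f = BloomFilter(capacity=1000000, error_rate=0.000001)
        for i in range(100):
            f.add(str(i))
        self.assertTrue(all(str(i) in f for i in range(100)))

if __name__ == '__main__':
    unittest.main()
```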
Unless I'm missing something, I'm not sure putting this into one function can be a performance win. Anything that goes into _make_hashfuncs has to be performed for every filter add or membership test, so it would be slow.
Why not just break in the make_hashfuncs loop if len(rval) >= num_slices?
Because that test would have to be performed for every hash. The code that tests for num_salts == 1 and sets fmt, trim_digest, etc. to set up the closure runs only once; it is not repeated for every add or membership test.
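To make the one-time vs. per-call distinction concrete, here is a minimal sketch of the looping path (simplified: it fixes MD5 and a four-word 'IIII' unpack format, where the real factory chooses the hash algorithm and format from num_slices and num_bits):

```python
import hashlib
from struct import pack, unpack

def make_hashfuncs(num_slices, num_bits):
    # Setup: runs once, when the filter is constructed.
    fmt = 'IIII'                     # four 32-bit words per MD5 digest
    num_salts = -(-num_slices // 4)  # ceiling division
    salts = tuple(hashlib.md5(pack('I', i)) for i in range(num_salts))

    def _hashes(key):
        # Per-call body: runs on every add and membership test.
        rval = []
        for salt in salts:
            h = salt.copy()
            h.update(key)
            rval.extend(unpack(fmt, h.digest()))
            if len(rval) >= num_slices:  # the per-digest test debated above
                break
        return [u % num_bits for u in rval[:num_slices]]

    return _hashes

# Example: ten slice indexes for one key (three digests, loop breaks early).
print(make_hashfuncs(10, 1000)(b'example-key'))
```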
Well, it's most of the win (a small constant-factor difference) with almost no cruft. The performance difference between the two is probably hard to measure at all.
Other ideas for larger performance enhancements would be using a linear combination of two hash functions (http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf) and using a faster hash algorithm such as SipHash (http://cr.yp.to/siphash/siphash-20120727.pdf).
The Kirsch-Mitzenmacher trick isn't needed if one hash function has a big enough digest. But I agree in principle; I was looking for an easy win.
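For reference, the Kirsch-Mitzenmacher idea is to derive all k slice indexes from a linear combination of two base hashes, g_i(x) = (h1(x) + i * h2(x)) mod m. A minimal sketch, illustrative only and not part of this change:

```python
import hashlib
from struct import unpack

def km_indexes(key, num_slices, num_bits):
    # Split one 128-bit MD5 digest into two 64-bit base hashes, then
    # combine them linearly to simulate num_slices independent hashes.
    h1, h2 = unpack('QQ', hashlib.md5(key).digest())
    return [(h1 + i * h2) % num_bits for i in range(num_slices)]

# Example: ten indexes into a 2**20-bit slice for one key.
print(km_indexes(b'example-key', 10, 1 << 20))
```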
I also agree that my proposed change has a bit of cruft. However, the overall performance win is about 1% per line of code :) I see a 25-30% improvement (including the 2X change to the BloomFilter size), and on that basis I proposed it. The goal was to make the Python side as lean as possible; if the underlying hash is quick too, that is even better.
The idea in this change is to optimize hashing for the common case where only a single hash digest is required (common even at very high capacities). It avoids looping, replaces list creation/extend/delete with a list comprehension, and avoids unnecessary modulus calculations.
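A sketch of that single-digest fast path, following the description above (the choice between this closure and the looping one would be made once, at setup; MD5, the unsalted base hash, and the 'IIII' format are simplifying assumptions of this sketch):

```python
import hashlib
from struct import unpack

def make_single_digest_hashfunc(num_slices, num_bits):
    # Valid only when one digest yields enough words: with MD5's four
    # 32-bit words, that means num_slices <= 4 in this sketch.
    assert num_slices <= 4
    fmt = 'IIII'
    base = hashlib.md5()

    def _hashes(key):
        h = base.copy()
        h.update(key)
        # A list comprehension replaces create/extend/delete, and the
        # modulus is applied only to the words actually needed.
        return [u % num_bits for u in unpack(fmt, h.digest())[:num_slices]]

    return _hashes

# Example: three slice indexes for one key.
hashes = make_single_digest_hashfunc(3, 1000000)
print(hashes(b'example-key'))
```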