jaybaird / python-bloomfilter

Scalable Bloom Filter implemented in Python
MIT License
1.62k stars 330 forks source link

Improve hashing performance #7

Closed glangford closed 11 years ago

glangford commented 11 years ago

The idea in this change is to optimize hashing for the common case where only a single hash digest is required (very common even for very high capacities). Avoids looping, replaces list creation/extend/delete with a list comprehension, and avoids unnecessary modulus calculations.

jaybaird commented 11 years ago

It's probably better to modify _make_hashfuncs to do this, instead of adding the extra helper methods. Also it's helpful to add test cases that exercise these specific branches as well.

glangford commented 11 years ago

Unless I am missing something - I'm not sure performance can be a win putting it into one function. Anything that goes into _make_hashfuncs has to be performed for every filter add or membership test, so it would be slow.

etrepum commented 11 years ago

Why not just break in the make_hashfuncs loop if len(rval) >= num_slices?

glangford commented 11 years ago

Why not just break in the make_hashfuncs loop if len(rval) >= num_slices?

Because that test would have to be performed for every hash. The code that tests for num_salts==1, sets fmt and trim_digest, etc. to set up the closure - it is only performed once, not repeated for every add or membership test.

etrepum commented 11 years ago

Well, it's most of the win (small constant factor difference) with almost no cruft. The performance difference between the two is probably even hard to measure.

etrepum commented 11 years ago

Other ideas for larger performance enhancements would be using a linear combination of two hash functions http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf and using a faster hash algorithm such as SipHash http://cr.yp.to/siphash/siphash-20120727.pdf

glangford commented 11 years ago

Kirsch and Mitzenmacher isn't needed if one hash function has a big enough digest. But I agree in principle, I was looking for an easy win.

I also agree that my proposed change has a bit of cruft...however, the overall performance win is about 1% per line of code.... :) I see 25-30% improvement (including the 2X change to the BloomFilter size). So on that basis I proposed it. The goal was to make the Python bit as lean as possible - and of course if the underlying hash is quick too, that is even better.