vla / BloomFilter.NetCore

A bloom filter implementation
MIT License
147 stars 38 forks source link

Number of Hash Functions Design Question #3

Closed ChiefInnovator closed 1 year ago

ChiefInnovator commented 3 years ago

My understanding is that Bloom Filters work on the premise of using multiple hash functions (i.e., k = # of hash functions) (see https://en.wikipedia.org/wiki/Bloom_filter#Optimal_number_of_hash_functions). To add or test an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1 or check the positions are set to 1. If they are set to 1, the item maybe in the set. If any of the bits are 0, then the item is not in the set.

The implementation of Filter.cs only uses a single HashFunction. Why is that? While you can do this with a single hash function, the implementation should have the ability to use 1 or more hash functions.

To be put another way, the Bloom Filter requires "different" hash functions. There must be k different hash functions defined.

There is a discussion of using a hash function by passing k different initial values to a hash function. I am assuming that is what you are doing. Please help me confirm this.

"Alternatively, one can pass k different initial values (such as 0, 1, ..., k − 1) to a hash function that takes an initial value; or add (or append) these values to the key" from WIKIPEDIA below.

*** FROM WIKIPEDIA **** The requirement of designing k different independent hash functions can be prohibitive for large k. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields. Alternatively, one can pass k different initial values (such as 0, 1, ..., k − 1) to a hash function that takes an initial value; or add (or append) these values to the key. For larger m and/or k, independence among the hash functions can be relaxed with negligible increase in false positive rate.[3] (Specifically, Dillinger & Manolios (2004b) show the effectiveness of deriving the k indices using enhanced double hashing or triple hashing, variants of double hashing that are effectively simple random number generators seeded with the two or three hash values.) "

vla commented 3 years ago

Thanks for your interest in this project.

If you're interested, you can read this PDF Kirsch & Mitzenmacher (2006)