Closed sebfischer83 closed 2 years ago
Sorry for the delayed response. I don't think this is a bug.
When the bloom filter returns false
for KeyExists
, we can trust 100% that it doesn't exist.
But if it returns true
for KeyExists
, it is not guaranteed that the key exists.
That is by design.
https://medium.datadriveninvestor.com/bloom-filter-a-simple-but-interesting-data-structure-37fd53b11606
https://en.wikipedia.org/wiki/Bloom_filter
In your example bloomFilter.KeyExists("ba111r")
returns true
, though it is not guaranteed that it is true
. Similarly, bloomFilter.KeyExists("dfs11j")
returns true
, though it is not guaranteed that it is true
. Increasing the size of the bloom filter and playing with the number of hash functions can reduce the probabilty of false positives. But if bloomFilter.KeyExists("dfs11j")
or bloomFilter.KeyExists("dfs11j")
ever returned false
, we can trust that 100%.
In my latest change, I added the ability to change the number of hash functions, by default it is 2.
Hi, sorry I think I was a bit unclear. For string the hashcode ist computed by
private IEnumerable<int> getHashes(T key)
{
for (var i = 1; i <= numberOfHashFunctions; i++)
{
var obj = new { Key = key, InitialValue = i };
yield return Math.Abs(obj.GetHashCode());
}
}
When I now, save my filter and will load it again on another run it won't work because since Core the HashCode for a string changes on every run of the program.
What I used as a hash function is pretty simplistic, which likely uses the memory location of the string to compute the hash. I created this library just to demonstrate it in its most simplified form.
If you would like to serialize the Bloom Filter and load it again on process restart, you may need to use some other way of hashing it, that would persist across process restarts. Or you can reload from an empty bloom filter each time program starts, by adding the keys again one by one, which may not be efficient.
Hi,
BloomFilter uses GetHashCode, which yields for strings no deterministic value. This means every test returns true.