justcoding121 / advanced-algorithms

100+ algorithms & data structures generically implemented in C#
MIT License
1.3k stars 288 forks source link

BloomFilter with string doesn't work with .Net Core #36

Closed sebfischer83 closed 2 years ago

sebfischer83 commented 3 years ago

Hi,

BloomFilter uses GetHashCode, which yields for strings no deterministic value. This means every test returns true.

     Advanced.Algorithms.DataStructures.BloomFilter<string> bloomFilter = new Advanced.Algorithms.DataStructures.BloomFilter<string>(10);
            bloomFilter.AddKey("foo");
            bloomFilter.AddKey("bar");
            bloomFilter.AddKey("apple");
            bloomFilter.AddKey("orange");
            bloomFilter.AddKey("banana");

            bloomFilter.KeyExists("bar").ShouldBe(true);
            bloomFilter.KeyExists("ba111r").ShouldBe(true); // is true
            bloomFilter.KeyExists("banana").ShouldBe(true);
            bloomFilter.KeyExists("dfs11j").ShouldBe(false);`// is also true
justcoding121 commented 2 years ago

Sorry for the delayed response. I don't think this is a bug.

When the bloom filter returns false for KeyExists, we can trust 100% that it doesn't exist. But if it returns true for KeyExists, it is not guaranteed that the key exists. That is by design. https://medium.datadriveninvestor.com/bloom-filter-a-simple-but-interesting-data-structure-37fd53b11606 https://en.wikipedia.org/wiki/Bloom_filter

In your example bloomFilter.KeyExists("ba111r") returns true, though it is not guaranteed that it is true. Similarly, bloomFilter.KeyExists("dfs11j") returns true, though it is not guaranteed that it is true. Increasing the size of the bloom filter and playing with the number of hash functions can reduce the probabilty of false positives. But if bloomFilter.KeyExists("dfs11j") or bloomFilter.KeyExists("dfs11j") ever returned false, we can trust that 100%.

In my latest change, I added the ability to change the number of hash functions, by default it is 2.

sebfischer83 commented 2 years ago

Hi, sorry I think I was a bit unclear. For string the hashcode ist computed by

private IEnumerable<int> getHashes(T key)
        {
            for (var i = 1; i <= numberOfHashFunctions; i++)
            {
                var obj = new { Key = key, InitialValue = i };
                yield return Math.Abs(obj.GetHashCode());
            }
        }

When I now, save my filter and will load it again on another run it won't work because since Core the HashCode for a string changes on every run of the program.

justcoding121 commented 2 years ago

What I used as a hash function is pretty simplistic, which likely uses the memory location of the string to compute the hash. I created this library just to demonstrate it in its most simplified form.

If you would like to serialize the Bloom Filter and load it again on process restart, you may need to use some other way of hashing it, that would persist across process restarts. Or you can reload from an empty bloom filter each time program starts, by adding the keys again one by one, which may not be efficient.