Open jutilan opened 7 years ago
A quick test shows calling the hash a second time (with a hard-coded seed) and adding these two additional values at staggered intervals (e.g. every 2 and every 3 iterations) drastically improves the accuracy
This may also drastically slow it down.
Problems come when the step
computed as LongMath.mod(hash2, bitSize)
leads to a small period for a given bitSize
(the computation is different, but equivalent). It's worst when step == 0
, which has a probability of 1.0 / bitSize
, which is too bad for small bit sizes. As the OP writes, step == 32
and bitSize == 64
is pretty bad, too. Some minor problems probably occur also in cases like step == 3
and bitSize == 192
, where the period is 64 (For this bit size, the step being any multiple of three is that bad or worse; e.g., step == 24
leads to a period of 8).
There are no problems when step
and bitSize
are co-prime. The simplest way to achieve this is to round bitSize
up to the next power of two and ensure that hash2
is odd. This also makes the filter significantly faster by replacing the modulus by masking. The price is consuming up to twice as much memory with the additional gain of better than requested fpp
.
Even without changing bitSize
, ensuring that hash2
is odd by using long hash2 = upperEight(bytes) | 1
sounds like a good idea as it guarantees a period of at least 64.
A funny idea could be using combinedHash += hash2 + i;
in line 106.
I have encountered this problem as well. A fix would be much appreciated šš»
Bloom Filters created with less than 500 expected insertions and a fpp of 1e-7 do not achieve that desired false positive probability in practice. This issue is most noticeable at a low number of insertions and slowly disappears as they increase towards 500.
Testing details: 300,000 random unique strings of length 14 were generated for each test iteration Bloom Filter created using "Funnels.stringFunnel(StandardCharsets.UTF_8)" Desired false positive probability of 1e-7 Tested for expected number of insertions from 1 to 2000 Each test rerun multiple times First n strings added to Bloom Filter Entire string set tested by calling mightContain() Verified first n were true Counted how many of the remaining strings returned true
Results: Average number of false positives for 1 insertions was around 5000 (1.6e-2) Average number of false positives for 2 insertions was around 1500 (5e-3) False positives dropped to 0 around 500 insertions
Analysis: For the most drastic case when the number of expected insertions is 1, the number of optimal bits is ~34 and the optimal number of hashes is ~23. A peculiarity of this implementation is that the number of bits actually used is always a multiple of 64, so in this case 64 bits is actually used (see BloomFilterStrategies.java line 48 where the bitSize is set to bits.bitSize() rather than the computed number of bits). This means there is a 1/64 chance that the second hash is a multiple of 64 and hence does not affect the combinedHash when incrementing. When this happens, every bit index computed is the same index. If we take a case where a single insertion flips all 23 bits possible, the probability of a false positive is at least (1/64)*(23/64) = 5.6e-3 regardless of the fpp requested. As the number of expected insertions increases, the number of bits does as well which lessens the chance of a hash producing only 1 unique bit index. Note that this happens for all powers of 2 though (i.e. hashes can produce only 2 unique bit indices, and 4, and so on...).
Review of the implementation suggests adding the modded hash2 value is not ideal for a "low" number of bits. A quick test shows calling the hash a second time (with a hard-coded seed) and adding these two additional values at staggered intervals (e.g. every 2 and every 3 iterations) drastically improves the accuracy at low number of insertions by avoiding such a high chance of so many duplicate indices.