Closed adamhooper closed 6 years ago
Ooh excuse me. It looks like I misunderstood what fnv_1a_b does. This issue description and title are completely wrong :)....
But I am seeing tons of false positives. I threw 150 Strings at my bloom filter and it gave at least five false positives. n=12,148,218; m=425,071,000; k=24. (Should be p=5e-7, but I've spotted 3e-2 and there are surely others.)
I've tried hacking in fnv_1()
to use as the second hash. It eliminates some false positives, but not others.
Perhaps you could post a reproducible example? I’m having trouble understanding what the problem is exactly…
any particular value of
a
will always produce the same value ofb
I might be missing something, but I’m pretty sure this isn’t true; a trivial check in the bundled tests shows that a and b are always different.
Yeah, I clearly jumped the gun on this issue. (And thanks for such a quick reply!) I can produce a reproducible example, but I'm not quite 100% on the details. It'll be ~100MB.
I only have two factoids so far, but I'll update this issue when I learn more. So far:
fnv_1a_b()
call with a new fnv_1()
call (which is basically a cut-and-paste job), I see significantly fewer false positives. But it's still far too many.From these findings, my hunch is that the hashing function is far worse than it should be. (I don't see why, though.) I tried throwing in xxhashjs, but it's far, far too slow.
At this amount of data, each step of testing takes a while. Tomorrow I can send you a ~100MB reduction. That should demonstrate how many false positives we we should see and how many there actually are.
Yes, after further investigation it does look like the double-FNV approach isn’t quite right. I would advise swapping in a second hash function to calculate b
, rather than the current approach. I’ll update the library when I get time.
@jasondavies Is there a fix planned for this?
@pchaigno Back when I had this problem, I solved it in a different project, C++-based. That's at https://github.com/overview/overview-js-bloom-filter. Hasn't been updated in a while, but as long as you're not using a web browser, it should do the trick. It uses a better hashing function to avoid this issue.
Ooh nice! Thanks @adamhooper! I'll switch if @jasondavies doesn't answer :p
:+1:
Edited by @jasondavies: looks like we need to review the use of double-FNV.