jasondavies / bloomfilter.js

JavaScript bloom filter using FNV for fast hashing
http://www.jasondavies.com/bloomfilter/
BSD 3-Clause "New" or "Revised" License
759 stars 79 forks source link

Too many false positives #15

Closed adamhooper closed 6 years ago

adamhooper commented 9 years ago

Edited by @jasondavies: looks like we need to review the use of double-FNV.

adamhooper commented 9 years ago

Ooh excuse me. It looks like I misunderstood what fnv_1a_b does. This issue description and title are completely wrong :)....

But I am seeing tons of false positives. I threw 150 Strings at my bloom filter and it gave at least five false positives. n=12,148,218; m=425,071,000; k=24. (Should be p=5e-7, but I've spotted 3e-2 and there are surely others.)

adamhooper commented 9 years ago

I've tried hacking in fnv_1() to use as the second hash. It eliminates some false positives, but not others.

jasondavies commented 9 years ago

Perhaps you could post a reproducible example? I’m having trouble understanding what the problem is exactly…

any particular value of a will always produce the same value of b

I might be missing something, but I’m pretty sure this isn’t true; a trivial check in the bundled tests shows that a and b are always different.

adamhooper commented 9 years ago

Yeah, I clearly jumped the gun on this issue. (And thanks for such a quick reply!) I can produce a reproducible example, but I'm not quite 100% on the details. It'll be ~100MB.

I only have two factoids so far, but I'll update this issue when I learn more. So far:

From these findings, my hunch is that the hashing function is far worse than it should be. (I don't see why, though.) I tried throwing in xxhashjs, but it's far, far too slow.

At this amount of data, each step of testing takes a while. Tomorrow I can send you a ~100MB reduction. That should demonstrate how many false positives we we should see and how many there actually are.

jasondavies commented 9 years ago

Yes, after further investigation it does look like the double-FNV approach isn’t quite right. I would advise swapping in a second hash function to calculate b, rather than the current approach. I’ll update the library when I get time.

pchaigno commented 8 years ago

@jasondavies Is there a fix planned for this?

adamhooper commented 8 years ago

@pchaigno Back when I had this problem, I solved it in a different project, C++-based. That's at https://github.com/overview/overview-js-bloom-filter. Hasn't been updated in a while, but as long as you're not using a web browser, it should do the trick. It uses a better hashing function to avoid this issue.

pchaigno commented 8 years ago

Ooh nice! Thanks @adamhooper! I'll switch if @jasondavies doesn't answer :p

syzer commented 7 years ago

:+1: