google / guava

Google core libraries for Java
Apache License 2.0

BloomFilter multithread problem: put() NOT atomic #3055

Open davidecaroselli opened 6 years ago

davidecaroselli commented 6 years ago

Hello team!

I just found a problem using com.google.common.hash.BloomFilter with more than one thread. Long story short: put() is not atomic so I am able to reproduce a false negative.

This is the config in pom.xml:

<dependency>
   <groupId>com.google.guava</groupId>
   <artifactId>guava</artifactId>
   <version>24.0-jre</version>
</dependency>

I created a simple Test.java class that proves the point. The fix is also included in the file: just uncomment lines 56 and 58 (make put() synchronized).

Looking at the BloomFilter javadoc I see:

 * As of Guava 23.0, this class is thread-safe and lock-free. It internally uses atomics and
 * compare-and-swap to ensure correctness when multiple threads are used to access it.

So I suppose that I'm not doing anything "wrong". Can you please confirm the issue or tell me where I'm using the library wrong?

Thanks!
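To see the mechanism without pulling in Guava, here is a minimal analogue of its CAS-per-bit `LockFreeBitArray` (names and structure are illustrative, not Guava's actual code), with the racy interleaving for one element hashing to two bits played out deterministically:

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Minimal analogue of a lock-free bit array: each bit is set with its own
// CAS, and nothing makes the group of sets atomic. Not Guava's code.
public class PutRaceDemo {
    static final AtomicLongArray data = new AtomicLongArray(1);

    // Returns true iff the bit was previously unset (mirrors a
    // LockFreeBitArray#set-style contract).
    static boolean setBit(int index) {
        long mask = 1L << index;
        while (true) {
            long oldValue = data.get(0);
            if ((oldValue & mask) != 0) {
                return false; // bit already set
            }
            if (data.compareAndSet(0, oldValue, oldValue | mask)) {
                return true;
            }
        }
    }

    public static void main(String[] args) {
        // Suppose the same element hashes to bits {3, 7} and two threads
        // interleave their per-bit sets like this:
        boolean aChanged = false;
        boolean bChanged = false;

        aChanged |= setBit(3); // thread A sets bit 3 -> true
        bChanged |= setBit(3); // thread B: bit 3 already set -> false
        bChanged |= setBit(7); // thread B sets bit 7 -> true
        aChanged |= setBit(7); // thread A: bit 7 already set -> false

        // Both threads conclude they inserted the element "first".
        System.out.println("A returned " + aChanged); // true
        System.out.println("B returned " + bChanged); // true
    }
}
```

Both `put`-equivalents return true for the same element, which is exactly the duplicated "first time" signal reported above.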

Maaartinus commented 6 years ago

Funny! It looks like, for a given value, one thread occasionally succeeds in setting some of the bits while the other thread sets the rest, so both get true back. This happens rather often (e.g., "Expected count <=10000 but found 10892"), probably because the thread currently behind does just a fast get instead of the slower compareAndSet (followed by another CAS in bitCount.increment).

The javadoc says

return true if the Bloom filter's bits changed as a result of this operation. If the bits changed, this is definitely the first time object has been added to the filter.

and the word "definitely" seems to be an overstatement in case of concurrent operations.

I guess, the "fix" is to change the javadoc, as the return value of put is probably rarely used. I'm afraid, anything else would make the filter slower.


One (stupid) idea would be to detect concurrent writes by reading bitCount at the very beginning of put, comparing it at the end with the value it should have (the initial value plus the number of times LockFreeBitArray#set returned true), and returning false on a mismatch. In the case of two threads inserting a different value each, this would produce a false positive, which is allowed. But this won't really work with the current implementation, as bitCount gets incremented after the data gets modified.
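A single-threaded sketch of that bookkeeping idea (hypothetical names and a plain BitSet stand-in, not Guava's implementation, and subject to the ordering caveat just mentioned):

```java
import java.util.BitSet;
import java.util.concurrent.atomic.LongAdder;

// Sketch of the detection idea: compare bitCount before and after put()
// with the value our own successful set() calls predict; a mismatch means
// another thread wrote concurrently, so answer false ("maybe already
// present"), which a Bloom filter is allowed to do.
public class DetectingPut {
    private final BitSet bits = new BitSet(1024); // stand-in bit array
    private final LongAdder bitCount = new LongAdder();

    private boolean set(int index) {
        if (bits.get(index)) {
            return false;
        }
        bits.set(index);
        // The count is bumped after the data write - the comment above
        // notes this lag is exactly why the trick fails under real races.
        bitCount.increment();
        return true;
    }

    public boolean put(int[] bitIndexes) {
        long before = bitCount.sum();
        int mySets = 0;
        for (int index : bitIndexes) {
            if (set(index)) {
                mySets++;
            }
        }
        boolean bitsChanged = mySets > 0;
        // Any extra movement in bitCount means a concurrent writer raced
        // with us; conservatively report "already present".
        return bitCount.sum() == before + mySets && bitsChanged;
    }
}
```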

davidecaroselli commented 6 years ago

I guess, the "fix" is to change the javadoc, as the return value of put is probably rarely used.

Well.. this is actually the main use-case if you want to filter out duplicates in a data stream.. and I see a lot of cases in which this can be the "primary" goal, or am I missing something?

As far as I understand, the part of the code that produces the bug is this one:

boolean bitsChanged = false;
long combinedHash = hash1;
for (int i = 0; i < numHashFunctions; i++) {
   // Make the combined hash positive and indexable
   // Each bits.set() is an independent CAS; nothing ties the k updates
   // together, so two threads can split the "first" set()s between them.
   bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);
   combinedHash += hash2;
}
return bitsChanged;

So the solution should be to "collect" all the 1-bit positions and then apply them atomically. I truly believe that the best fix is to make it thread-safe, even if this means worse performance.

If BloomFilter is thread-safe, it must be thread-safe; making it a mid-way solution will probably lead many users to do terrible things by not knowing how to use the lib... just my 2 cents :)

Cheers, Davide

Edit: .. or at least make it extremely clear and offer a solution to the user with a second method, like this:

public synchronized boolean syncPut(T object) {
    return put(object);
}

kluever commented 6 years ago

Anyone willing to send a PR?

lowasser commented 6 years ago

This is already thread-safe based on the API that Bloom filters are generally advertised as having. The current implementation already guarantees that if put(elem) happens-before mightContain(elem), then mightContain always returns true. Additionally, the reason we changed this in the first place is that there are users out there who need the performance benefits of concurrent put and mightContain calls as currently written.

I would strongly prefer weakening the Javadoc on the return value of put.

Maaartinus commented 6 years ago

I guess, the "fix" is to change the javadoc, as the return value of put is probably rarely used.

Well.. this is actually the main use-case if you want to filter out duplicates in a data stream.. and I see a lot of cases in which this can be the "primary" goal, or am I missing something?

I guess I was missing something (though I wonder why your filtering can't tolerate occasional duplicates when it can tolerate lost items?).

So the solution should be to "collect" all 1s positions and then apply them atomically.

Whatever you "collect", you can't apply it atomically without locking. Typically with n hash functions, you use n different words; there are (more cache-efficient) variants packing multiple bits per word. Packing everything into a single word (so it can be CAS-ed) sounds pretty extreme.

If BloomFilter is thread-safe, it must be thread-safe; making it a mid-way solution will probably lead many users to do terrible things by not knowing how to use the lib

Good point, but BF is documented to be both thread-safe and lock-free.

This BF (taken from Cassandra) solves the problem in a trivial way: https://github.com/ifesdjeen/blomstre/blob/master/src/com/ifesdjeen/blomstre/BloomFilter.java#L67

lowasser commented 6 years ago

I'm not sure I see a thread-safe, lock-free solution that returns a meaningful value from put. We might be in a position of being forced to break one of those API specifications.

ben-manes commented 6 years ago

My understanding was that thread-safe and lock-free in the JavaDoc does not guarantee that the entire operation is atomic. This allowed the small design change in #2748 to be racy, assuming that the results were overall okay.

The alternative that I proposed was more invasive, by using a Bloom-1 design. If redesigning the implementation to target concurrency, then I think it would be better to adopt that. But the goal was to expand usage to allow best effort concurrency, so the implementation change seemed reasonable as is.
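For illustration, here is a rough sketch of the Bloom-1 idea mentioned above (illustrative only; not the actual proposal from #2748 or Guava code): each element maps to a single 64-bit word, all k bit positions are derived within that word, and one compareAndSet applies them together, so put() is lock-free and its return value is reliable under concurrency.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Bloom-1-style sketch: one word per element, k bits inside that word,
// a single CAS per put(). Hypothetical names, not a drop-in replacement.
public class Bloom1Sketch {
    private final AtomicLongArray words;
    private final int numWords;
    private final int k;

    public Bloom1Sketch(int numWords, int k) {
        this.words = new AtomicLongArray(numWords);
        this.numWords = numWords;
        this.k = k;
    }

    // Derive k bit positions within one 64-bit word from the hash
    // (consumes 6 hash bits per position; assumes k <= 10).
    private long maskFor(long hash) {
        long mask = 0;
        long h = hash;
        for (int i = 0; i < k; i++) {
            mask |= 1L << (h & 63);
            h >>>= 6;
        }
        return mask;
    }

    // Returns true iff this call changed the filter - atomic via one CAS,
    // so exactly one of two concurrent puts of the same element gets true.
    public boolean put(long hash) {
        int wordIndex = (int) Long.remainderUnsigned(hash, numWords);
        long mask = maskFor(hash ^ (hash >>> 17)); // re-mix for bit choice
        while (true) {
            long old = words.get(wordIndex);
            if ((old & mask) == mask) {
                return false; // all k bits already set
            }
            if (words.compareAndSet(wordIndex, old, old | mask)) {
                return true;
            }
        }
    }

    public boolean mightContain(long hash) {
        int wordIndex = (int) Long.remainderUnsigned(hash, numWords);
        long mask = maskFor(hash ^ (hash >>> 17));
        return (words.get(wordIndex) & mask) == mask;
    }
}
```

The trade-off is a different (generally somewhat worse) false-positive rate than a classic Bloom filter of the same size, in exchange for one cache line touched per operation and an atomic put.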