Open · davidecaroselli opened this issue 6 years ago
Funny! It looks like, for a given value, occasionally one thread succeeds in setting some of the bits while the other thread sets the remaining bits, so both get `true` back. This happens rather often (e.g., "Expected count <=10000 but found 10892"), probably because the thread that is currently behind does just a fast `get` instead of the slower `compareAndSet` (followed by another CAS in `bitCount.increment`).
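The interleaving described above can be reproduced deterministically with a simplified model of a lock-free bit array. This is a sketch, not Guava's actual `LockFreeBitArray` (names and sizes are illustrative): each of two logical "threads" flips one of the element's two bit positions first and then finds the other already set, so both report a change and both `put` calls would return `true`.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Simplified model of a lock-free bit array: set() returns true
// iff this particular call flipped the bit from 0 to 1.
class Bits {
    final AtomicLongArray data = new AtomicLongArray(4);

    boolean set(long index) {
        int word = (int) (index >>> 6);      // which 64-bit word holds the bit
        long mask = 1L << (int) (index & 63); // bit position inside that word
        long old;
        do {
            old = data.get(word);
            if ((old & mask) != 0) {
                return false; // fast path: bit already set, no CAS needed
            }
        } while (!data.compareAndSet(word, old, old | mask));
        return true; // this CAS flipped the bit
    }
}

public class RaceDemo {
    public static void main(String[] args) {
        Bits bits = new Bits();
        // Two "threads" inserting the same element, which hashes to bits 3 and 7.
        // Interleaving: A sets bit 3 first, B sets bit 7 first, then each
        // finds the other bit already set.
        boolean aChanged = bits.set(3);  // thread A, first hash: flips bit -> true
        boolean bChanged = bits.set(7);  // thread B, second hash: flips bit -> true
        aChanged |= bits.set(7);         // thread A, second hash: already set
        bChanged |= bits.set(3);         // thread B, first hash: already set
        // Both threads saw "bits changed", so both put() calls report true
        // even though the element was only inserted once logically.
        System.out.println(aChanged + " " + bChanged); // true true
    }
}
```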
The javadoc says:

> true if the Bloom filter's bits changed as a result of this operation. If the bits changed, this is definitely the first time `object` has been added to the filter.

The word "definitely" seems to be an overstatement in the case of concurrent operations.
I guess the "fix" is to change the javadoc, as the return value of `put` is probably rarely used. I'm afraid anything else would make the filter slower.

One (stupid) idea would be to detect concurrent writes by reading `bitCount` at the very beginning of `put`, comparing it at the end with the value it should have (the initial value plus the number of times `LockFreeBitArray#set` returned true), and returning false on a mismatch. In the case of two threads each inserting a different value, this would produce a false positive, which is allowed. But this won't really work with the current implementation, as `bitCount` gets incremented after `data` gets modified.
> I guess the "fix" is to change the javadoc, as the return value of put is probably rarely used.

Well... this is actually the main use case if you want to filter out duplicates in a data stream, and I see a lot of cases in which this can be the "primary" goal. Or am I missing something?
As far as I understand, the part of the code that produces the bug is this one:

```java
boolean bitsChanged = false;
long combinedHash = hash1;
for (int i = 0; i < numHashFunctions; i++) {
  // Make the combined hash positive and indexable
  bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);
  combinedHash += hash2;
}
return bitsChanged;
```
So the solution should be to "collect" all the 1-bit positions and then apply them atomically. I truly believe the best fix is to make it thread-safe, even if this means worse speed.

If BloomFilter is thread-safe, it must be thread-safe; making it a mid-way solution will probably lead many users to do terrible things for not knowing how to use the lib... just my 2 cents :)
Cheers, Davide
Edit: ...or at least make it extremely clear, and offer a solution to the user with a second method, like this:

```java
public synchronized boolean syncPut(T object) {
  return put(object);
}
```
Anyone willing to send a PR?
This is already thread-safe based on the API that Bloom filters are generally advertised as having. The current implementation already guarantees that if put(elem) happens-before maybeContains(elem), then maybeContains always returns true. Additionally, the reason we changed this in the first place is that there are users out there who need the performance benefits of concurrent puts and maybeContains calls as currently written.
I would strongly prefer weakening the Javadoc on the return value of put.
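The guarantee described above (a `put(elem)` that happens-before a membership check implies the check returns true) can be illustrated with a toy filter. This is a hedged sketch, not Guava's lock-free implementation; the class, hash mixing, and sizes are all made up for illustration, and `synchronized` is used only to keep the example trivially correct.

```java
import java.util.BitSet;

// Toy Bloom filter used only to illustrate the contract: no false
// negatives, i.e. put(e) happens-before mightContain(e) implies true.
class TinyFilter {
    private static final int SIZE = 1 << 16;
    private static final int NUM_HASHES = 3;
    private final BitSet bits = new BitSet(SIZE);

    // Illustrative hash mixing, not Guava's scheme.
    private int index(String e, int i) {
        int h = e.hashCode() * 31 + i * 0x9E3779B9;
        return (h & Integer.MAX_VALUE) % SIZE;
    }

    synchronized void put(String e) {
        for (int i = 0; i < NUM_HASHES; i++) {
            bits.set(index(e, i));
        }
    }

    synchronized boolean mightContain(String e) {
        for (int i = 0; i < NUM_HASHES; i++) {
            if (!bits.get(index(e, i))) {
                return false; // definitely absent
            }
        }
        return true; // possibly present (false positives allowed)
    }
}

public class ContractDemo {
    public static void main(String[] args) throws InterruptedException {
        TinyFilter f = new TinyFilter();
        Thread writer = new Thread(() -> f.put("hello"));
        writer.start();
        writer.join(); // join() establishes happens-before with the put
        System.out.println(f.mightContain("hello")); // true, guaranteed
    }
}
```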
> > I guess the "fix" is to change the javadoc, as the return value of put is probably rarely used.
>
> Well... this is actually the main use case if you want to filter out duplicates in a data stream, and I see a lot of cases in which this can be the "primary" goal, or am I missing something?

I guess I was missing something (though I wonder why your filtering can't tolerate occasional duplicates while it can tolerate lost items?).
> So the solution should be to "collect" all 1s positions and then apply them atomically.

Whatever you "collect", you can't apply it atomically without locking. Typically, with n hash functions you touch n different words; there are (more cache-efficient) variants packing multiple bits per word. Packing everything into a single word (so it can be CAS-ed) sounds pretty extreme.
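To see why the bit positions usually land in different 64-bit words, here is a small sketch mirroring the loop quoted earlier. The hash values, filter size, and number of hash functions are arbitrary example inputs, not values from the real implementation.

```java
public class WordSpread {
    public static void main(String[] args) {
        long bitSize = 1L << 20;          // example filter size: 1 Mi bits
        long hash1 = 0x9E3779B97F4A7C15L; // arbitrary example hashes
        long hash2 = 0xC2B2AE3D27D4EB4FL;
        int numHashFunctions = 5;

        long combinedHash = hash1;
        for (int i = 0; i < numHashFunctions; i++) {
            // Same indexing scheme as the quoted loop.
            long bitIndex = (combinedHash & Long.MAX_VALUE) % bitSize;
            long wordIndex = bitIndex >>> 6; // which 64-bit word holds this bit
            System.out.println("bit " + bitIndex + " -> word " + wordIndex);
            combinedHash += hash2;
        }
        // With bitSize >> 64, the five bits almost always fall in five
        // different words, so no single compareAndSet on one word can
        // publish all of them atomically.
    }
}
```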
> If BloomFilter is thread-safe, it must be thread-safe; making it a mid-way solution will probably lead many users to do terrible things for not knowing how to use the lib

Good point, but BF is documented to be both thread-safe and lock-free.
This BF (taken from Cassandra) solves the problem in a trivial way: https://github.com/ifesdjeen/blomstre/blob/master/src/com/ifesdjeen/blomstre/BloomFilter.java#L67
I'm not sure I see a thread-safe, lock-free solution that returns a meaningful value from `put`. We might be in a position of being forced to break one of those API specifications.
My understanding was that "thread-safe" and "lock-free" in the Javadoc do not guarantee that the entire operation is atomic. This allowed the small design change in #2748 to be racy, assuming that the results were overall okay.
The alternative that I proposed was more invasive, by using a Bloom-1 design. If redesigning the implementation to target concurrency, then I think it would be better to adopt that. But the goal was to expand usage to allow best effort concurrency, so the implementation change seemed reasonable as is.
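The core idea of a Bloom-1 style design is that all k bits of an element are confined to a single machine word, so one `compareAndSet` publishes them atomically and the boolean returned by `put` is exact even under concurrency. The following is a hedged sketch of that idea only; the class name, hash mixing, and sizing are invented for illustration and are not the proposal referenced above.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Sketch of a Bloom-1 style filter: one hash selects a 64-bit word,
// further hashing selects k bit positions inside that same word.
class Bloom1Sketch {
    final AtomicLongArray words;
    final int numWords;
    final int k; // bits set per element, all inside one word

    Bloom1Sketch(int numWords, int k) {
        this.words = new AtomicLongArray(numWords);
        this.numWords = numWords;
        this.k = k;
    }

    // Returns true iff this call changed the filter; because the whole
    // update is a single CAS, the result is exact under concurrency.
    boolean put(long hash1, long hash2) {
        int wordIndex = (int) Long.remainderUnsigned(hash1, numWords);
        long mask = 0;
        long h = hash2;
        for (int i = 0; i < k; i++) {
            mask |= 1L << (int) (h & 63); // bit position within the word
            h += hash1 | 1;               // derive further in-word positions
        }
        while (true) {
            long old = words.get(wordIndex);
            if ((old | mask) == old) {
                return false; // all k bits already set: probably present
            }
            if (words.compareAndSet(wordIndex, old, old | mask)) {
                return true; // this CAS flipped at least one bit
            }
        }
    }
}

public class Bloom1Demo {
    public static void main(String[] args) {
        Bloom1Sketch f = new Bloom1Sketch(1024, 4);
        System.out.println(f.put(12345L, 67890L)); // true  (first insert)
        System.out.println(f.put(12345L, 67890L)); // false (same element again)
    }
}
```

The trade-off against the current multi-word scheme is a somewhat worse false-positive rate for the same total bit count, since each element's bits are confined to 64 positions.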
Hello team!

I just found a problem using `com.google.common.hash.BloomFilter` with more than one thread. Long story short: `put()` is not atomic, so I am able to reproduce a false negative.

This is the config in `pom.xml`:

I created a simple Test.java class that proves the point. The fix is also included in the file: just uncomment lines 56 and 58 (make `put()` synchronized).

Looking at the `BloomFilter` javadoc I see:

So I suppose that I'm not doing anything "wrong". Can you please confirm the issue, or tell me where I'm using the library wrong?

Thanks!