Closed GoogleCodeExporter closed 9 years ago
Hypothetical use cases:
1. Wikipedia uses an example of a spell-checker; if you put legal words in a
Bloom
filter, you will be sure not to call something an error that isn't, but you
could
miss flagging certain misspellings on occasion.
2. A "screen" so that you can optimize away an expensive rpc/db query in case
the
Bloom filter gives you a definitive "no".
3. Suppose you want to display a special message to customers that are in your
special set of "most valued". But you don't mind if that message also gets
shown to
some other folks as well. You could put all your valued customers into a
BloomFilter<Customer>, and whenever this filter "mightContain" the logged-in
customer, show the message.
There may be practical reasons why a Bloom filter is or isn't a good choice in
any of
these situations, but hopefully these are useful as illustrations.
Note: a Bloom filter requires good hashing to work properly, not just a crummy
32-bit
hash code from Object.hashCode(). So this enhancement request depends on us
providing some better support for hashing.
Original comment by kevin...@gmail.com
on 30 Oct 2007 at 11:06
Original comment by kevin...@gmail.com
on 3 Nov 2007 at 5:33
Original comment by kevin...@gmail.com
on 2 Jun 2008 at 5:48
Original comment by kevin...@gmail.com
on 17 Sep 2009 at 6:02
I've just taken a look at the interface and thought I'd respond with some
initial
thoughts.
There is another method of the Collection interface that could be usefully
adopted:
boolean mightContainAll(Collection<?> c) //equivalent of containsAll()
Implementing size() is a possibility, by counting the number of adds that
mutate the
filter, but probably not worthwhile; it should be possible (better actually) to
compute the falsePositiveProbability without this count and to use the number
of bits
set within the filter instead.
Extending the Collection interface directly is of little benefit because the
iterator() method cannot be supported; such a Collection implementation would
interact very poorly with any of the standard collection classes. Where it
would
provide a small benefit is that the BloomFilter interface could avoid
overloading
'batch' operations with versions that take another bloom filter:
boolean addAll(BloomFilter<E> filter)
boolean mightContainAll(BloomFilter<E> filter) //containsAll would be a more
accurate
in this case
These last two operations are clearly only defined over compatible filters, but
are
important because they are very efficient. But the conditions under which two
BloomFilters are compatible is not a straightforward issue since the filters
must
match in capacity (number of bits) and in their hashing. Accessors over these
would
enable two different BloomFilter implementations to determine if they are
compatible
to each other. For example:
int getCapacity()
Hash getHash() //serializablity requires hash generators be serializable also
And this is not sufficient because access to the actual bitset would also be
necessary, it might be necessary to replace the capacity accessor with a method
that
exposes the filter's state through a specific bitset object:
Bitset getBits()
Whether this Bitset would be an interface or concrete I find difficult to
determine.
What are the benefits of all this?
* The BloomFilter 'union' operation (addAll) and 'subset' test (containsAll) become
possible.
* Exposure of the filter bits allows an application to persist the filter outside of
standard java serialization.
* Equality of BloomFilters becomes defined (as equality of bitset and hash).
* BloomFilter.hashCode() can be specified too.
Whether this complexity is worthwhile is an open question, but my view is that,
if
you are looking to introduce BloomFilter as an interface rather than a concrete
class, it's necessary to consider the conditions under which the instances of
two
different BloomFilter implementations are equal (as the java.util package
attempts to
do for collections). I also know that the first two capabilities in this list
have
been vital in one of my applications.
Original comment by tomgibara
on 17 Mar 2010 at 10:34
Original comment by kevinb@google.com
on 30 Jul 2010 at 3:53
Original comment by kevinb@google.com
on 30 Jul 2010 at 3:56
Support for set operations (determining the union, intersection or difference
without exposing bitsets) over bloom filters would be useful.
Original comment by reachb...@gmail.com
on 11 Aug 2010 at 11:50
Original comment by kevinb@google.com
on 27 Jan 2011 at 1:54
Original comment by kevinb@google.com
on 27 Jan 2011 at 1:59
A very useful data structure indeed.
Original comment by jon.h.clark
on 7 Feb 2011 at 5:31
Original comment by kevinb@google.com
on 13 Jul 2011 at 6:18
Original comment by kevinb@google.com
on 16 Jul 2011 at 8:37
There has been a lot of progress on this and we may see it in release 12, to
put a guess out there.
Original comment by kevinb@google.com
on 19 Jul 2011 at 2:24
Any chance we could get this into Git, if not exposed, so I can play with it?
Original comment by wasserman.louis
on 6 Dec 2011 at 4:09
Since recently it became more or less feature complete (i.e. serialization), I
wouldn't mind exposing it as-is in @Beta. Kevin?
Original comment by jim.andreou
on 6 Dec 2011 at 5:10
Seems like there is no roadblock to making this public (@Beta of course), along
with the new hashing abstractions. Just a matter of someone actually doing it
Original comment by jim.andreou
on 7 Dec 2011 at 2:01
I get the impression folks are really busy, though, so I'd settle for
open-sourcing it without exposing the API yet.
Original comment by wasserman.louis
on 7 Dec 2011 at 2:37
Original comment by fry@google.com
on 10 Dec 2011 at 3:12
Original comment by wasserman.louis
on 31 Dec 2011 at 5:17
Didn't want to create a separate ticket for this question.
'put' has Key-Value association semantics by convention and 'add' has
Set-element-addition semantics, (both in Java at least).
I am wondering why the BloomFilter implementation uses 'put' instead of 'add'
even though the original API draft at the top of this page says 'add' a
possible 'addAll'.
Original comment by vaya...@gmail.com
on 7 Apr 2012 at 12:05
The proposed addAll(BloomFilter<T>) method (alternately called "union") is
essential for my needs, which involve populating several compatible filters on
separate computers and then combining the (deserialized) filters back into one.
At present, there's no way build up one filter from one or more other filters
(despite the copy() method being provided).
Original comment by sehar...@gmail.com
on 14 Jun 2012 at 2:37
It's not clear to me that such a method could be defined, depending on the
definition of "compatible" filters.
Original comment by wasserman.louis
on 14 Jun 2012 at 2:39
Comment #5 above touches on this challenge of deducing that two filters are
compatible. I suspect that you won't buy this stance, but I'd be happy enough
with trusting the caller. If the caller doesn't understand well enough the
invalid consequences of pouring one bag of bits into another, there's not much
we can do for him. I'd rather see it become /possible/ to perform this
operation than continue to ban it, just because it's vulnerable to misuse.
Original comment by sehar...@gmail.com
on 14 Jun 2012 at 4:53
Hmmm. How would you feel about throwing an exception if the target filter
wasn't initialized with the same parameters?
Original comment by wasserman.louis
on 14 Jun 2012 at 4:58
Throwing an exception (IllegalArgumentException) is a reasonable response to
trying to merge in a filter constructed with different parameters. The addAll
(or merge() or union()) method would then involve first verifying equivalent
configuration of the incoming filter and, upon confirming compatibility, ORin
in its underlying bit vector, right?
Original comment by sehar...@gmail.com
on 14 Jun 2012 at 5:44
Yep, that's what I'd expect.
Original comment by wasserman.louis
on 14 Jun 2012 at 5:46
This issue has been migrated to GitHub.
It can be found at https://github.com/google/guava/issues/<id>
Original comment by cgdecker@google.com
on 1 Nov 2014 at 4:16
Original comment by cgdecker@google.com
on 3 Nov 2014 at 9:10
Original issue reported on code.google.com by
kevin...@gmail.com
on 23 Oct 2007 at 4:02