google / guava

Google core libraries for Java
Apache License 2.0
50.24k stars 10.91k forks source link

ProbabilisticSet or ProbabilisticFilter interface, implemented by BloomFilter #2226

Open bdupras opened 9 years ago

bdupras commented 9 years ago

tl;dr: A Cuckoo Filter provides similar functionality to a Bloom Filter with lower hashing overhead and deletion support.

Add: public interface ProbablisticFilter Add: public class CuckooFilter implements ProbablisticFilter Modify: public class BloomFilter implements ProbablisticFilter

The Cuckoo Filter is described in this paper: Cuckoo Filter: Practically Better Than Bloom. Quoting from the abstract:

We propose a new data structure called the cuckoo filter that can replace Bloom filters for approximate set membership tests. Cuckoo filters support adding and removing items dynamically while achieving even higher performance than Bloom filters. For applications that store many items and target moderately low false positive rates, cuckoo filters have lower space overhead than space-optimized Bloom filters. Our experimental results also show that cuckoo filters outperform previous data structures that extend Bloom filters to support deletions substantially in both time and space.

Possbile Interface:

public interface ProbablisticFilter<T> {
    boolean mightContain(T object);
    boolean put(T object);
    long size(); // cardinality
    long capacity(); // or expectedInsertions()
    double expectedFpp();
}

We can contribute an implementation that follows the norms of the BloomFilter design if there's interest.

@bdupras, @beala

/cc @binfan999 @apc999 @dave-andersen @mkaminsky & Michael Mitzenmacher (couldn't find his GitHub username)

ben-manes commented 9 years ago

A casual read, this looks similar to TinySet and TinyTable.

apc999 commented 9 years ago

@bdupras thinks for ping me on this issue. I will be able to help if there is anything needed from my side. Question here, I guess for CuckooFilter, it will implement delete (or similar) as that is the differentiator of CuckooFilter from BloomFilter, right?

apc999 commented 9 years ago

@ben-manes I think cuckoo filter shares the similar idea of TinySet in the way they both store fingerprints in a hash table for approximate set-membership tests (like BloomFilter). One key difference is TinySet is more like using traditional open-addressing hashtable, while cuckoo filter is built based on partial-key cuckoo hashing (see more in our paper Cuckoo Filter: Practically Better Than Bloom). My very biased view is, in the context of achieving really high-space efficiency and also good performance, partial-key cuckoo hashing is very easy to understand and implement correctly. :)

bdupras commented 9 years ago

... CuckooFilter, it will implement delete (or similar)

Yes, absolutely. Perhaps as a separate interface, e.g. ...

public interface DeletableFilter<T> {
    boolean delete(T object);
}
bdupras commented 9 years ago

@apc999 I'll see if I can get our first attempt at an implementation out to a place where you can have a look. Nicely written paper, by the way - very easy to understand.

A few things of note about our implementation...

We chose a different strategy for calculating the alternate index. The paper's method using xor has a side-effect of requiring the number of buckets (B) to be a power of 2. Our implementation calculates an offset from the current index that is always odd, and then applies that offset positive or negative depending on whether the current index is even or odd, wrapping around the ends of the bucket array as necessary. This requires only that the number of buckets be even, giving us more flexibility in the memory footprint than power of 2.

We also implemented rollback on insert such that if we exhaust the max_kicks, we rollback and leave the filter in a consistent state before throwing an insertion error.

@beala - did I forget anything?

beala commented 9 years ago

Hi @apc999! Thanks for offering your help and thanks for your paper! It's been a big win for us.

Two other additions:

  1. Our fingerprint is the first non-zero string of bits in the hash of the key. If the hash is 0, the fingerprint is 1. Without this, it's ambiguous if a cell containing 0 is empty, or contains a fingerprint of 0.
  2. We created a "growable cuckoo filter" which contains a list of cuckoo filters. If a cuckoo filter is full, we allocate another filter and insert into that. The rollback on insert makes this possible (otherwise an insert failure drops data).
kevinb9n commented 9 years ago

We are sympathetic to the need for an interface, but we think new types of probabilistic sets can be developed and published separately from Guava.

bdupras commented 9 years ago

Thanks @kevinb9n. We'd certainly welcome an interface to implement against. Would you be interested in us providing a pull request to do that?

@kluever, we noticed in issue #1518 that you requested deletion support from the Bloom Filter. Are you still interested having a deletable filter implementation?

kluever commented 9 years ago

@bdupras I personally don't have a need for that feature (I filed the issue on behalf of a user). https://groups.google.com/forum/#!topic/guava-discuss/bwimfMksFWg