twitter / algebird

Abstract Algebra for Scala
https://twitter.github.io/algebird
Apache License 2.0
2.29k stars 345 forks source link

FlowerFilter monoid #135

Open johnynek opened 11 years ago

johnynek commented 11 years ago

http://eng.42go.com/a-simple-time-decaying-approximate-membership-filter/

This stores each item T in each of it's hash bucket indices (so it is considerably larger than a normal bloom filter), but it considers something present if t: T exists in one of the hashed locations.

This is interesting because it can only remember recent items, which is often useful for some streaming join, or some decayed approximation algorithm.

krishnanraman commented 11 years ago
 So you need a family of hash functions to pull this off. If you have n hashfunctions, and an array of size s, the following block of code does the rest:

scala> def hashFunction(n:Int, s:Int) = {
 | val p = math.ceil( (1 + math.sqrt(1+4*n))/2.0).toInt
 | val a = (1 to (p-1))
 | val b = (1 to p)
 | (k:Int) => a.map( ia => b.map( ib => ((ia*k+ib)%p)%s)).flatten.slice(0,n)
 | }
 hashFunction: (n: Int, s: Int)Int => scala.collection.immutable.IndexedSeq[Int]

So if you need 10 hash functions over an array with capacity to store 100 elements:

 scala> hashFunction(10,100)
 res3: Int => scala.collection.immutable.IndexedSeq[Int] = <function1>

To actually use the 10 hash functions on a key, say key = 34 scala> res3(34) res4: scala.collection.immutable.IndexedSeq[Int] = Vector(3, 0, 1, 2, 1, 2, 3, 0, 3, 0)

The family of hashfunction algo : From pg 267, Corman & Leiserson ( Designing a universal class of hash functions )

krishnanraman commented 11 years ago

Edit: p has to be prime, btw :) So the algorithm above needs a little more work.

johnynek commented 11 years ago

We've already done this three times in the code: min hash, count min sketch, and bloom filter. Here's one more approach: http://stackoverflow.com/questions/9553989/what-hashing-techniques-to-use-when-building-a-bloom-filter-in-clojure

I don't think we need a new hash family. In fact, I'd argue we should only use one approach in this code.

avibryant commented 11 years ago

Agreed, we should have a HashFamily that gets used by all of these.

BTW this is a variation of what @jmhodges calls "The opposite of a bloom filter", right? http://www.somethingsimilar.com/2012/05/21/the-opposite-of-a-bloom-filter/

krishnanraman commented 11 years ago

In that case, the 'val hashFunctions' implementation in min hash should suffice for my purposes.

I will suggest benchmarking... the Leiserson algo is very,very cheap - generate a prime number once and integral mod twice. Whereas the implementation you have uses a random number gen plus other heavy machinery - maybe more robust but also more expensive.

davidthomas5412 commented 11 years ago

Do you feel aFlowerFilter.contains(item) should return a Boolean or an ApproximateBoolean like BloomFilter?

Although it goes against the precedent, I believe it might be better to return a Boolean because:

1) goal is to be fast/streaming and calculating probability for this structure is really expensive (based on http://eng.42go.com/flower-filter-an-update/ plus extra multiplicative factor for possible collisions when converting item to Byte or Short) 2) the user sets their decay and accuracy thresholds when initializing 3) wouldn't be that useful for standard use cases

Feedback appreciated.

johnynek commented 11 years ago

David, if we could get a correct bound on the Boolean, then we should do so. So, when it says true, the probabilty it is correct is 1. But when we say absent, what is the probability that another item has evicted that item? Can we derive such a bound?

davidthomas5412 commented 11 years ago

Yes sir, a slight modification to http://eng.42go.com/flower-filter-an-update/ for fingerprinting collisions yields the true probability. Two other things that I missed:

1) We can find a simple log bound as opposed to exact computation 2) We can make the computation lazy so that the data structure is fast when the probabilities are not used.