igrigorik / bloomfilter-rb

BloomFilter(s) in Ruby: Native counting filter + Redis counting/non-counting filters
http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/
472 stars 60 forks source link

Why is 'size' used with Redis backed bloomfilters? #31

Closed railsmechanic closed 10 years ago

railsmechanic commented 10 years ago

If I remember rightly, the size setting is required for bloomfilters based on fixed size data structures, like an array. (Rule of thumb: If you need a bloomfilter with n elements, just set size to a multiple of n.)

But I wonder why Redis backed bloomfilters like CountingRedis need this setting? As far as I can see, the size setting is just used to calculate the modulo of the crc32 hash.

(Zlib.crc32("#{key}:#{i+@opts[:seed]}") % @opts[:size]).to_s

Is this modulo instruction just used to shorten the hash? (For a large size, the benefit might be negligible.) Should this statement not be removed, in order to prevent persisted bloomfilters from being cluttered, when using a different setting for size?

igrigorik commented 10 years ago

Your "size" setting should be a function of acceptable error rate: http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/

In effect, you're trading off amount memory vs error rate, and that applies regardless of the backend you're using (redis, native, etc).