joseph-fox / python-bloomfilter

Scalable Bloom Filter implemented in Python
MIT License
164 stars 25 forks source link
bigdata bloomfilter-python membership-query

Build Status

Python Bloom Filter

This Bloom Filter has its tightening ratio updated to 0.9, and this ration is consistently used throughout the pybloom module. Choosing r around 0.8 - 0.9 will result in better average space usage for wide range of growth, therefore the default value of model is set to LARGE_SET_GROWTH. This is a module that includes a Bloom Filter data structure along with an implementation of Scalable Bloom Filters as discussed in:

P. Almeida, C.Baquero, N. Preguiça, D. Hutchison, Scalable Bloom Filters, (GLOBECOM 2007), IEEE, 2007.

Bloom filters are great if you understand what amount of bits you need to set aside early to store your entire set. Scalable Bloom Filters allow your bloom filter bits to grow as a function of false positive probability and size.

A filter is "full" when at capacity: M * ((ln 2 ^ 2) / abs(ln p)), where M is the number of bits and p is the false positive probability. When capacity is reached a new filter is then created exponentially larger than the last with a tighter probability of false positives and a larger number of hash functions.

    >>> import pybloom_live
    >>> f = pybloom_live.BloomFilter(capacity=1000, error_rate=0.001)
    >>> [f.add(x) for x in range(10)]
    [False, False, False, False, False, False, False, False, False, False]
    >>> all([(x in f) for x in range(10)])
    True
    >>> 10 in f
    False
    >>> 5 in f
    True
    >>> f = pybloom_live.BloomFilter(capacity=1000, error_rate=0.001)
    >>> for i in xrange(0, f.capacity):
    ...     _ = f.add(i)
    >>> (1.0 - (len(f) / float(f.capacity))) <= f.error_rate + 2e-18
    True

    >>> sbf = pybloom_live.ScalableBloomFilter(mode=pybloom_live.ScalableBloomFilter.SMALL_SET_GROWTH)
    >>> count = 10000
    >>> for i in range(0, count):
            _ = sbf.add(i)

    >>> (1.0 - (len(sbf) / float(count))) <= sbf.error_rate + 2e-18
    True
    # len(sbf) may not equal the entire input length. 0.01% error is well
    # below the default 0.1% error threshold. As the capacity goes up, the
    # error will approach 0.1%.

Development

We follow this git branching model, please have a look at it.

Installation instructions

If you are installing from an internet-connected computer (or virtual install), you can use the pip python package manager to download and install this package. Simply type pip install pybloom-live from a DOS command prompt (cmd.exe) or a linux shell (e.g. bash or dash on MacOS X as well as linux OSes including debian, slackware, redhat, enoch and arch).

If using Windows and you are installing onto an air-gapped computer or want the most up-to-date version from this repository, you can do the following:

  1. Download the zip file by clicking on the green "Clone or Download" link followed by "Download Zip."

  2. Extract all the contents of the the zip folder.

  3. Open command prompt (cmd.exe) to the extracted folder. a. Find the extracted folder in Windows Explorer. b. From the parent folder level Shift+RightClick on the folder. c. Select "Open command window here".

  4. Type pip install ..

Similar steps are possible under linux and MacOS X.

Breaking changes with 4.x

Support for non-cryptographic hashes has been added in 4.0.0. For 128 bit hashes, md5 has been replaced with xxh3_128, one of the fastest non-cryptographic hash functions. Details of the benchmark runs can be found here. Files generated with earlier versions of the module will not work with this version. Consider re-generating them using the latest version optimized for speed.

Installation verification

Type pip show pybloom-live from a command prompt. Version should be 2.2.0 as of 2016-12-11.