Python bindings for C implementation of Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters and of Binary Fuse Filters: Fast and Smaller Than Xor Filters.
pip install pyxorfilter
git clone --recurse-submodules https://github.com/glitzflitz/pyxorfilter
cd pyxorfilter
python setup.py build_ext
python setup.py install
>>> from pyxorfilter import Xor8, Xor16, Fuse8, Fuse16
>>> filter = Xor8(5) #or Xor16(size)
>>> #Supports unicode strings and heterogeneous types
>>> test_str = ["あ","अ", 51, 0.0, 12.3]
>>> filter.populate(test_str)
True
>>> filter.contains("अ")
True
>>> filter[51] #You can use __getitem__ instead of contains
True
>>> filter["か"]
False
>>> filter.contains(150)
False
>>> filter.size_in_bytes()
60
You can serialize a filter with the serialize()
method which returns a buffer, and you can recover the filter with the deserialize(buffer)
method, which returns a filter:
> f = open('/tmp/output', 'wb')
> f.write(filter.serialize())
> f.close()
> recoverfilter = Xor8.deserialize(open('/tmp/output', 'rb').read())
For more accuracy(less false positives) use larger but more accurate Xor16 for Fuse16.
For large sets (contain millions of keys), Fuse8/Fuse16 filters are faster and smaller than Xor8/Xor16.
>>> filter = Xor8(1000000)
>>> filter.size_in_bytes()
1230054
>>> filter = Fuse8(1000000)
>>> filter.size_in_bytes()
1130536