mozilla / filter-cascade

A python filter cascade implementation
Mozilla Public License 2.0
6 stars 5 forks source link

support bloom filters that have more `include` items than `exclude` #8

Closed eviljeff closed 4 years ago

eviljeff commented 4 years ago

If len(include) > len(exclude) then then an error is thrown:

/deps/lib/python3.6/site-packages/filtercascade/__init__.py:268: in cascade_with_characteristics
    [Bloomer.filter_with_characteristics(capacity, error_rates[0])],
/deps/lib/python3.6/site-packages/filtercascade/__init__.py:75: in filter_with_characteristics
    return Bloomer(size=size, nHashFuncs=nHashFuncs, level=level)
/deps/lib/python3.6/site-packages/filtercascade/__init__.py:30: in __init__
    self.bitarray = bitarray.bitarray(self.size, endian='little')
E   ValueError: cannot create bitarray with negative length

For the addon blocklist it's possible that we will have at some point more blocked addons (include) than not blocked.

jcjones commented 4 years ago

This actually isn't a problem in filtercascade but rather in the calling code, presumably patterned off of certs_to_crlite.py. If you use the defaults for capacity and error rates, then it won't end up in this situation.

That said, it does end up being size-inefficient, so I am adding inversion-mechanics to the tooling. But basically, the problem is the math in getFPRs, and my patch won't fix that part of your calling code, @eviljeff, just heads-up.

eviljeff commented 4 years ago

If we can just use defaults that's better for us - I'd much prefer it if the filter could be initialized with just the data and the appropriate sizes/rates calculated.

Unfortunately there aren't any defaults for the cascade_with_characteristics function though: https://github.com/mozilla/filter-cascade/blob/v0.2.2/filtercascade/__init__.py#L284

This is what we have (patterned off certs_to_crlite as presumed)

    fprs = [len(blocked) / (math.sqrt(2) * len(not_blocked)), 0.5]
    cascade = FilterCascade.cascade_with_characteristics(
        int(len(blocked) * 1.1), fprs)
    cascade.version = 1
    cascade.initialize(include=blocked, exclude=not_blocked)
jcjones commented 4 years ago

__init__ has defaults.

I'll work to improve the documentation. Sorry, I didn't write this code in the first place, I honestly didn't even include the cascade_with_characteristics class constructor in the first round of tests.