I'm not sure where the `calc_size()` formulation used in this implementation came from, but it's somewhat suboptimal. Here are some plots comparing the sizes of the existing "Moz" bloom filters against the `1.44 * n * log2(1/p)` formulation from the CRLite paper:
All three plots show how filters built with the two size formulas compare as the number of elements (n) and the false positive rate (FPR, p) vary. The top plot shows total filter size in bits, the middle plot the difference in size in bits, and the bottom plot the percentage increase in size.
Depending on how things work out with `math.ceil()`, the existing formulation produces filters that are 0-5% oversized.
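To make the comparison concrete, here's a minimal sketch of the paper's sizing formula. The function name `calc_size_crlite` is mine for illustration, not an identifier from this codebase:

```python
import math

def calc_size_crlite(n: int, p: float) -> int:
    """Bloom filter size in bits per the CRLite paper's formula,
    m = 1.44 * n * log2(1/p).

    The constant 1.44 approximates 1/ln(2) ~= 1.4427, which comes
    from the textbook optimal sizing m = -n * ln(p) / ln(2)^2.
    """
    return math.ceil(1.44 * n * math.log2(1 / p))

# For example, 1,000,000 elements at p = 0.5 gives
# 1.44 * 1e6 * log2(2) = 1,440,000 bits.
```

Since `log2(1/p)` is exact for power-of-two error rates, any oversizing in the existing formulation relative to this one shows up directly in the plots above.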
I also added some optional parameters to the filter cascade constructor, so downstream users can tune the free-space overhead and the minimum filter size.
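A hedged sketch of what such a parameterized constructor could look like. The class shape and parameter names here (`overhead`, `min_filter_bits`) are illustrative assumptions, not the actual API of this library:

```python
import math

class FilterCascade:
    """Illustrative sketch only; parameter names are hypothetical."""

    def __init__(self, overhead: float = 1.44, min_filter_bits: int = 32):
        # overhead: multiplier on n * log2(1/p); 1.44 ~= 1/ln(2)
        #   reproduces the CRLite paper's sizing, and callers can
        #   raise it to trade space for fewer cascade layers.
        # min_filter_bits: floor on any single layer's size, so very
        #   small layers don't degenerate into near-empty filters.
        self.overhead = overhead
        self.min_filter_bits = min_filter_bits

    def layer_size(self, n: int, p: float) -> int:
        """Size in bits for one cascade layer with n elements
        and target false positive rate p."""
        bits = math.ceil(self.overhead * n * math.log2(1 / p))
        return max(self.min_filter_bits, bits)
```

The design intent is just to lift the constants out of the sizing code so downstream users can override them without patching the library.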