mozilla / crlite

WebPKI-level Certificate Revocation via Multi-Level Bloom Filter Cascade
http://www.cs.umd.edu/~dml/papers/crlite_oakland17.pdf
Mozilla Public License 2.0

xor filters #44

Open est31 opened 4 years ago

est31 commented 4 years ago

(originally filed at https://github.com/mozilla/rust-cascade/issues/18 when this repo hadn't yet been opened)

It would be interesting to investigate usage of xor filters for crlite: https://lemire.me/blog/2019/12/19/xor-filters-faster-and-smaller-than-bloom-filters/

The authors claim they are both faster and produce smaller filters than Bloom filters. This probably holds for CRLite's use case as well?

BruceMaggs commented 4 years ago

It's possible that the size could be reduced, but there isn't a lot of room for improvement. If the sizes of the Bloom filters are set as described in the conference paper, the size of the filter cascade is within a factor of 1.44 of an easily proven lower bound.

Bruce


lemire commented 3 years ago

Note that xor filters can also be faster than Bloom filters. A Bloom filter requires k hash values and k memory accesses, whereas the xor filter uses a flat 3-wise model.
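
For concreteness, a minimal sketch of that query path (the hash and fingerprint functions here are placeholders, not the actual xor-filter library API):

```python
# Illustrative only: membership is the XOR of three flat table lookups,
# compared against the key's fingerprint -- exactly three slot accesses,
# versus k probes for a Bloom filter with k hash functions.
def xor_filter_contains(table, key, h0, h1, h2, fingerprint):
    return fingerprint(key) == (table[h0(key)] ^ table[h1(key)] ^ table[h2(key)])
```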

bitwiseshiftleft commented 2 years ago

Another possibility, similar to xor filters and specifically tuned for CRLs:

https://github.com/bitwiseshiftleft/frayed_cascade

These improve on CRLite's Bloom filters in two ways: by a factor of 1.44 by using "frayed ribbon filters" instead of Bloom filters, and by another factor of ~1.15 by using only a 2-level cascade instead of (log n)-level, so a typical full database would use about 40% less space than a CRLite one. The 1.44 factor is constant, while the 1.15 one mostly increases as more certs get revoked, so it's less than 1.15 for compressing daily updates, and much more if there's another Heartbleed. This filter is always within 11% of the Shannon entropy of the "revoked vs not" distribution of the certs, so to the extent that revocation is iid, it's almost the best you can do.
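
(For reference, the ~40% figure is just the two factors combined: 1 - 1/(1.44 * 1.15) ≈ 0.40.)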

In principle, frayed cascade has slower construction and faster queries than CRLite, but due to being written in C instead of Python it's faster for construction as well. However, the implementation also definitely needs testing and fuzzing -- it's currently research grade. But I'm happy to help out with this if you think the space and speed improvements are worthwhile.

Also, other researchers have found other good ways to compress dictionaries, and xor filters are not quite state of the art anymore. Ribbon filters and binary fuse filters are both faster to construct than frayed ribbon filters, and can also be used in a 2-level cascade structure, but they achieve 5-8% worse compression than frayed cascade.

thomasmueller commented 2 years ago

@bitwiseshiftleft I'm afraid I don't fully understand this topic yet (CRLite, frayed_cascade). In frayed_cascade, you wrote "The other ~10.9% comes from the encoding that handles non-uniform values." Is this because you encode ranges (0...x) in the form of a number of bits? If yes, then I wonder if non-power-of-two functions would help. E.g. xor filters support 10.666 bits, but really any range is possible (of course there is a performance penalty). The same can be applied to e.g. cuckoo filters, and probably (not sure) ribbon filters.

bitwiseshiftleft commented 2 years ago

Hi Thomas,

Something like that, yes. So for CRLite, you encode the CRL as a Bloom filter B1 of revoked certs. Bloom filters have false positives, but you know the universe of issued certs (issued by a certain collection of CAs as of a certain time) from CT logs, so you know all the false positives. So you can also create a Bloom filter B2 of all false positives in B1, but that also has false positives, so you need a B3, B4, etc. The filters get smaller by a certain factor every time, so after O(log n) levels there are no more false positives. To get the best compression, the false positive rates of the filters are carefully tuned.
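
For concreteness, a minimal Python sketch of that cascade construction (illustrative only, with a fixed untuned false positive rate; not the actual CRLite / rust-cascade code):

```python
import hashlib
import math

class Bloom:
    def __init__(self, n, fpr, level):
        # Standard sizing: m = -n*ln(fpr)/ln(2)^2 bits, k = (m/n)*ln(2) hashes.
        self.m = max(1, int(-n * math.log(fpr) / (math.log(2) ** 2)))
        self.k = max(1, round(self.m / n * math.log(2)))
        self.level = level                      # salt, so each level hashes differently
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{self.level}:{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def build_cascade(revoked, valid, fpr=0.5):
    """B1 holds the revoked certs, B2 the false positives of B1, and so on,
    until a level produces no false positives against the known universe."""
    layers, include, exclude, level = [], set(revoked), set(valid), 0
    while include:
        bf = Bloom(len(include), fpr, level)
        for item in include:
            bf.add(item)
        false_positives = {x for x in exclude if x in bf}
        layers.append(bf)
        include, exclude = false_positives, include   # next level encodes the FPs
        level += 1
    return layers

def is_revoked(layers, cert):
    for depth, bf in enumerate(layers):
        if cert not in bf:
            # Rejected at an even level => not revoked; at an odd level => revoked.
            return depth % 2 == 1
    return len(layers) % 2 == 1   # matched every level; the last one has no FPs
```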

For Frayed Cascade, you can use a static function as an approximate set (i.e. a Bloom filter replacement that's more efficient by a factor of log 2). So you can encode the CRL as a frayed ribbon filter F1 representing an approximate set of revoked certs, again with an appropriately tuned false positive rate. (If approximately 50% of certs are revoked, then the optimal false positive probability is 100% so the first layer is skipped.) The second level is then a function that maps { false positives in F1 => 0; true positives in F1 => 1 }, again encoded using frayed ribbons. So you don't need a third level. This technique is more efficient than the log-n-level filter cascade, but it doesn't reach the Shannon entropy. The worst case is about 10.9% overhead vs Shannon, which occurs when about 20% of certs are revoked.
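
A rough sketch of that two-level structure, assuming a hypothetical StaticFunction type (standing in for a frayed ribbon / xor / ribbon filter used as a static function; this is not the frayed_cascade API):

```python
# Hypothetical StaticFunction(mapping, bits=b): query() returns mapping[key]
# for every key it was built from, and an arbitrary b-bit value otherwise.

def build_two_level(revoked, valid, k):
    # Level 1: a k-bit static function defined only on revoked certs, all mapped
    # to 0. A non-revoked cert also returns 0 ("false positive") w.p. ~2^-k.
    level1 = StaticFunction({cert: 0 for cert in revoked}, bits=k)
    # Level 2: for everything level 1 accepts (revoked certs plus the known
    # false positives among the valid ones), record the true answer.
    positives = {cert: 1 for cert in revoked}
    positives.update({cert: 0 for cert in valid if level1.query(cert) == 0})
    level2 = StaticFunction(positives, bits=1)
    return level1, level2

def is_revoked(level1, level2, cert):
    return level1.query(cert) == 0 and level2.query(cert) == 1
```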

This technique can't be done with Bloom filters, but it can be done with xor filters, ribbon filters, binary fuse filters etc. because they work as static functions, not just approximate set membership. However, if I understand correctly, balanced ribbon filters (the variant with the least space overhead) support only approximate set membership and not static functions, so they can't be used for the second level of the cascade. Instead, standard ribbon filters (or xor+ filters, or binary fuse filters...) would need to be used for that stage, with an overhead more like 8%. Also, those libraries would need to be reworked slightly to support an arbitrary integer number of bits instead of just {8, 16, 10.666 etc}, but that's straightforward.

The space overhead can be improved from 10.9% to something like 9% (I don't remember the exact value) by supporting non-power-of-two functions, but it still doesn't reach the Shannon limit. To get the best compression you'd need to support mod-n filters for all n, and then compress the filter on the wire using arithmetic coding. However, since the worst cases are for a small number of bits, you might get most of the way there by implementing just mod-3 filters (1.58 bits) and maybe mod-5 filters (2.32 bits). This could be done with a fixed encoding such as 19 bits / 12 symbols or 65 bits / 41 symbols, but it won't align as cleanly as the 10.666-bit filters.

But for this case, the improvement in worst-case outcomes is only about 2% and the cost is having to implement matrix arithmetic and encoding over multiple rings, so I didn't do it.

For the more general case of multiple possible outputs according to an arbitrary distribution (e.g. if for some reason you wanted to encode revocation reasons), the generalization also uses at most 10.9% overhead, though it may take more than 2 levels. Of course for some cases arbitrary rings would be a big improvement: eg if you have 3 equally likely outputs then using a GF(3) map (1.58 bits / key-value pair) would of course be much better than building it out of GF(2) maps and an encoding (uses 5/3 bits/kvp ~ 5% worse). But again you'd need to support those extra rings.
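
(One way to read the 5/3 figure, assuming the obvious split: a 1-bit GF(2) map over all keys separating the first output from the other two, plus a second 1-bit map over the remaining 2/3 of the keys, gives 1 + 2/3 = 5/3 bits per key-value pair, about 5% above the 1.58-bit GF(3) optimum.)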

thomasmueller commented 2 years ago

Thanks a lot! A few random notes:

3-way binary fuse filters have an overhead of 12.5%, 4-way 7.5%, and e.g. 8-way would have less overhead but I don't know how low we could go - maybe it makes sense to test this.

The definitions, and numbers, are not completely clear to me... In the original CRLite paper, if I read correctly, there are 30 million valid, unexpired certificates, but 12.7 million are revoked. So the universe is 30 million? If yes, then with a static function over the valid, unexpired certificates, that should be around 30 million bits (plus overhead, e.g. 7.5% for a 4-way binary fuse filter, but we can get within a few percent with 8-way or so); that would be 3.9 MB or 3.7 MiB. CRLite is 10 MB.
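
Rough arithmetic behind that estimate (the ~4% overhead below is just an assumed stand-in for a hypothetical higher-arity filter):

```python
universe = 30_000_000        # valid, unexpired certificates
bits = universe * 1.04       # ~1 bit per key, plus an assumed ~4% overhead
print(bits / 8 / 1e6)        # ≈ 3.9  (MB)
print(bits / 8 / 2**20)      # ≈ 3.7  (MiB)
```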

Is the "universe" (certificates we need to care about) the valid, unexpired certificates, or does it include all certificates (around 184 million when the CRLite paper was written)?

CRLite uses Bloom filters. I assume they use the "optimal" k, so the overhead is 44% per Bloom filter. But if e.g. k=1 is used, then the Bloom filter could be compressed over the wire (e.g. with arithmetic coding) to a lower overhead, I believe. So I wonder how important compression over the wire is, vs. compression on disk / in memory (at runtime).

CRLite uses log(n) Bloom filters. That sounds like BBHash, a minimal perfect hash algorithm (which needs about 3 bits / key). I wonder if we indeed create a minimal perfect hash function here; it sounds like it. There are other minimal perfect hash functions that need much less space, e.g. RecSplit. The theoretical lower bound for a static function built this way would be much higher than with xor / binary fuse / ribbon filters, as we would need 1.44 bits / key plus the array of function results (which could be compressed with arithmetic coding).

Frayed Cascade uses one or two levels. To me it sounds like a simpler design. But I don't currently understand part of the overhead you talk about... e.g. I don't understand what you mean by "mod-n filters": is that a static function that can return a value between 0 (inclusive) and n (exclusive)? Don't you need just 0 (valid) vs. 1 (invalid)? I see that it's likely not a 50/50 split between 0 and 1, but that's still different from a static function that can return values other than 0 and 1.

bitwiseshiftleft commented 2 years ago

Huh, yeah the original CRLite paper has very different numbers. The more recent blog post has 750k revoked from a universe of 100 million. In the paper case it's best to use a static function directly, since the entropy is about 0.98 bits. And indeed that's what frayed_cascade does on toy data of this size: it skips the first level and uses a single static function, taking up 1.001 bits per key = 3.75 MB = Shannon + 1.8%. A binary fuse filter would be only slightly larger. CRLite performs worse the more certs are revoked, so that's why it's so much larger in this case.
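
For reference, these figures can be reproduced from the paper's numbers (12.7 million of 30 million revoked):

```python
import math

p = 12.7 / 30                                        # revoked fraction in the paper
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)   # ≈ 0.983 bits per cert
print(H)
print(30e6 * 1.001 / 8 / 1e6)                        # 1.001 bits/key ≈ 3.75 MB
print(1.001 / H - 1)                                 # ≈ 0.018, i.e. Shannon + 1.8%
```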

The universe is only unexpired, valid-except-possibly-revoked certificates from participating CAs -- I understand this to be CAs who publish CRLs in a timely fashion and participate in cert transparency.

It might be worth using a different CRLite (or improved version) filter per CA. They will have different user bases, different revocation policies etc, and thus different revocation rates. Any information about the revocation probability of a cert will help compression. E.g. if one CA attracts shady users that get revoked a lot, or if it revokes certs a few days after you request a new one as a matter of policy (I think maybe Apple's developer CA does this??), then it will have lots of revocations, whereas maybe Let's Encrypt issues zillions of certs but doesn't revoke them often because they're short-lived (but they don't publish a CRL so I have no idea).

CRLite uses optimal uncompressed Bloom filters. Using compressed Bloom filters would also work, but if you're going to reimplement CRLite there's no point in using Bloom filters anymore, since these other filters work better ... except maybe if there's some way to rescue CRLite's incremental update scheme. Their update scheme can't use less bandwidth than sending down a daily filter of "what certs were revoked today" and simply querying all daily filters since either a full download or since the cert was issued; and it probably won't use less disk space either if it's using compressed Bloom filters; but maybe it would have faster queries.

I agree that minimal perfect hashing is not ideal for this use case.

Frayed cascade is optimized for {0,1} output but it does support wider ranges, since it's designed with CRLite in mind but not exclusively for that case. For the {0,1} case, if the revocation fraction is less than 1/3, it has two levels: first an approximate set membership filter using k bits per revocation achieving a 2^-k false positive rate; and then a static function which tells you whether a positive result is a false positive or a true positive using 1 bit per (true or false) positive. The approximate set membership filter is just a static function which returns k bits, but always returns (eg) 0 for revoked certs, and should return an arbitrary result for non-revoked ones. Here the false positive rate 2^-k must be tuned for optimal compression, as something like k = floor(log2(nonrevoked / revoked)). (With revoked > 1/3, this gives k=0, i.e. the first level is skipped.)
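
Plugging in the blog-post numbers from above (750k revoked out of a universe of 100 million) gives a feel for the sizes; this ignores the filters' own few-percent space overhead:

```python
import math

revoked, universe = 750_000, 100_000_000
nonrevoked = universe - revoked
k = math.floor(math.log2(nonrevoked / revoked))   # k = 7
level1_bits = k * revoked                         # ≈ 5.25 Mbit for the first level
false_positives = nonrevoked * 2 ** -k            # ≈ 0.78 M expected false positives
level2_bits = revoked + false_positives           # 1 bit per (true or false) positive
p = revoked / universe
shannon = universe * (-p * math.log2(p) - (1 - p) * math.log2(1 - p))
print((level1_bits + level2_bits) / 8e6)          # ≈ 0.85 MB total
print(shannon / 8e6)                              # ≈ 0.80 MB Shannon bound
```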

If the false positive rate could be tuned to something other than a power of 2, you'd get slightly better compression in some cases. For this, a mod-n filter could be used to get a non-power-of-2 false positive rate, e.g. a static function that returns {0,1,2,3,4} but always returns 0 for revoked certs. Likewise xor filters with 10.666 bits per key are really just doing all the math mod 1625 = 2^10.666..., and packing 3 such coefficients into each 32-bit word. But since the worst cases for frayed_cascade are for rather larger p (eg p = 1/5 or 1/9, not 1/1625) you'd want mod-3 or mod-5 filters and not mod-1625.
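
To picture the packing (an illustration, not the xor-filter library's exact layout): three base-1625 fingerprints fit in one 32-bit word, since 1625**3 = 4,291,015,625 <= 2**32.

```python
BASE = 1625   # ≈ 2**10.666; BASE**3 fits in a 32-bit word

def pack3(d0, d1, d2):
    assert all(0 <= d < BASE for d in (d0, d1, d2))
    return d0 + BASE * d1 + BASE * BASE * d2

def unpack3(word):
    return word % BASE, (word // BASE) % BASE, (word // (BASE * BASE)) % BASE
```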

But anyway this only saves a couple percent in the worst case, and it's kind of a pain, so I didn't do it.

thomasmueller commented 2 years ago

@bitwiseshiftleft thanks a lot for your time! I think I now fully understand the problem. What is needed (on a high level) is a static function for 100 million entries that returns 1 in 0.7% of the cases and 0 in 99.3% of the cases. You do this with an approximate membership filter, plus a static function to correct the false positives. Yes this all makes sense.

I wonder: could this be done with just one static function over the whole 101 million entries, and then compression (arithmetic coding / ANS)? I guess you have already thought about that, right, and came to the conclusion that it won't work :-)

bitwiseshiftleft commented 2 years ago

Oh also: I have no idea if something like 8-way binary fuse filters would work. It's unpublished, but as I understand it, it's still based on hypergraph peeling, which depends on some hash buckets being almost-empty. At least for xor filters, the global optimum is actually 3-way, so I wouldn't be surprised if 4-way is the global optimum for BFFs. Could be worth asking the authors of BFF for insight.

bitwiseshiftleft commented 2 years ago

I wonder: could this be done with just one static function over the whole 101 million entries, and then compression (arithmetic coding / ANS)? I guess you have already thought about that, right, and came to the conclusion that it won't work :-)

Well, not so much a conclusion, but I didn't figure out how to do it. But I'm actually mainly a cryptographer and not a data structures researcher so maybe there is a more efficient way that I just don't know / couldn't figure out.

lemire commented 2 years ago

@thomasmueller : I am game to investigate this issue further if you are (in 2022).

bitwiseshiftleft commented 2 years ago

Oh also: I have no idea if something like 8-way binary fuse filters would work. It's unpublished, but as I understand it, it's still based on hypergraph peeling, which depends on some hash buckets being almost-empty. At least for xor filters, the global optimum is actually 3-way, so I wouldn't be surprised if 4-way is the global optimum for BFFs. Could be worth asking the authors of BFF for insight.

... wait a minute, you all are the authors of BFF. So you know better than I do on this. How do BFFs work anyway? Skimming the code it looks like if you choose the indices "just so" then the graph remains peelable even with a lower load?

thomasmueller commented 2 years ago

How do BFFs work anyway?

The theory behind this is found in https://arxiv.org/pdf/1907.04749.pdf "Dense Peelable Random Uniform Hypergraphs" - here you see that the overhead for 7-way would be about 3% over the lower bound.

BTW I tried "static function over the whole 101 million entries, and then compression"... with xor and fuse filters at least, compression didn't work: even if there is just one entry that has a different value than all the others, I couldn't compress the filter.

So what you do seems to be the best possible solution, as far as I can tell. I could only imagine that a static function built from a perfect hash might work. (It wouldn't be a minimal perfect hash... it would be similar to a k-perfect hash function.) I don't think it would be any smaller than what you do, however.

bitwiseshiftleft commented 2 years ago

That's super cool. It's simpler (to generate anyway) and performs better than frayed ribbons, so it's definitely a reasonable tradeoff to take a 3-7% space hit for many applications.

I wonder if it would significantly help with frayed ribbons to make them not wrap around, like fuse filters? It might improve performance but I guess the generation code wouldn't be any simpler.