tc39 / proposal-built-in-modules

BSD 2-Clause "Simplified" License

FAQ: Why not use SRI-based caching instead? #18

Open littledan opened 5 years ago

littledan commented 5 years ago

This idea is cited so frequently that it might be worth including in an FAQ. The idea: instead of built-in modules, use some form of SRI-based caching that works across origins (with fixes for versioning and rollbacks). This could bring the performance benefits of built-in modules while avoiding a bias toward browser-provided modules, deferring to the ecosystem instead.

There are two issues I know of with this approach:

If anyone has good references on these topics, it'd be good to include them.

cc @lukewagner

ljharb commented 5 years ago

What are “origins” in a browser-agnostic context? or would the approach only be addressing browsers?

Mouvedia commented 5 years ago

By SRI, do you mean Subresource Integrity?

littledan commented 5 years ago

See past discussion of this idea at https://github.com/w3c/webappsec-subresource-integrity/issues/22

@Mouvedia Yes.

@ljharb This is an idea for browsers. For Node or other embedders, the hope might be to enable bytecode caching at all, since they don't have to worry about origin separation. @joyeecheung is working on this (starting with Node core itself).

bmeck commented 5 years ago

This seems a generic issue for all modules. What makes builtin modules special here?

littledan commented 5 years ago

@bmeck What makes this related to built-in modules is that part of the motivation for built-in modules is to reduce the overhead of downloading them over the network. Cross-origin SRI-based caching is a potential mitigation for that same download-size issue, but it unfortunately doesn't seem to be feasible.

bmeck commented 5 years ago

@littledan I'm still not sure I understand why this differs from other modules; I guess I'll wait on it. The bandwidth/time savings being similar for the ecosystem is fine, but I remain unclear on why SRI is problematic for built-in modules.

littledan commented 5 years ago

@bmeck (Sorry, I misunderstood your question.) Yes, you're right, cross-origin SRI-based caching faces this barrier whether or not it's caching something that's part of a built-in module polyfill.

tabatkins commented 5 years ago

So, the problems I've come to understand that people have with SRI-based cross-origin caching:

  1. Timing attacks. The full set of libraries that a given site uses tends to be fairly unique; while lots of sites might load jQuery (ignoring all the different versions for a moment...), the full set of additional libraries they load tends to form a pretty unique fingerprint. As such, a hostile page that loads up a whole bunch of libraries and times them to figure out which came from cache and which hit the network would function as a pretty effective determiner of which sites the user has recently visited. (This is effectively a single-use attack; once one page does this, it poisons the cache for any other page trying to do it. But it's still considered dangerous.)

  2. A library that is likely to be cached is more attractive to use than one which probably needs to be fetched from the network; this encourages a minor "the rich get richer" effect where popular libraries remain popular because they're popular, and newer better libraries have trouble gaining traction.

  3. Cache-poisoning attacks. If you can engineer a hostile file that has the same SRI hash as a popular library, you can feed it to users and then have it unexpectedly loaded on other sites, getting a persistent XSS on them without those sites doing anything wrong. While the hashes SRI uses aren't expected to be attackable in this way in the reasonable future, things sometimes change!
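Point 1 can be made concrete with a toy model (hypothetical hash names, and hit/miss read directly rather than inferred from load timing as a real attack would):

```javascript
// Toy model of a single-keyed, cross-origin cache: the attacker page
// probes a list of library hashes and reads hit/miss as a bitstring.
// The resulting pattern acts as a fingerprint of the browsing history
// that populated the cache.
const sharedCache = new Set([
  "sha384-jquery",       // cached earlier by some visited site
  "sha384-lodash",
  "sha384-rare-plugin",  // a rarely used library: highly identifying
]);

function probe(hashes) {
  // In a real attack, hit vs. miss would be inferred from timing.
  return hashes.map((h) => (sharedCache.has(h) ? 1 : 0)).join("");
}

const fingerprint = probe([
  "sha384-jquery",
  "sha384-react",
  "sha384-rare-plugin",
  "sha384-vue",
]);
console.log(fingerprint); // "1010"
```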


I've given a lot of thought to these, and I think 1 and 2 can be reasonably mitigated by imposing a degree of randomness on the caching behavior. Basically:

  1. Randomly expire libraries from the cache regularly, increasing false-negative errors. Having almost all of the libraries on your page pre-cached automatically is still very worthwhile.
  2. Randomly pre-load libraries into the cache based on usage data, increasing false-positive errors. Prefer libraries with low usage among sites, but used on sites with high usage. (This requires a degree of use-tracking, which would fall under the existing anonymous stat collection browsers already do.)
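Mitigation 1 above can be sketched like so (the eviction probability is an assumed tuning knob, not from any spec):

```javascript
// Sketch of randomized cache expiry: each entry is independently
// evicted with some probability per maintenance pass, so a probing
// page can no longer treat a miss as proof the user never loaded
// that library.
const EVICT_PROBABILITY = 0.1; // assumed value, purely illustrative

function randomlyExpire(cache) {
  for (const key of [...cache.keys()]) {
    if (Math.random() < EVICT_PROBABILITY) cache.delete(key);
  }
}

const cache = new Map([
  ["sha384-jquery", "cached-bytes"],
  ["sha384-lodash", "cached-bytes"],
]);
randomlyExpire(cache);
console.log(cache.size); // 0, 1, or 2 — misses no longer prove anything
```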

However, I don't see how to mitigate 3 without double-keying, which defeats the entire point. That said, we can counter-intuitively recover most of the cross-origin benefits in a double-keyed world if we just continue to use CDNs to load libraries; if everyone gets the library from the same 3rd-party origin, then keying to that origin doesn't defeat caching.
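For contrast, a sketch of what double-keying means: the cache key includes the top-level site as well as the resource URL, so the same CDN resource cached under one site is a miss under another (the origins and URL here are made up):

```javascript
// Double-keyed cache sketch: entries are keyed by
// (top-level site, resource URL), not resource URL alone, so one site
// cannot observe what another site has cached.
const cache = new Map();
const key = (topSite, url) => `${topSite}|${url}`;

function put(topSite, url, body) {
  cache.set(key(topSite, url), body);
}
function get(topSite, url) {
  return cache.get(key(topSite, url));
}

put("https://a.example", "https://cdn.example/lib.js", "lib-bytes");
console.log(get("https://a.example", "https://cdn.example/lib.js")); // "lib-bytes"
console.log(get("https://b.example", "https://cdn.example/lib.js")); // undefined: b.example must refetch
```

This is why, as noted above, concentrating loads on a shared CDN origin recovers some of the benefit: within each top-level site, the CDN entry is still cached after first use.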

littledan commented 5 years ago

Hmm, I don't know what kind of math to use to understand the relationship between the "amount" of privacy preserved vs the performance degradation due to those two techniques... This reminds me a bit of checking 1/100 of the people in the airport and letting through 99% of the risk.

bkardell commented 5 years ago

@tabatkins Last I checked, though, people don't load these from the same origin, and cache misses are pretty high, right? Wouldn't this require some kind of "official" URL in order for that to actually work out?

tabatkins commented 5 years ago

An "official" url would help, sure. But centralizing effects would occur regardless, due to the value of using the same CDN origin as others. Right now there's not any particular reason to centralize.

jikkujose commented 5 years ago

@tabatkins Wow, I was thinking exactly the same thing. I believe the biggest problem with this whole idea is privacy: depending on what gets downloaded, an observer could get a fair understanding of a user's browsing history.

Cache poisoning: is this a serious issue in browser contexts? Manually replacing already-downloaded files is close to impossible, and couldn't a hash collision for something like this be mitigated by switching to a stronger hashing algorithm?