I hope this is a reasonable place to comment. (If not please tell me where to go.)
I've been working on content addressing systems for several years. I understand that content addresses, which are "locationless," are inherently in conflict with the same-origin policy, which is location-based.
An additional/alternate solution is for a list of acceptable hashes to be published by the server at a well-known location.
For example, the user agent could request https://example.com/.well-known/sri-list, which would return a plain text file with a list of acceptable hashes, one per line. Hashes on this list would be treated as if they were hosted by the server itself, and thus could be fetched from a shared cache while being treated, for all intents and purposes, as if they had been fetched from the server in question.
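For concreteness, a hypothetical sri-list served from https://example.com/.well-known/sri-list might look like the following. The one-hash-per-line format is just this proposal's suggestion, and the hash values are placeholders:

    sha256-47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=
    sha384-oqVuAfXRKap7fdgcCY5uykM6+R9GqQ8K/uxy9rx7HNQlGYl1kPzQho1wx4JwY8wC
    sha512-z4PhNX7vuL3xVChQ1m2AB9Yg5AULVxXcg/SpIdNs6c5H0NE8XYXysP+DGNKHfuwvY7kxvUdBeoGlODJ6+SfaPg==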
This does add some complexity both for user agents and for site admins. On the other hand, the security implications are well understood, and wouldn't require new permission logic.
Thanks for your work on SRI.
An interesting idea (although I know many folks who are vehemently against well-known location solutions, but I won't pretend to fully grasp why). If implemented, though, it would still require a round trip to get .well-known/sri-list, right? Which seems to lose a lot of the benefit of these acting as libraries.
Another suggestion, that I think I heard somewhere, is, if the page includes a CSP, only use an x-origin cache for an integrity attribute resource if the CSP includes the integrity value in the script-hash whitelist. I think this would address @mozfreddyb's concerns listed in Synzvato/decentraleyes#26, but I haven't thought too hard about it. On the other hand, it also starts to look really weird and complicated :-/
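For illustration, the gating might look something like this (a sketch of that suggestion, not specified behaviour anywhere): the user agent would only satisfy the request from a cross-origin shared cache if the page's CSP lists the same hash that appears in the integrity attribute.

    Content-Security-Policy: script-src 'self' 'sha384-oqVuAfXRKap7fdgcCY5uykM6+R9GqQ8K/uxy9rx7HNQlGYl1kPzQho1wx4JwY8wC'

    <script src="https://cdn.example.com/jquery.min.js"
            integrity="sha384-oqVuAfXRKap7fdgcCY5uykM6+R9GqQ8K/uxy9rx7HNQlGYl1kPzQho1wx4JwY8wC"
            crossorigin="anonymous"></script>

If the integrity value were absent from the CSP hash list, the resource would simply be fetched from the network as it is today.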
Also, these solutions don't address timing attacks with x-origin caches. Although, as a side note, someone recently pointed out to me that history timing attacks in this case are probably not too concerning from a security perspective since it's a "one-shot" timing attack. That is, the resource is definitively loaded after the attack happens, so you can't attempt the timing again, and that makes the timing attack much more difficult to pull off, since timing attacks usually rely on repeated measurement.
Using a script-hash whitelist in the HTTP headers (as part of CSP or separately) is better for a small number of hashes, since it doesn't require an extra round trip. Using a well-known list is better for a large number of hashes, since it can be cached for a long time.
I agree that well-known locations are ugly. Although it works for /robots.txt and /favicon.ico, there is a high cost for introducing new ones.
The privacy problem is worse than timing attacks: if you control the server, you can tell that no request is ever made. This seems insurmountable for cross-origin caching.
Perhaps the gulf between hashes and locations is too large to span. For true content-addressing systems (like what I'm working on), my preference is to treat all hashes as a single origin (so they can't reference or be referenced by location-based resources).
Thanks for your quick reply!
I'd be slightly more interested in blessing the hashes for cross-origin caches by mentioning them in the CSP. .well-known would add another round trip, and I'm not sure whether that would hamper the performance benefit we wanted in the first place.
The idea to separate hashed resources into their own origin is interesting, but I don't feel comfortable drilling holes that deep into the existing weirdness of origins.
To be clear, giving hashes their own origin only makes sense if you are loading top-level resources by hash. In that case, you can give access to all other hashes, but prohibit access to ordinary URLs. But that is a long way off for any web browsers and far from the scope of SRI.
For the record, @hillbrad wrote a great document outlining the privacy and security risks of shared caching: https://hillbrad.github.io/sri-addressable-caching/sri-addressable-caching.html
That document doesn't appear to consider an opt-in approach. While this would reduce the number of people who do it, it could be quite useful.
<script src=jquery.js integrity="..." public/>
This tag should only be put on scripts for which timing is not an issue. Of course, deciding what is public is now the responsibility of the website. However, since the benefit would be negligible for anything that is site-specific, this might be pretty clear. For example, a script specific to my site has a single URL anyway, so I may as well not mark it public; otherwise malicious sites could figure out who has been to my site recently even though I get no benefit from the content-addressed cache. If I am including jQuery, on the other hand, there will be a benefit, because there are many different copies on the internet, and at the same time knowing whether a user has jQuery in their cache is much less identifying.
That being said, if FF had a way to turn this on now I would enable it; I don't see the privacy hit as being large, and the performance would be nice to have.
If I want to use the presence of my script in a shared cache to track you illicitly, I will deliberately set the public flag, even if the content isn't actually public.
On 21/12/16 01:07, Brad Hill wrote:
If I want to use the presence of my script in a shared cache to track you illicitly, I will deliberately set the public flag, even if the content isn't actually public.
If you want to track me and you control both origins you want to track me from you can just use the same URL and you get a cookie which is better tracking and works today.
This is about preventing a third-party site from having a script with the same hash as, for example, a script on Facebook, and then being able to tell if you have been to Facebook "recently". However, since Facebook hosts the script, they won't set it as "public", and so it won't be a problem.
I don't understand what threat you are trying to protect against.
A "public" flag seems like a good solution to me. It seems to encapsulate both the benefits and the drawbacks of shared caching. It says, "yes, you can share files publicly, but that means anyone can see them."
That said, if it's opt-in, there's the question of how many sites would actually use it, and whether it's worth the trouble. Especially if it has to be set in HTML, rather than say by CDNs automatically. Maybe it would work better as an HTTP header?
Setting it in the HTML doesn't seem to be a big problem. If large CDN providers include this in their example script/style tags, then sites will copy and paste support for it. A similar approach is currently being used for SRI itself, and although adoption isn't as fast as I'd like, usage will slowly grow. Sites that are also looking for those extra performance boosts would be keen to implement it.
The idea of a public header (or even another key in Cache-Control) sounds quite interesting and elegant. However, I think it would make this more difficult to use, since one significant use case is to let each site point to its own copy of a script rather than a centrally hosted one. That means each site would have to add headers to some of its scripts, rather than just make a modification in HTML. Neither is a huge barrier, but static site hosting often makes it difficult to set headers, especially for a subset of paths.
At the end of the day I have no major objections to either option, though.
@kevincox Yes, I was suspecting that Cache-Control: public might be appropriate. It seems like the HTTP concept of a "shared cache" is fundamentally equivalent to SRI shared caching. See here for the definitions of public and private: https://tools.ietf.org/html/rfc7234#section-5.2.2.5
The Cache-Control security concerns (cache poisoning, accidentally caching sensitive information) are prevented by hashing. The only remaining security consideration is information leaks, which Cache-Control: public seems to address.
I'm not opposed to using an HTML attribute instead, but I think it's good to reuse existing mechanisms when they fit. Caching has traditionally been controlled via HTTP, not HTML.
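As a sketch, an origin opting in over HTTP might serve something like the following. The idea that Cache-Control: public would also opt the response into an SRI shared cache is this thread's speculation, not current HTTP semantics:

    HTTP/1.1 200 OK
    Content-Type: application/javascript
    Cache-Control: public, max-age=31536000, immutable

The user agent would still only reuse the cached bytes cross-origin when the embedding page supplies a matching integrity attribute.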
There are a few other ways to break this down. For example, should this also apply to non-HTTP (file:, data:, ftp:, etc.) resources? (There's an argument for shared caching across protocols, which an HTTP header wouldn't really help with; on the other hand, caching doesn't make much sense for some protocols.)
I think that thinking about it in terms of "which method is easier for non-expert webmasters to deploy?" is likely to lead to a suboptimal solution. Yes, some people don't know how to set HTTP headers, and some hosts don't let users set them, but in that case they are already stuck with limited caching options. Unless we're going to expose all of Cache-Control via HTML.
@btrask A website highly concerned about privacy and loading <script src='/uncommon-datepicker.jquery.js' integrity="sha....." /> will want to make sure that uncommon-datepicker.jquery.js is never loaded from the shared cache. Whether the shared cache should be used or not is to be controlled by the website using the resource and not by the server who first delivered the resource.
@brillout: Yes, good point. Using a mechanism not in the page source defeats the purpose, when the page source is the only trusted information. Thanks for the tip!
@metromoxie @mozfreddyb @kevincox @ScottHelme
Are we missing any pieces?
The two concerns are privacy and CSP.
Solution to privacy: we can make the shared cache opt-in via an HTML attribute. I'd say that would be enough. (If we want more protection, browsers could add a resource to the shared cache only when many domains use that resource, as described in https://hillbrad.github.io/sri-addressable-caching/sri-addressable-caching.html#solution and https://github.com/w3c/webappsec/issues/504#issuecomment-261755369.)
Solution to CSP: the UA should treat scripts loaded from the shared cache as inline scripts. (As described here: https://github.com/w3c/webappsec/issues/504#issuecomment-166458562.)
It would be super exciting to be able to use a bunch of web components using different frontend frameworks behind the web component curtain: a date picker using Angular, an infinite scroll using React, and a video player using Vue. This is currently prohibitive KB-wise, but a shared cache would allow it.
And with WebAssembly, library sizes will get bigger, increasing the need for such a shared cache.
@nomeata Funny to see you on this thread, the world is small
An opt-in privacy leak isn't a great feature to have.
How about opt-in + a resource is added to the shared cache only after the resource has been loaded by several domains?
I don't think that really helps as the attacker can purchase two domains quite easily.
Yes, it can't be n domains where n is predefined. But making n probabilistic makes it considerably more difficult for an attack to be successful. (E.g. the last comment at https://github.com/w3c/webappsec/issues/504#issuecomment-166458562.)
CSP has (is getting?) a nonce-based approach. IIUC the concern with CSP is that an attacker would be able to inject a script that loaded an outdated/insecure library through the cache, thus bypassing controls based on origin. However requiring nonces for SRI-based caching seems to solve this issue as the attacker wouldn't know the nonce; it also creates a performance incentive for websites to move to nonces, which are more secure than domain whitelists for the same reason[1].
I think it's possible that we could solve the privacy problem by requiring a certain number of domains to reference the script... it'd be really useful to have some metrics from browser telemetry here. For example if we determined that enough users encountered e.g. a reference to jQuery in >100 domains for that to be the minimum, it might be that we could load things from an SRI cache if they had been encountered in 100+ distinct top-level document domains (i.e. domains the user explicitly browsed to, not that were loaded in a frame or something). The idea being that because of the top-level document requirement, the attacker would have to socially engineer the user into visiting 100 domains, which would be very, very difficult. However if telemetry told us that 100 is too high a number and it's actually more like 20 for a particular jQuery version, that'd be a different story.
[1]: consider e.g. being able to load an insecure Angular version from the Google CDN because the site loaded jQuery from the Google CDN
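A rough sketch of the browser-side bookkeeping that threshold idea implies; the threshold value, data structures, and function names below are all hypothetical, not part of any spec:

    // Hypothetical sketch: only serve a hash from the shared cache once it has
    // been seen referenced from "enough" distinct top-level sites the user
    // actually navigated to.
    const MIN_TOP_LEVEL_SITES = 100; // telemetry would be needed to pick a real value

    // integrity hash -> set of top-level sites that referenced it
    const seenOn = new Map<string, Set<string>>();

    function recordReference(integrity: string, topLevelSite: string): void {
      let sites = seenOn.get(integrity);
      if (!sites) {
        sites = new Set<string>();
        seenOn.set(integrity, sites);
      }
      sites.add(topLevelSite); // only counts top-level documents, not framed loads
    }

    function mayUseSharedCache(integrity: string): boolean {
      const sites = seenOn.get(integrity);
      return sites !== undefined && sites.size >= MIN_TOP_LEVEL_SITES;
    }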
For example, the user agent could request https://example.com/.well-known/sri-list, which would return a plain text file with a list of acceptable hashes, one per line.
For some domains that file could be too large and change too often. Consider Tumblr's image hosting (##.media.tumblr.com) where each of the domain names host billions of files and the list changes every second.
How about something similar to HTTP ETag but with a client-specified hash algorithm. If the hash is correct you only get a response affirming as much instead of the entire file, which the browser can cache. It doesn't save you the round trip but it saves you the data.
RFC 3230: Instance Digests in HTTP defines a Digest header and a Want-Digest header that work exactly this way... or were meant to.
This would get the 304 Not Modified style of responses, but it's still limited to a single URL check.
Maybe it (or something like it), coupled with the Immutable header, could be used to populate some amount of caching or "permanence," but the model is still about the "given name" of the object (its URL) and not about its intrinsic identification (its content hash).
Caching is one use case for these things, but the Web could also benefit from some "object permanence" where possible and appropriate.
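Roughly, an RFC 3230 exchange looks like this (values are placeholders); the 304-style "headers only if the digest matches what you already have" behaviour discussed above would be an extension on top of it, not something the RFC itself provides:

    GET /jquery.min.js HTTP/1.1
    Host: cdn.example.com
    Want-Digest: sha-256

    HTTP/1.1 200 OK
    Content-Type: application/javascript
    Digest: sha-256=X48E9qOokqqrvdts8nOJRJN3OWDUoyWxBf7kbu9DBPE=
    Cache-Control: max-age=31536000

    ...response body...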
I don't see the benefit from Want-Digest. If the client has a whitelisted digest and the content backing it, why bother the server? There are three possible responses, and this would mean waiting around for a response that can only make the situation worse.
However if telemetry told us that 100 is too high a number and it's actually more like 20 for a particular jQuery version, that'd be a different story.
Even if 100 is too high a number today, the load time advantages of using a popular version of the library could quickly push the usage of a specific version over that limit. Browser telemetry today might not be representative of the situation after shared caching has been rolled out.
The discussion so far seems to assume JS libraries e.g. jQuery as the canonical use case.
I'd like to add web fonts as another use case of widely shared large subresources which could benefit from cross-domain cache.
I'd think the security risks for fonts are milder, though the privacy implications might be similar. I'm talking about font files themselves, not CSS — font CSS is small, and malicious CSS is dangerous.
Note that CSS does not yet support SRI at all on font file urls: https://github.com/w3c/webappsec-subresource-integrity/issues/40, https://github.com/w3c/webappsec/issues/306 Note also that in practice optimized font delivery varies by browser, for example Google Fonts doesn't want to support SRI: https://github.com/google/fonts/issues/473. (This is not a blocker for hashing & sharing, just a tradeoff...)
Couldn’t you just embed the font as data-uri in the CSS? With shared caching that would be efficient.
On 03/07/2019 04:23 PM, Arne Babenhauserheide wrote:
Couldn’t you just embed the font as data-uri in the CSS? With shared caching that would be efficient.
This doesn't help at all when the user first downloads your CSS, as they get the whole font. With shared caching they can avoid redownloading the font if they already have it, for example because that font was used on another site or because they visited your site with a previous version of the CSS.
It helps on the second access. By externalizing the font loading to a self-contained CSS file with SRI-secured shared caching, the download could then be cached across multiple sites.
Yes, it would not be as good as specifying the integrity tag directly on the font, but the same is true for images and other resources, so I don’t see this as a blocker.
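A minimal sketch of that approach, assuming a hypothetical self-contained fonts.css; URLs, hash, and font data are placeholders:

    <link rel="stylesheet" href="https://fonts.example.com/fonts.css"
          integrity="sha384-oqVuAfXRKap7fdgcCY5uykM6+R9GqQ8K/uxy9rx7HNQlGYl1kPzQho1wx4JwY8wC"
          crossorigin="anonymous">

    /* fonts.css: the font bytes are inlined, so the stylesheet's hash covers them */
    @font-face {
      font-family: "Example Sans";
      src: url("data:font/woff2;base64,d09GMgABAAAA...") format("woff2");
    }

Because the stylesheet is self-contained, a shared cache keyed on its hash would effectively share the font as well, without needing SRI support on font URLs inside CSS.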
I had been thinking about shared caches for several months, googling proposals and not finding anything before this. I am extremely excited about shared caches and the opportunities they enable; for example, they would help TC39 identify more in-demand libraries to include in the standard library.
I have an idea about whitelisting hashes: there are cryptographic accumulators (https://en.wikipedia.org/wiki/Accumulator_(cryptography)), so you could pass a single accumulator of all integrity hashes in the CSP header, saving bandwidth. I basically found it because I had an instinct that something like this should be possible, which led me to https://crypto.stackexchange.com/questions/22410/hash-which-can-be-used-to-verify-one-of-multiple-inputs, but I'm not sure it's 100% applicable.
If you want to track me and you control both origins you want to track me from you can just use the same URL and you get a cookie which is better tracking and works today.
This was reasonable in 2016, but it's different now: browsers are partitioning caches (Safari has launched this; Chrome and Firefox are in progress), browsers are reconsidering third-party cookies, and what we're talking about here could allow new ways of cross-site tracking.
I'm afraid Jeff is right. He even wrote a good summarizing blog post about the potential deprecation of shared caching.
While I don't know the timeline for this change, it seems rather unlikely that we can ever consider a shared cache :confused:
But what about caching artifacts that are explicitly marked as "public artifacts" by the website owner?
Website owners do not get to decide over end user privacy.
In my opinion, they already do. All the Googles and Facebooks already have all the user data and sell it to third parties. They also have all the engineering power and a monopoly on user attention, so a shared cache wouldn't really benefit them. But a shared cache would enable small websites, indie game devs, and others to leverage caching and build more ambitious websites, which should benefit the end user. To be fair, though, I have only a vague idea of what the cache-hit rate would be.
Would this also apply with a standard source of allowed artifacts shipped with the browser? The release cadence of Firefox is 4 weeks now, so the delay until a resource is in the allowed list would be short.
An obvious first candidate would be jQuery. Others with larger gains would be frameworks like vue.js.
Developers who use npm/babel already specify dependencies by remote repositories. By vetting these remote repositories, browsers could give web developers an incentive to use integrity hashes: if they add an integrity hash to a whitelisted library, the library does not have to be downloaded (and could even be stored as byte-code in the browser), so latency is reduced.
To remove the remaining privacy implications, the browsers could specify canonical URLs to retrieve the libraries (you already trust your browser not to track you, so this does not add another leak). Then those libraries would never be downloaded and you could not track users by what they already accessed.
Whether the shared cache should be used or not is to be controlled by the website using the resource and not by the server who first delivered the resource.
Reading this again after two years, I disagree. Whether a shared cache should be used should be controlled by the browser, not by the server that first delivered the resource and not by the website that requires it. The website could only make sure that the library is loaded from the site first, by providing its own version with its own integrity hash.
@ArneBab bundling-with-the-browser is the only privacy-preserving option I've seen, but it has its own share of issues, such as favoring incumbents, who gets to be bundled, etc.
@annevk while these issues exist, they are not much different from the question of which emerging standards a browser implements — but without the compatibility problems, because the cost of not being supported would just be somewhat longer load times on the first load.
However, you do not have to bundle: the requirement for privacy is either bundled-with-the-browser or downloaded-from-a-browser-defined-canonical-URL. To prevent tracking by correlating which libraries are requested from the canonical URLs or from the websites, you can, for each library, choose at random between the canonical URL and the website. That massively decreases the number of reliably detectable distinct combinations of libraries.
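A tiny sketch of that random source selection; the 50/50 split, the function names, and the existence of a browser-maintained canonical-URL list are all assumptions for illustration:

    // Hypothetical: the browser keeps a list of vetted libraries, keyed by
    // integrity hash, each with a browser-defined canonical URL.
    const canonicalUrls = new Map<string, string>();

    // For each library, randomly fetch from either the canonical URL or the
    // site-provided URL, so neither endpoint sees a stable combination of requests.
    function chooseFetchUrl(integrity: string, siteUrl: string): string {
      const canonical = canonicalUrls.get(integrity);
      if (canonical === undefined) return siteUrl;       // not vetted: normal fetch
      return Math.random() < 0.5 ? canonical : siteUrl;  // assumed 50/50 split
    }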
I don't think that works if a user of a given group visits only one (sensitive) site with a given library. I also suspect sites would not want their performance to be variable.
From a privacy perspective, could we make it so that the resource is loaded from each origin at least once (if for no other reason than to verify that the SRI hash is valid)? The browser could still then only cache one instance of it (and re-use whatever compilation cache etc. it deems relevant), but only store that information once (and, with various weightings etc., the file may persist in the cache for longer).
This removes some of the benefit that user agents could get from a "first load" perspective, but solves the privacy issue and keeps some of the other benefits.
As a side note, this could actually be implemented without the use of SRI hashes. If the browser links together identical files based on contents (eg stored against a hash), then it could perform this kind of optimisation irrespective of whether the website declares SRI hashes.
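A sketch of that dedup-after-per-origin-fetch idea; the structure is entirely hypothetical and real browser caches are far more involved:

    // Hypothetical: store the bytes (and any compiled artifacts) once per content
    // hash, but only let a partition (top-level site) reuse them after it has
    // fetched and verified the resource from its own origin at least once.
    const blobsByHash = new Map<string, Uint8Array>();
    const partitionsVerified = new Map<string, Set<string>>(); // hash -> partitions

    function onFetched(hash: string, partition: string, body: Uint8Array): void {
      if (!blobsByHash.has(hash)) blobsByHash.set(hash, body); // dedup storage
      let parts = partitionsVerified.get(hash);
      if (!parts) {
        parts = new Set<string>();
        partitionsVerified.set(hash, parts);
      }
      parts.add(partition);
    }

    function cachedFor(hash: string, partition: string): Uint8Array | undefined {
      // No cross-site "first load" win: each partition must have fetched it once.
      return partitionsVerified.get(hash)?.has(partition)
        ? blobsByHash.get(hash)
        : undefined;
    }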
@MatthewSteeples which benefits remain? If the browser only downloads but skips compilation, the privacy problems resurface via timing attacks.
@ArneBab while theoretically possible, we're talking about a one-shot attempt to time how long it took the browser to compile something. You couldn't do repeated measurements to benchmark the speed of the device, or know what else was happening at the same time, so I'm not sure how reliable the numbers would be unless you're targeting a significantly large JS file. Would the same be true for CSS files?
If it's still too much of a privacy risk, you could still have the battery benefit by just sleeping for as long as the compilation took last time.
@MatthewSteeples they could provide other files with intentional changes to benchmark the browser during the access, and sleeping can be detected, because it can speed up the other compiles.
So you don’t really win much in exchange for giving up the benefits of not accessing the site at all. For CSS files this is true, too. As an example you can take this page with minimal resources which shows significant parse-time in Firefox.
But it would be possible to provide real privacy with a browser-provided whitelist and canonical URLs. That keeps the benefit of already having the file locally most of the time.
So the core question is: if you download (and compile, because otherwise this is detectable), even though you have the file locally, which benefits remain? Are there benefits that remain?
A shared cache does definitely bring a lot of advantages (faster sites, less data usage for the user, less network usage for the ISPs, browsers could cache the compiled/interpreted files, etc).
From what I read in this thread, the main pushback is the privacy concern that a specific user could be tracked by checking whether they have a specific file cached or not, meaning that one can tell whether the user previously visited a site (or the same site) that included the same file.
The solutions I see for the privacy concerns:
I think that the shared cache is a lot better from a privacy point of view than including the resources from a 3rd party domain. So, although it allows some sort of tracking, it is still a step forward from just having all the websites linking to the same file on a CDN.
You're misreading. The main pushback is the security concern.
The privacy concern already exists for CDNs, and browsers are fighting it. Safari is doing it and Firefox will: resources (like CDN assets) will land in a per-top-level-website ("first party") cache, which will make the bandwidth and speed wins from a CDN void.
Safari calls it a "partitioned cache"; Firefox calls it First Party Isolation. https://github.com/whatwg/fetch/issues/904 has some standards-specific context.
I'm afraid this will never be.
@hillbrad wrote a great document outlining the privacy and security risks of shared caching: https://hillbrad.github.io/sri-addressable-caching/sri-addressable-caching.html
(@annevk asked me to unlock the conversation. I'm not too hopeful about seeing new information in this 5 year old thread.)
I don't think there is a fundamental problem here that makes this impossible?
For example, if browsers were to cache popular libraries such as React and Vue, then this wouldn't pose any problems, correct?
If we can find a technique ensuring that only popular library code is cached (instead of unique app code), then we solve the problem, right? (I'm assuming that Subresource Integrity Addressable Caching covers all known issues).
Could we maybe reopen this ticket? I'd argue that as long as we don't find a fundamental blocker, then having a cross-origin shared cache is still open for consideration.
The benefits would be huge... it seems very well worth it to further dig.
(I'm assuming that Subresource Integrity Addressable Caching covers all known issues).
The other comment just before my last one has a newish - imho fundamental - blocker. Browsers are already partitioning their cache per top-level site (eventually more granular, maybe per origin or per frame tree).
This issue just turned 7 years old. I'll leave this issue closed because nobody has managed to come up with an idea since.
New issues are cheap. I'm still happy to discuss new and specific proposals - I just currently do not believe those to exist.
We've had a lot of discussions about using SRI for shared caching (see https://lists.w3.org/Archives/Public/public-webappsec/2015May/0095.html for example). An explicit issue was filed at w3c/webappsec#504 suggesting a sharedcache attribute to imply that shared caching is OK. We should consider leveraging SRI for more aggressive caching.