Closed mnot closed 4 years ago
cc @mikewest @erickinnear
@andypaicu and @mattmenke have more informed opinions than I do at this point. :)
I think you want @MattMenke2.
Thanks, Anne! MattMenke is also me, just my personal account (Which I prefer not to use for work. Lawyers scare me). So do please do use @MattMenke2.
Chrome's lists can be found at https://cs.chromium.org/chromium/src/net/http/http_response_headers.cc?gsn=PERSIST_RAW&g=0&l=47. Chrome has two lists: Those it shouldn't updated when using cached response after receiving a 304 (kNonUpdateHeaders + kNonUpdatedHeaderPrefixes), and those it doesn't put in the cache at all (kHopByHopResponseHeaders+ kChallengeResponseHeaders+ kCookieResponseHeaders + kSecurityStateHeaders + "Content-Range" + headers explicitly indicated by a no-cache header)
Chrome's no-update list was recently modified to match FireFox's, with a few extra rules. In particular, Chrome's list includes "www-authenticate", "x-frame-options", "x-xss-protection", "x-content-", and "x-webkit-".
"www-authenticate" definitely makes sense in the list (Well, I assume no browser caches proxy challenge responses, anyways, but it makes sense if the proxy challenge header does).
It seems to me like "x-frame-options" shouldn't be in that list, as it could reasonably be generated based on the requesting origin, rather than give a list of all origins (Both to keep down response sizes, and to avoid exposing other URLs)
"x-xss-protection" doesn't have that problem, but seems strange to single it out as not updated, unless there's a good reason to treat it differently.
I'm not sufficiently familiar with the "x-content-" and "x-webkit-" prefixes to comment usefully on them.
Also, it seems weird to have "Proxy-Authorization" on the list, as it's a response header. Though it we want to keep it for legacy reasons, we should probably include "Authorization", too.
Re: Proxy-Authorization and Authorization: my inclination would be to not specify request headers in the list, as they don't have any function in responses (specified or practical), and so it would be strange to specify them, and especially strange to specify some but not others.
If implementations felt like they wanted to still omit them because they don't remember why they were omitted and were wary of changing, that's fine, but I'd point out that the intermediary caches that do update stored headers don't do this.
I'll update my PR in WPT as well as the CDN tests to include more headers, so we can capture exactly what's current practice here.
Just to clarify, it looks to me FireFox does currently omit Proxy-Authorization, which is where Chrome apparently got its behavior from. See line 901 of https://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpResponseHead.cpp#879. Or did you mean that FireFox will definitely remove that line?
@MattMenke2 I edited :)
It'd be great to understand why they do so. @mcmanus any idea?
don't know.. its older than the 2007 version control cutover - so you really have to badly want the answer to go looking.
Im going to guess that they were enumerating hop by hop headers and felt Proxy* should always fall into that bucket.
Is there general agreement in this issue to add two lists to draft-ietf-httpbis-cache:
304
or HEAD
responses?I do see that the contents of those lists don't yet necessarily have consensus.
I think it's one unified list. Where do you think it'd be different?
https://github.com/httpwg/http-core/issues/165#issuecomment-436287105 indicates that the set of headers that Chrome avoids caching is a different set from the one it avoids updating. (e.g. Chrome caches ETag
but doesn't update it.) Thus, a need to represent two sets in the specification.
Updating Content-Encoding or Content-Location also seems pretty weird, though both should be cached.
I can see a case for Proxy-Authenticate
, since that's often going to be dependent upon the connection properties. I don't see it for an already cacheable WWW-Authenticate
.
Are there any other headers that you think should be considered (whether or not we have separate lists)?
The common case for these ones that appear weird is a server that is trying to correct a previous mistake; e.g., it sent the wrong header, and wants to update the cached entry.
It would be helpful if someone could put down a tentative proposal to compare with implementations and headers with interesting properties. (E.g., it's not clear to me from the above discussion if not being able to update Content-Type
and Content-Length
, while being able to cache them is seen as good by all, or not.)
I think we can do it in groups -
Transfer-Coding
, Connection
, Keep-Alive
, Proxy-Authenticate
*, and Upgrade
are all hop-by-hop headers; i.e., they should be listed in Connection
(except Connection
itself). They should never be stored or updated.
Other headers listed in Connection
(for h1) should also never be stored or updated.
TE
, Proxy-Authorization
and maybe a couple of other headers are request hop-by-hop headers; they shouldn't be seen in responses. However, if they are sent in a response, they don't do anything, so I think they're harmless to leave.
ETag
is used to select a response when If-None-Match
is in use, so by definition it can't change on the response (there won't be anything to update if it does; see the rules). If an ETag
comes back in response to a request that only used If-Modified-Since
, I'd say the server was updating the ETag
-- but it's worth exploring / talking about if people want to cover that case. See also #132 for a possible complication.
For Content-*
my understanding is that this was exempted by developers who misread the language in 2616 about not including entity headers in 304s, and somehow took away that if they are present, they shouldn't be updated. That's not what it says, wasn't the intent, and as discussed, can cause horrible problems (such as discussed WRT Content-Security-Policy
).
Specifically, Content-Type
can be updated when people do things like get the encoding wrong using Apache's .htaccess
file.
Content-Length
is used to delimit messages in HTTP/1, and so causes people to get scared when it's messed with, for understandable reasons. There's also used to be some pretty pathological errors from a few implementations that would send Content-Length: 0
on a 304
response (because it doesn't have a body, right?). If an implementation wants to refuse to update Content-Length
due to security concerns, I'd understand that -- but if it's coming in H2 or greater, it might just mean that the server is trying to correct an erroneous C-L sent before that. shrug
Any other headers of interest? I'm reluctant to start opting headers out of update just because "it seems weird", since that's an open-ended set, and unlike with browsers, we can't expect intermediary caches to be updated every time somebody comes up with a new header.
Proxy-Authenticate
is defined as hop-by-hop, so no case need be made.Note that having to respect Content-Type
updates has implications, such as always having to store the body rather than an optimized structure. And changing encodings isn't without security issues either. (A browser will always have to deal with that anyway I suppose, as one can use fetch()
to get the bytes from the cache.)
Content-Encoding seems problematic - want to get passed a malware scanner of some type? Service a resource with the wrong Content-Encoding, possibly get it in the browser cache (Which is generally at a lower layer than decompression), though the resource fails to load. Then re-request the resource with a 304 that changes the Content-Encoding. Admittedly, service workers and the like are perfectly capable of using their own methods to bypass filters.
Also I don't think "Some browsers may want to ignore updates to this header" is really a great option, going forward - I think we want to be consistent here, unless there's a pretty compelling reason not to.
I understand, but that's not a HTTP cache -- you're inserting non-standard functions on top of a HTTP cache for implementation convenience. Restricting the behaviour of all HTTP caches (including intermediaries and server-side caches) because some browser caches might be doing extra things like this doesn't make sense.
Even for those browsers, there are other strategies that they might take -- e.g., invalidating a cache entry that's an optimised structure if the content-type changes. Likewise, a browser that stores decoded responses needs to be aware of the semantics of content-encoding anyway, and take appropriate steps to behave correctly.
I'm not following...What "extra things" are you talking about?
Content-Encoding
automagically; C-E is a property of the contentI.e., storing and updating these headers works just fine until you start doing non-specified things with the HTTP cache. If you do such things, it's your responsibility to make sure that it doesn't interfere with the proper operation of the cache.
Re-defining caching to suit current browser behaviour might work if browser caches were the only HTTP caches in existence, but they're not.
Those two points seem to contradict each other - only one can be correct. If Content-Encoding is a property of the content, and not merely the transfer of the content, then caches should store that content, as opposed to a mutated version of it, no?
C-E being a property of content is a fiction - that may be what the spec says, but that's not the reality. You could not make a functional browser that behaved otherwise. I thought one of the purposes of the fetch spec was to reconcile RFCs with the way things actually work, or can reasonably be modified to work.
HTTP caches do store that content -- at least in intermediaries. If browser caches store the decompressed content, they need to account for the impact of doing so, as it's not specified behaviour.
In the case you described, if a browser stores something with an unrecognised content-coding thereby evading a filter, and then a 304 updates the content-coding, there are a number of valid strategies that could be taken to assure correct behaviour by the filter:
Why the changing the behaviour of all caches (including all that are currently deployed) to accommodate this non-standard behaviour the most reasonable path, especially when there are a number of ways they can correctly behave within the constrains of the current spec?
I could see an argument that HTTP needs to accommodate caches that store responses without content-encodings, or provide advice for those that do. Would that be helpful?
I'm working on a patch to add these lists, with notes that the contents aren't final. I'll include the "cached but not updated" list, probably saying that caches MAY ignore updates to headers in that list. As Mark convinces the browsers to update various headers on that list, or the browsers convince the CDNs not to, we can tighten the specification, but MAY seems to capture the status quo.
mnot: I think we're speaking past each other. I said, from the start, that Chrome stored the still compressed body. I assume this is what most things do. I pointed out that changing content-encoding is a potential bypass for malware scanners used by enterprises, which often are in the form of SSL-decrypting/re-encrypting proxies (And have plenty of other issues), since they don't have access to browser caches. You could send a response with content-encoding: identity, and then change its content-heading in a followup 304 response, to bypass such logic.
More generally, changing Content-Encoding, or Content-Length, for that matter, seems like a fundamentally weird and unexpected thing to do, with no real use case - if you're sending the wrong Content-Encoding, you can just update the Etag, it seems so unlikely to not be worth worrying about. Generally, from a security and cross-browser compatibility standpoint, browsers should be fairly restricted in what they allow. If it were up to me, we'd just hard fail if we receive those two headers, and they don't match cached values, though even with that behavior, there's the issue of what "match" means (i.e., "Content-Length: foo" vs "Content-Length: foo ").
Stepping back, I gathered some data; see: https://cache-tests.fyi/#update304 (as the disclaimer at the top says, this is still very much a work in progress)
Based upon those results so far, and discounting implementations which don't update headers uniformly (a separate kind of bug), a few observations:
The Content-
and X-Content-
exceptions (including Content-Security-Policy
) are unique to Chrome and Safari. This supports the notion that this was a simple spec misinterpretation pre-Blink fork. many other implementations' support of these headers' update indicates it's safe to do so for caches on the Web.
Content-Type
, Content-MD5
and Content-Location
aren't updated by any browser (and I believe that Apache is failing for similar reasons, but haven't had a chance to dig in yet), but are by other caches. I'd like to understand more why these are special; I understand the reasoning behind Content-Type
, but AFAIK browsers don't use either Content-MD5
or Content-Location
to do anything meaningful.
X-Frame-Options
and X-XSS-Protection
aren't updated by any browser except Firefox, but are by other caches. Same question as above; what's the reasoning here? Is it similar to that for Content-Type
, where some state beyond a bag-of-bits is being stored in the cache?
Public-Key-Pins
and Clear-Site-Data
are both Chrome-specific.
Set-Cookie
deserves some discussion; it looks like it's pretty widely exempted from update (with only Squid doing so). Still, the fact that a widespread implementation does do it seems to indicate it works reasonably well on the Web...
If I remember correctly @ddragana mentioned Content-Range
to me as another potentially problematic header.
Content-Range
makes sense as an exception (partial content is so special).
I haven't seen Set-Cookie mentioned here, while I noticed it cause some awful cross-session leaks a decade ago. We even implemented a specific blocking option in haproxy to prevent it from being delivered in otherwise cacheable responses.
Also one way to think about all of this is to consider a cache accessible both over H1 and H2+, itself fed from both H1 and H2+ origins. I don't see anything particular resulting from this aside the possible lack of content-length that may require chunks to be used, but others might have some ideas.
Eventhough you already suggested to remove any header listed in "Connection", we should probably explicitly suggest to remove the "Keep-alive" header field which is still quite present in a number of responses, because I predict the list will most often be hard-coded anyway and most implementations won't parse the Connection header to figure this list ;-)
Hey Willy - see the end of my comment above regarding Set-Cookie
.
What I find really interesting is that many caches appear to store and reuse a response with Set-Cookie
, but the linked results above show that they exempt them from updating a Set-Cookie
from a 304, which is odd.
If they were aligned so that it weren't cached or updated, I could see that; I could also see saying that a response with Set-Cookie
shouldn't be subjected to heuristic freshness.
I'll create some more tests to probe here; I suspect it's a separate issue, though.
Oops it's amazing that I managed to miss it because I've read all your list (well, apparently except the last bullet). I agree that it should be a separate issue if various caches exhibit a different behaviour. From what I've seen in field, applications which have been hit take extreme care to mark responses not cacheable when they have a set-cookie. But intermediaries on the path may add this set-cookie header without considering the impacts (it's how I met the issue the first time, when you stack a cache, a load balancer and a farm and you see unfair loads, chances are that the LB's set-cookie gets cached).
ietf104: discussed jeffey's PR which discusses what headers should be pruned, or handled specially. #181 .. mnot asked for a discussion of what headers should be on the list and notes that set-cookie(2) are generally stored on disk but we need exploration of why
iet104: concern about putting things on the list without a firm reason why. General agreement that connection based headers should be on the list for obvious reasons. Caches should agree on consistent behavior.
I think Jeffrey's two-part proposal is roughly on the right lines, but needs some tweaks.
The "non-cached headers" list needs to be slightly broader -- it's really "headers that are not cached or forwarded," and should contain just:
Connection
Upgrade
Connection
Keep-Alive
Proxy-Connection
TE
Trailers
This is effectively re-introducing hop-by-hop headers. We dropped them in 723x in favour of just relying on Connection
, but that doesn't seem to be working out; most implementations are working from a hardcoded list of header field names, probably because in the real world, some senders don't bother to list them in Connection
.
If we re-introduce them, we can then explicitly say that hop-by-hop headers are not cached (including any listed in Connection
for h1), as 2616 did.
We probably need to do an audit about how we talk about Connection
in our header definitions.
Then, we need to talk about how updates are affected by headers. Rather than listing ones that can be exempted, I'd like to propose we say something like this:
Due to their semantics, updating some header fields can result in the cache's state becoming inconsistent or invalid, depending on how it is implemented. For example, updating
Content-Location
might make a cached response incorrect. Or, updatingContent-Range
might be unrealistic after partial responses have been combined.Likewise, changing the value of a header field might have external effects that the cache cannot account for. For example, if a user-agent cache stores a pre-rendered artefact instead of the raw bytes of a HTTP response, changing the
Content-Type
post facto is not possible (and possibly dangerous).In these situations, a cache SHOULD NOT simply ignore the updated header; doing so could make the response inconsistent. Rather, it SHOULD NOT update the cache with the response at all, satisfying any outstanding request with the new response (in the case of a HEAD request), or in the case of a conditional request, either a stale stored response, or a new response obtained by issuing a non-conditional request.
Thoughts?
So effectively, the presence of a non-updatable header poisons the entire 304 response, and the advice is to try again with a non-conditional request or pretend you weren't able to check validity. I like that this makes the protocol cleaner and that it provides a disincentive to servers to play header-change games. However, we can't ignore the fact that there are existing servers that do this, and this will be a performance penalty for them if browsers adopt this harder stance. I question whether real browsers would be willing to ignore a 304 in that circumstance.
Any browser owners that might care to comment?
That sounds like a pretty drastic change from the status quo that I'm not sure is really worth the amount of effort involved. Especially as it leaves much of the design challenges up to the implementers and will eventually result in implementers having to reverse engineer their respective setups to get consistent behavior across sites.
The alternative is to replace the third paragraph with something like:
In these limited situations, a cache MAY omit the headers listed below from updates. Servers SHOULD NOT send updated values for these headers in a 304 response if their new values are critical to interpret the cached response; rather, they SHOULD generate a full 2xx response.
However, I'm really only comfortable with this approach if the list is limited and relatively static. There's a lot of cargo culting apparent in the current implementation behaviours, and I see no reason to accommodate them.
Specifically:
Content-Type
, Content-MD5
, Content-Encoding
, Content-Length
, Content-Range
and maybe Content-Location
make sense to list, for reasons discussed.Content-
and X-Content-
don't make sense, and specifying them is going to create perverse incentives. Browsers should change their behaviours here.Set-Cookie
and Set-Cookie2
from updates; no proxy cache does it.Sorry for the late contribution to this issue. I ran into a related ambiguity recently and thought it was relevant:
IIUC, the combination would mean that if the original response had a Cache-Control: max-age=
, then any subsequent 304s (which must have Date
) would extend the lifetime of the stored response, potentially indefinitely if 304s arrive every max_age-1.
Apologies if I misread something; I should really run some tests.
(edit to add: Either way, the decision on request time should probably be specified explicitly.)
Discussed in Sinapore; I will create a conservative PR as a basis of further discussion.
@agrover pointed to the gecko code that lists: Connection, Proxy-Connection, Keep-Alive, WWW-Authenticate, Proxy-Authenticate, Trailer, Transfer-Encoding, Upgrade, Set-Cookie.
I couldn't see anything that strips Content-Type
, or even Content-<anything>
, which might be an issue, but Andy might be able to provide more info.
OK so the previous list was headers Gecko doesn't cache. This is a list of headers Gecko doesn't update on a 304 (or 206 partial content). Connection Proxy_Connection Keep_Alive Proxy_Authenticate Proxy_Authorization TE Trailer Transfer_Encoding Upgrade Content_Location Content_MD5 ETag Content_Encoding Content_Range Content_Type Content_Length
So there are those Content-
headers you were looking for. Sorry for the confusion.
See draft PR #337.
This covers all of the connection headers (both listed in Connection
and also well-known ones that might not appear there), and the enumerated Content-
headers (NOT the prefixes).
It includes Proxy-Authenticate
and Proxy-Authorization
with the theory that they are specific to a connection; we have an outstanding issue #331 to discuss that.
It also includes ETag
; I want to do a bit of side research on ETag
and updates, and may add some more text there or open a separate issue, but I think it's worth including at this point.
It does not include Content-Location
. I'd like to understand the rationale for it not being updatable, considering that intermediary caches (Squid, ATS, and Varnish) are widely updating it with no perceivable impact on interop or security.
It does not yet include Cookie
, because a cacheable response with a Cookie should really be updatable, if the cache bothered to store it. Looking at my tests, it appears that the only cache that stores and updates cookies is httpd. I suppose we can carve out an exception for Cookie -- it's exceptional in so many other ways...
Thoughts?
The problem with Cookie headers isn't replacing the cached ones, but with using the cached Cookie headers in the first place - we don't want to reuse Cookie headers from old responses, setting the old Cookies once again.
If that's the goal, it seems like it'd be better to do it in the cookie spec -- i.e., specify that cookies should only be used when the response is what we used to call "first hand" -- obtained directly from the origin server, without an Age
, rather than from cache. @mikewest @johnwilander ?
If that's the goal, it seems like it'd be better to do it in the cookie spec
This is currently taken care of by Fetch, which hooks into the cookie spec at 11.4 of https://fetch.spec.whatwg.org/#http-network-fetch (which happens after consulting the cache and missing). The integration could be much cleaner, but it does what you're asking for. What would you like added to the cookie spec to make that more transparent?
So sending cookies hooks in above the cache layer, and setting them hooks in above it, in the spec? That seems unfortunate.
7234 requires stored headers to be updated upon a
304
orHEAD
response.However, there are a number of issues:
x-content-*
,content-*
andx-webkit-*
- see Chrome and WebkitIt seems like a lot of the bugs referenced above originated from the confusing language around omitting entity headers in RFC2616; that was removed in RFC7234 (see current language).
I think we need to:
Mozilla's list of headers they don't update might be a good starting point for the first two (see also related bug).
There are also browser tests and an outstanding PR.
Note that this is a security issue, because some browsers filter out headers starting with
Content-
, which includesContent-Security-Policy
-- i.e., a browser that has an old copy of the response won't see the new CSP header on a 304 response.(this might also apply to 206 responses, since there's header combination there too)