httpwg / http-core

Core HTTP Specifications
https://httpwg.org/http-core/
467 stars 43 forks source link

Updating stored headers #165

Closed mnot closed 4 years ago

mnot commented 6 years ago

7234 requires stored headers to be updated upon a 304 or HEAD response.

However, there are a number of issues:

It seems like a lot of the bugs referenced above originated from the confusing language around omitting entity headers in RFC2616; that was removed in RFC7234 (see current language).

I think we need to:

Mozilla's list of headers they don't update might be a good starting point for the first two (see also related bug).

There are also browser tests and an outstanding PR.

Note that this is a security issue, because some browsers filter out headers starting with Content-, which includes Content-Security-Policy -- i.e., a browser that has an old copy of the response won't see the new CSP header on a 304 response.

(this might also apply to 206 responses, since there's header combination there too)

mnot commented 6 years ago

cc @mikewest @erickinnear

mikewest commented 6 years ago

@andypaicu and @mattmenke have more informed opinions than I do at this point. :)

annevk commented 6 years ago

I think you want @MattMenke2.

MattMenke2 commented 6 years ago

Thanks, Anne! MattMenke is also me, just my personal account (Which I prefer not to use for work. Lawyers scare me). So do please do use @MattMenke2.

MattMenke2 commented 6 years ago

Chrome's lists can be found at https://cs.chromium.org/chromium/src/net/http/http_response_headers.cc?gsn=PERSIST_RAW&g=0&l=47. Chrome has two lists: Those it shouldn't updated when using cached response after receiving a 304 (kNonUpdateHeaders + kNonUpdatedHeaderPrefixes), and those it doesn't put in the cache at all (kHopByHopResponseHeaders+ kChallengeResponseHeaders+ kCookieResponseHeaders + kSecurityStateHeaders + "Content-Range" + headers explicitly indicated by a no-cache header)

Chrome's no-update list was recently modified to match FireFox's, with a few extra rules. In particular, Chrome's list includes "www-authenticate", "x-frame-options", "x-xss-protection", "x-content-", and "x-webkit-".

"www-authenticate" definitely makes sense in the list (Well, I assume no browser caches proxy challenge responses, anyways, but it makes sense if the proxy challenge header does).

It seems to me like "x-frame-options" shouldn't be in that list, as it could reasonably be generated based on the requesting origin, rather than give a list of all origins (Both to keep down response sizes, and to avoid exposing other URLs)

"x-xss-protection" doesn't have that problem, but seems strange to single it out as not updated, unless there's a good reason to treat it differently.

I'm not sufficiently familiar with the "x-content-" and "x-webkit-" prefixes to comment usefully on them.

MattMenke2 commented 6 years ago

Also, it seems weird to have "Proxy-Authorization" on the list, as it's a response header. Though it we want to keep it for legacy reasons, we should probably include "Authorization", too.

mnot commented 6 years ago

Re: Proxy-Authorization and Authorization: my inclination would be to not specify request headers in the list, as they don't have any function in responses (specified or practical), and so it would be strange to specify them, and especially strange to specify some but not others.

If implementations felt like they wanted to still omit them because they don't remember why they were omitted and were wary of changing, that's fine, but I'd point out that the intermediary caches that do update stored headers don't do this.

mnot commented 6 years ago

I'll update my PR in WPT as well as the CDN tests to include more headers, so we can capture exactly what's current practice here.

MattMenke2 commented 6 years ago

Just to clarify, it looks to me FireFox does currently omit Proxy-Authorization, which is where Chrome apparently got its behavior from. See line 901 of https://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpResponseHead.cpp#879. Or did you mean that FireFox will definitely remove that line?

mnot commented 6 years ago

@MattMenke2 I edited :)

It'd be great to understand why they do so. @mcmanus any idea?

mcmanus commented 6 years ago

don't know.. its older than the 2007 version control cutover - so you really have to badly want the answer to go looking.

Im going to guess that they were enumerating hop by hop headers and felt Proxy* should always fall into that bucket.

jyasskin commented 5 years ago

Is there general agreement in this issue to add two lists to draft-ietf-httpbis-cache:

  1. a list of headers to avoid caching at all, and
  2. a list to avoid updating on 304 or HEAD responses?

I do see that the contents of those lists don't yet necessarily have consensus.

mnot commented 5 years ago

I think it's one unified list. Where do you think it'd be different?

jyasskin commented 5 years ago

https://github.com/httpwg/http-core/issues/165#issuecomment-436287105 indicates that the set of headers that Chrome avoids caching is a different set from the one it avoids updating. (e.g. Chrome caches ETag but doesn't update it.) Thus, a need to represent two sets in the specification.

MattMenke2 commented 5 years ago

Updating Content-Encoding or Content-Location also seems pretty weird, though both should be cached.

mnot commented 5 years ago

I can see a case for Proxy-Authenticate, since that's often going to be dependent upon the connection properties. I don't see it for an already cacheable WWW-Authenticate.

Are there any other headers that you think should be considered (whether or not we have separate lists)?

The common case for these ones that appear weird is a server that is trying to correct a previous mistake; e.g., it sent the wrong header, and wants to update the cached entry.

annevk commented 5 years ago

It would be helpful if someone could put down a tentative proposal to compare with implementations and headers with interesting properties. (E.g., it's not clear to me from the above discussion if not being able to update Content-Type and Content-Length, while being able to cache them is seen as good by all, or not.)

mnot commented 5 years ago

I think we can do it in groups -

Any other headers of interest? I'm reluctant to start opting headers out of update just because "it seems weird", since that's an open-ended set, and unlike with browsers, we can't expect intermediary caches to be updated every time somebody comes up with a new header.

annevk commented 5 years ago

Note that having to respect Content-Type updates has implications, such as always having to store the body rather than an optimized structure. And changing encodings isn't without security issues either. (A browser will always have to deal with that anyway I suppose, as one can use fetch() to get the bytes from the cache.)

MattMenke2 commented 5 years ago

Content-Encoding seems problematic - want to get passed a malware scanner of some type? Service a resource with the wrong Content-Encoding, possibly get it in the browser cache (Which is generally at a lower layer than decompression), though the resource fails to load. Then re-request the resource with a 304 that changes the Content-Encoding. Admittedly, service workers and the like are perfectly capable of using their own methods to bypass filters.

MattMenke2 commented 5 years ago

Also I don't think "Some browsers may want to ignore updates to this header" is really a great option, going forward - I think we want to be consistent here, unless there's a pretty compelling reason not to.

mnot commented 5 years ago

I understand, but that's not a HTTP cache -- you're inserting non-standard functions on top of a HTTP cache for implementation convenience. Restricting the behaviour of all HTTP caches (including intermediaries and server-side caches) because some browser caches might be doing extra things like this doesn't make sense.

Even for those browsers, there are other strategies that they might take -- e.g., invalidating a cache entry that's an optimised structure if the content-type changes. Likewise, a browser that stores decoded responses needs to be aware of the semantics of content-encoding anyway, and take appropriate steps to behave correctly.

MattMenke2 commented 5 years ago

I'm not following...What "extra things" are you talking about?

mnot commented 5 years ago

I.e., storing and updating these headers works just fine until you start doing non-specified things with the HTTP cache. If you do such things, it's your responsibility to make sure that it doesn't interfere with the proper operation of the cache.

Re-defining caching to suit current browser behaviour might work if browser caches were the only HTTP caches in existence, but they're not.

MattMenke2 commented 5 years ago

Those two points seem to contradict each other - only one can be correct. If Content-Encoding is a property of the content, and not merely the transfer of the content, then caches should store that content, as opposed to a mutated version of it, no?

C-E being a property of content is a fiction - that may be what the spec says, but that's not the reality. You could not make a functional browser that behaved otherwise. I thought one of the purposes of the fetch spec was to reconcile RFCs with the way things actually work, or can reasonably be modified to work.

mnot commented 5 years ago

HTTP caches do store that content -- at least in intermediaries. If browser caches store the decompressed content, they need to account for the impact of doing so, as it's not specified behaviour.

In the case you described, if a browser stores something with an unrecognised content-coding thereby evading a filter, and then a 304 updates the content-coding, there are a number of valid strategies that could be taken to assure correct behaviour by the filter:

Why the changing the behaviour of all caches (including all that are currently deployed) to accommodate this non-standard behaviour the most reasonable path, especially when there are a number of ways they can correctly behave within the constrains of the current spec?

I could see an argument that HTTP needs to accommodate caches that store responses without content-encodings, or provide advice for those that do. Would that be helpful?

jyasskin commented 5 years ago

I'm working on a patch to add these lists, with notes that the contents aren't final. I'll include the "cached but not updated" list, probably saying that caches MAY ignore updates to headers in that list. As Mark convinces the browsers to update various headers on that list, or the browsers convince the CDNs not to, we can tighten the specification, but MAY seems to capture the status quo.

MattMenke2 commented 5 years ago

mnot: I think we're speaking past each other. I said, from the start, that Chrome stored the still compressed body. I assume this is what most things do. I pointed out that changing content-encoding is a potential bypass for malware scanners used by enterprises, which often are in the form of SSL-decrypting/re-encrypting proxies (And have plenty of other issues), since they don't have access to browser caches. You could send a response with content-encoding: identity, and then change its content-heading in a followup 304 response, to bypass such logic.

More generally, changing Content-Encoding, or Content-Length, for that matter, seems like a fundamentally weird and unexpected thing to do, with no real use case - if you're sending the wrong Content-Encoding, you can just update the Etag, it seems so unlikely to not be worth worrying about. Generally, from a security and cross-browser compatibility standpoint, browsers should be fairly restricted in what they allow. If it were up to me, we'd just hard fail if we receive those two headers, and they don't match cached values, though even with that behavior, there's the issue of what "match" means (i.e., "Content-Length: foo" vs "Content-Length: foo ").

jyasskin commented 5 years ago

181 creates the two lists, with notes that the contents aren't final. @mnot, if you only agree that we need a list of uncached headers, and aren't willing to merge an initial list of non-updated headers, I'm willing to remove the other half of the change for now.

mnot commented 5 years ago

Stepping back, I gathered some data; see: https://cache-tests.fyi/#update304 (as the disclaimer at the top says, this is still very much a work in progress)

Based upon those results so far, and discounting implementations which don't update headers uniformly (a separate kind of bug), a few observations:

annevk commented 5 years ago

If I remember correctly @ddragana mentioned Content-Range to me as another potentially problematic header.

mnot commented 5 years ago

Content-Range makes sense as an exception (partial content is so special).

wtarreau commented 5 years ago

I haven't seen Set-Cookie mentioned here, while I noticed it cause some awful cross-session leaks a decade ago. We even implemented a specific blocking option in haproxy to prevent it from being delivered in otherwise cacheable responses.

Also one way to think about all of this is to consider a cache accessible both over H1 and H2+, itself fed from both H1 and H2+ origins. I don't see anything particular resulting from this aside the possible lack of content-length that may require chunks to be used, but others might have some ideas.

Eventhough you already suggested to remove any header listed in "Connection", we should probably explicitly suggest to remove the "Keep-alive" header field which is still quite present in a number of responses, because I predict the list will most often be hard-coded anyway and most implementations won't parse the Connection header to figure this list ;-)

mnot commented 5 years ago

Hey Willy - see the end of my comment above regarding Set-Cookie.

What I find really interesting is that many caches appear to store and reuse a response with Set-Cookie, but the linked results above show that they exempt them from updating a Set-Cookie from a 304, which is odd.

If they were aligned so that it weren't cached or updated, I could see that; I could also see saying that a response with Set-Cookie shouldn't be subjected to heuristic freshness.

I'll create some more tests to probe here; I suspect it's a separate issue, though.

wtarreau commented 5 years ago

Oops it's amazing that I managed to miss it because I've read all your list (well, apparently except the last bullet). I agree that it should be a separate issue if various caches exhibit a different behaviour. From what I've seen in field, applications which have been hit take extreme care to mark responses not cacheable when they have a set-cookie. But intermediaries on the path may add this set-cookie header without considering the impacts (it's how I met the issue the first time, when you stack a cache, a load balancer and a farm and you see unfair loads, chances are that the LB's set-cookie gets cached).

mcmanus commented 5 years ago

ietf104: discussed jeffey's PR which discusses what headers should be pruned, or handled specially. #181 .. mnot asked for a discussion of what headers should be on the list and notes that set-cookie(2) are generally stored on disk but we need exploration of why

mcmanus commented 5 years ago

iet104: concern about putting things on the list without a firm reason why. General agreement that connection based headers should be on the list for obvious reasons. Caches should agree on consistent behavior.

mnot commented 5 years ago

I think Jeffrey's two-part proposal is roughly on the right lines, but needs some tweaks.

The "non-cached headers" list needs to be slightly broader -- it's really "headers that are not cached or forwarded," and should contain just:

  1. Anything listed in Connection
  2. Upgrade
  3. Connection
  4. Keep-Alive
  5. Proxy-Connection
  6. TE
  7. Trailers

This is effectively re-introducing hop-by-hop headers. We dropped them in 723x in favour of just relying on Connection, but that doesn't seem to be working out; most implementations are working from a hardcoded list of header field names, probably because in the real world, some senders don't bother to list them in Connection.

If we re-introduce them, we can then explicitly say that hop-by-hop headers are not cached (including any listed in Connection for h1), as 2616 did.

We probably need to do an audit about how we talk about Connection in our header definitions.

Then, we need to talk about how updates are affected by headers. Rather than listing ones that can be exempted, I'd like to propose we say something like this:

Due to their semantics, updating some header fields can result in the cache's state becoming inconsistent or invalid, depending on how it is implemented. For example, updating Content-Location might make a cached response incorrect. Or, updating Content-Range might be unrealistic after partial responses have been combined.

Likewise, changing the value of a header field might have external effects that the cache cannot account for. For example, if a user-agent cache stores a pre-rendered artefact instead of the raw bytes of a HTTP response, changing the Content-Type post facto is not possible (and possibly dangerous).

In these situations, a cache SHOULD NOT simply ignore the updated header; doing so could make the response inconsistent. Rather, it SHOULD NOT update the cache with the response at all, satisfying any outstanding request with the new response (in the case of a HEAD request), or in the case of a conditional request, either a stale stored response, or a new response obtained by issuing a non-conditional request.

Thoughts?

MikeBishop commented 5 years ago

So effectively, the presence of a non-updatable header poisons the entire 304 response, and the advice is to try again with a non-conditional request or pretend you weren't able to check validity. I like that this makes the protocol cleaner and that it provides a disincentive to servers to play header-change games. However, we can't ignore the fact that there are existing servers that do this, and this will be a performance penalty for them if browsers adopt this harder stance. I question whether real browsers would be willing to ignore a 304 in that circumstance.

Any browser owners that might care to comment?

annevk commented 5 years ago

That sounds like a pretty drastic change from the status quo that I'm not sure is really worth the amount of effort involved. Especially as it leaves much of the design challenges up to the implementers and will eventually result in implementers having to reverse engineer their respective setups to get consistent behavior across sites.

mnot commented 5 years ago

The alternative is to replace the third paragraph with something like:

In these limited situations, a cache MAY omit the headers listed below from updates. Servers SHOULD NOT send updated values for these headers in a 304 response if their new values are critical to interpret the cached response; rather, they SHOULD generate a full 2xx response.

However, I'm really only comfortable with this approach if the list is limited and relatively static. There's a lot of cargo culting apparent in the current implementation behaviours, and I see no reason to accommodate them.

Specifically:

twifkak commented 4 years ago

Sorry for the late contribution to this issue. I ran into a related ambiguity recently and thought it was relevant:

IIUC, the combination would mean that if the original response had a Cache-Control: max-age=, then any subsequent 304s (which must have Date) would extend the lifetime of the stored response, potentially indefinitely if 304s arrive every max_age-1.

Apologies if I misread something; I should really run some tests.

(edit to add: Either way, the decision on request time should probably be specified explicitly.)

mnot commented 4 years ago

Discussed in Sinapore; I will create a conservative PR as a basis of further discussion.

martinthomson commented 4 years ago

@agrover pointed to the gecko code that lists: Connection, Proxy-Connection, Keep-Alive, WWW-Authenticate, Proxy-Authenticate, Trailer, Transfer-Encoding, Upgrade, Set-Cookie.

I couldn't see anything that strips Content-Type, or even Content-<anything>, which might be an issue, but Andy might be able to provide more info.

agrover commented 4 years ago

OK so the previous list was headers Gecko doesn't cache. This is a list of headers Gecko doesn't update on a 304 (or 206 partial content). Connection Proxy_Connection Keep_Alive Proxy_Authenticate Proxy_Authorization TE Trailer Transfer_Encoding Upgrade Content_Location Content_MD5 ETag Content_Encoding Content_Range Content_Type Content_Length

So there are those Content- headers you were looking for. Sorry for the confusion.

mnot commented 4 years ago

See draft PR #337.

This covers all of the connection headers (both listed in Connection and also well-known ones that might not appear there), and the enumerated Content- headers (NOT the prefixes).

It includes Proxy-Authenticate and Proxy-Authorization with the theory that they are specific to a connection; we have an outstanding issue #331 to discuss that.

It also includes ETag; I want to do a bit of side research on ETag and updates, and may add some more text there or open a separate issue, but I think it's worth including at this point.

It does not include Content-Location. I'd like to understand the rationale for it not being updatable, considering that intermediary caches (Squid, ATS, and Varnish) are widely updating it with no perceivable impact on interop or security.

It does not yet include Cookie, because a cacheable response with a Cookie should really be updatable, if the cache bothered to store it. Looking at my tests, it appears that the only cache that stores and updates cookies is httpd. I suppose we can carve out an exception for Cookie -- it's exceptional in so many other ways...

Thoughts?

MattMenke2 commented 4 years ago

The problem with Cookie headers isn't replacing the cached ones, but with using the cached Cookie headers in the first place - we don't want to reuse Cookie headers from old responses, setting the old Cookies once again.

mnot commented 4 years ago

If that's the goal, it seems like it'd be better to do it in the cookie spec -- i.e., specify that cookies should only be used when the response is what we used to call "first hand" -- obtained directly from the origin server, without an Age, rather than from cache. @mikewest @johnwilander ?

mikewest commented 4 years ago

If that's the goal, it seems like it'd be better to do it in the cookie spec

This is currently taken care of by Fetch, which hooks into the cookie spec at 11.4 of https://fetch.spec.whatwg.org/#http-network-fetch (which happens after consulting the cache and missing). The integration could be much cleaner, but it does what you're asking for. What would you like added to the cookie spec to make that more transparent?

MattMenke2 commented 4 years ago

So sending cookies hooks in above the cache layer, and setting them hooks in above it, in the spec? That seems unfortunate.