Mid-stream error semantics?

tigt commented 3 years ago

The problem

I have a website that uses HTTP/1.1 chunked transfer-encoding to incrementally show results from asynchronous backend API calls in a streamed HTML response. This response routes through an NGiNX reverse-proxy, a CDN, and then an unknown number of gateways, middleboxes, and inspectors (like antivirus programs) before reaching the requesting user-agent.

Sometimes, a backend API call fails: the server’s connection to it closes unexpectedly, the backend emits an error of its own, or any other of the myriad ways computers and networks attack. By that time, I’ve already sent an HTTP status code and headers, but I really want the ability to tell any consuming clients that the stream encountered an error and the response should now be considered invalid, otherwise:

- An HTTP cache may store the erroneous content and reuse it, showing users the error for longer than they otherwise would
- Search engines will index the erroneous content, since they received no sign they should try again or discard the response as invalid
- HTTP-level tools (debuggers, monitoring, curl, spiders, etc.) will report the response as successful, even though it wasn’t

Research/prior art

HTTP/0.9, /1.0

No mid-stream error signaling is possible: a prematurely terminated response is indistinguishable from one that ended normally, because the normal way to end a response is to close the connection. This limitation presumably motivated the later requirement that bodies carry a length indicator, either `content-length` or `transfer-encoding: chunked`.

HTTP/1.1

IETF Draft: HTTP/1.1 Messaging §8 Handling Incomplete Messages

In theory, HTTP/1.1 offered a way to convey more error information via chunk extensions, but no standard extensions ever emerged, and chunk extensions no longer exist in HTTP/2 and beyond.

If a chunked response doesn’t terminate with the zero-length end chunk, the client must assume that the response was incomplete, which, at the very least, means a cache should double-check with the server before reusing the stored incomplete response. There are two ways to emit such an incomplete response:

- Closing the TCP connection before any zero-length end chunk. This can be hard to convey to the user-agent, since a connection and its associated information are assumed to be hop-by-hop. It can also hurt performance when proxying through gateways, because it tears down warmed-up persistent connections, and it precludes adding HTTP-level debugging info in trailers, which seem the natural place to include it.
- Writing invalid transfer-encoding framing, such as missing or incorrect hex-encoded chunk lengths. Middleboxes will understandably truncate or attempt to repair such invalid responses, so the user-agent runs into the aforementioned problems.
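To make the "missing zero-length end chunk" rule concrete, here is a minimal sketch of how a client might tell a complete chunked body from a truncated or malformed one. The names `decode_chunked` and `IncompleteChunkedBody` are hypothetical, the whole body is assumed to be already buffered in memory, and trailers after the last chunk are not parsed.

```python
HEX_DIGITS = set(b"0123456789abcdefABCDEF")


class IncompleteChunkedBody(Exception):
    """The stream ended, or the framing broke, before the zero-length last chunk."""


def decode_chunked(raw: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked body, raising if it is truncated or malformed."""
    body = bytearray()
    pos = 0
    while True:
        line_end = raw.find(b"\r\n", pos)
        if line_end == -1:
            raise IncompleteChunkedBody("stream ended inside a chunk-size line")
        # Anything after ';' is a chunk extension; this sketch ignores it.
        size_field = raw[pos:line_end].split(b";", 1)[0].strip()
        if not size_field or not set(size_field) <= HEX_DIGITS:
            raise IncompleteChunkedBody(f"invalid chunk size {size_field!r}")
        size = int(size_field, 16)
        pos = line_end + 2
        if size == 0:
            # Zero-length last chunk seen: the body itself is complete.
            return bytes(body)
        chunk = raw[pos:pos + size]
        if len(chunk) < size or raw[pos + size:pos + size + 2] != b"\r\n":
            raise IncompleteChunkedBody("stream ended inside chunk data")
        body += chunk
        pos += size + 2
```

A caller that catches `IncompleteChunkedBody` would treat whatever it has received so far as suspect, for example by declining to cache it.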

HTTP/2

RFC 7540 §5.4.2 Stream Error Handling

An HTTP/2 stream can signal an application error by sending a RST_STREAM frame with an error code of 0x2 INTERNAL_ERROR… I think.
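For reference, this is roughly what such a frame looks like on the wire. It is a sketch based on the frame layout in RFC 7540 (9-byte frame header, then a 4-byte error code); `rst_stream_frame` is a hypothetical helper, and in practice an HTTP/2 library or server would build and send this frame rather than application code.

```python
import struct

RST_STREAM = 0x3      # frame type, RFC 7540 §6.4
INTERNAL_ERROR = 0x2  # error code, RFC 7540 §7


def rst_stream_frame(stream_id: int, error_code: int = INTERNAL_ERROR) -> bytes:
    """Serialize an HTTP/2 RST_STREAM frame for the given stream."""
    if not 0 < stream_id < 2**31:
        raise ValueError("stream_id must fit in 31 bits and be non-zero")
    payload = struct.pack(">I", error_code)           # 32-bit error code
    header = (
        len(payload).to_bytes(3, "big")               # 24-bit payload length
        + bytes([RST_STREAM, 0x00])                   # type, flags (none defined)
        + struct.pack(">I", stream_id)                # reserved bit 0 + 31-bit stream id
    )
    return header + payload


print(rst_stream_frame(1).hex())
```

The RST_STREAM payload carries only the 32-bit error code, so any additional detail (such as a retry hint) has no place in this frame.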

The following subsection, §5.4.3 Connection Termination, also suggests that prematurely closing the TCP connection can signal an error, which is straightforward to translate from HTTP/1.1 but inherits the same issues.

SPDY

I would love to not have to think about SPDY at all, but many CDNs and similar gateways will transparently downgrade to SPDY for older user-agents. Luckily, SPDY’s semantics more or less map to HTTP/2’s (see [IETF draft: SPDY Protocol §2.4.2. Stream error handling](https://tools.ietf.org/id/draft-mbelshe-httpbis-spdy-00.txt)), though [the hex code for `INTERNAL_ERROR` might be `0x6` instead](https://www.chromium.org/spdy/spdy-protocol/spdy-protocol-draft2#TOC-RST_STREAM).

HTTP/3

[HTTP/3 §8 Error Handling](https://quicwg.org/base-drafts/draft-ietf-quic-http.html#errors) seems to leave exact implementation open for experimentation, which is good overall but makes it harder for me to find a recommendation for my case. “H3_INTERNAL_ERROR (0x0102)” seems ideal, but its description as an error somewhere “in the HTTP stack” makes me wonder if it’s suitable for application-level use?

Gateways translating from earlier versions of HTTP might reasonably choose to surface the previous signaling methods such as malformed chunks as either “H3_FRAME_ERROR (0x0106)” or “H3_MESSAGE_ERROR (0x010e)” — should either of those be used in that scenario? The mapping between h2/h3 errors seems mostly concerned with mapping transport-level semantics.

I’m having a hard time understanding how QUIC would convey the same error information as prematurely-closed TCP connections when translating that signal from earlier HTTP versions. It does mention “the QUIC transport could indicate to the application layer that the connection has terminated”, but “could” does not suggest I can rely on that behavior.
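As a rough illustration of how an abrupt stream termination carrying `H3_INTERNAL_ERROR` could be expressed at the QUIC layer, the sketch below serializes a RESET_STREAM frame using QUIC's variable-length integer encoding as published in RFC 9000. The helper names `quic_varint` and `reset_stream_frame` are hypothetical, the frame is shown unencrypted purely for illustration, and whether this code is appropriate for application-level failures is exactly the open question above.

```python
H3_INTERNAL_ERROR = 0x0102  # HTTP/3 error code named above
RESET_STREAM = 0x04         # QUIC frame type, RFC 9000 §19.4


def quic_varint(value: int) -> bytes:
    """Encode an integer using QUIC's 1/2/4/8-byte variable-length format."""
    if value < 0:
        raise ValueError("varints are unsigned")
    # The two most significant bits of the first byte select the length.
    for prefix, nbytes in ((0b00, 1), (0b01, 2), (0b10, 4), (0b11, 8)):
        usable_bits = nbytes * 8 - 2
        if value < (1 << usable_bits):
            return (value | (prefix << usable_bits)).to_bytes(nbytes, "big")
    raise ValueError("value does not fit in 62 bits")


def reset_stream_frame(stream_id: int, final_size: int,
                       error_code: int = H3_INTERNAL_ERROR) -> bytes:
    """Serialize RESET_STREAM: type, stream id, application error code, final size."""
    fields = (RESET_STREAM, stream_id, error_code, final_size)
    return b"".join(quic_varint(field) for field in fields)


print(reset_stream_frame(stream_id=0, final_size=10).hex())
```

As with the HTTP/2 frame, the only application-visible payload is the error code itself, so richer error metadata would need some other channel.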

So what?

- Persisting mid-stream application errors through various HTTP versions seems like something core HTTP semantics should allow for.
- Guidance on how to signal mid-stream errors is hard to find, and I could only find guidance on translating those signals from HTTP/2 to HTTP/3. This is exacerbated by reverse proxies usually not bothering to support upstream connections newer than HTTP/1.1.
- Existing methods to signal mid-stream errors can easily cause performance problems or unexpected behavior when attempting to convey them all the way to the requester.
- While it’s theoretically possible to propagate 1 bit of error information (“is this response bad and shouldn’t be reused?”), other HTTP-level error data, such as retry-after, seem valuable to reuse.

mnot commented 3 years ago

This is out of scope for the HTTP core effort -- it would be considered a new feature.

However, see this draft and resulting list discussion. After a discussion at our last interim, it seems like there's interest in discussing this general area (not only for caching, but also other purposes, potentially), but it still needs "time to bake."

Probably the best way to move things forward is to participate in discussion on-list. Having more use cases fleshed out beyond caching will help to scope the work.

kazuho commented 3 years ago

Regarding HTTP/3, the concerns and solutions were discussed in https://github.com/quicwg/base-drafts/issues/3300.

tigt commented 3 years ago

Gotcha. I’ll probably post on the mailing list once I collect my thoughts into a more complete proposal, but for the moment I really want to know what I should do in a CTE (chunked transfer-encoding) stream to tell requesters that the response should be considered questionably-cacheable/damaged/non-authoritative/other stuff that 5XX errors get by default.
