Newline parsing in H1 - Githubissues

annevk commented 7 years ago

Popular clients handle LF, CR, and CRLF as newlines. HTTP only defines CRLF.

Previously: https://lists.w3.org/Archives/Public/ietf-http-wg/2014JulSep/0123.html.

annevk commented 6 years ago

To clarify, this is about the CRLF in the HTTP-message production.

mnot commented 6 years ago

Covered here: http://httpwg.org/specs/rfc7230.html#message.robustness

... but I agree it could be more visible (e.g., in 3).

annevk commented 6 years ago

And clients also accept lone CR. It seems worth testing either that they reject (network error) or adding that.

mnot commented 6 years ago

@annevk see #68 - in situ, see https://httpwg.org/http-core/draft-ietf-httpbis-messaging-latest.html#message.parsing

I'm still not sure it's obvious enough to someone reading in isolation; thoughts?

annevk commented 6 years ago

@mnot that text does not treat CR without LF as a newline, unless I'm missing something. That text even seems to suggest that CRCRLF would be a single newline, whereas I'm pretty sure it ends up being treated as two newlines.

wtarreau commented 6 years ago

@annevk I'm pretty sure CR alone is not valid, because I remember such a discussion a few years ago by the time haproxy used to accept this, after which I removed support for this. CRCRLF could at best be considered as a CR at the end of a line, though normally it's not a valid character there. It may possibly be replaced by LWS though.

wtarreau commented 6 years ago

Just found it, its in RFC7230#3.5 : https://trac.tools.ietf.org/html/rfc7230#section-3.5

Although the line terminator for the start-line and header fields is the sequence CRLF, a recipient MAY recognize a single LF as a line terminator and ignore any preceding CR.

annevk commented 6 years ago

Right, I realize it's not valid, but I also know that clients nevertheless accept it, which is why I raised this.

reschke commented 6 years ago

All clients?

mnot commented 6 years ago

If they don't, they're not interoperable on the Web as we know it.

royfielding commented 6 years ago

The Web as we know it does not contain any CR-only line-delimited protocol services other than those deliberately attempting response smuggling. The protocol specifically forbids bare CR as a line delimiter, since that is known to not be interoperable with the vast majority of protocol parsers that exclusively check for LF and ignore CR. This was a big deal in 1994, but even then it was obvious that CR-delimited messages wouldn't make it through proxies. Nobody has objected to it since OS X was released and the last of the native Mac servers disappeared.

If some browsers interpret CR as a line delimiter in HTTP protocol fields (other than payload content), they have security holes that should be fixed. No change to the RFC is possible here; we can't make changes that introduce new security holes to previously compliant parsers.

annevk commented 6 years ago

I wrote some tests for this: https://github.com/web-platform-tests/wpt/pull/13524. Suggestions for more tests appreciated. Firefox seems to generally take the CR and include it in the value of whatever preceded it (reason phrase or header value, despite those not allowing this character). Chrome and Safari treat it as LF in those places, but they do not allow two CRs to end the header block (network error in Safari, empty body in Chrome).

It does seem that a stricter stance is possible here, given the intersection of the results.

cc @ddragana @MattMenke2

MattMenke2 commented 6 years ago

If we can be stricter here and just fail on CRs in headers, that sounds good to me. That having been said, I think changing behavior to anything but just rejecting the response would be worse than the status quo (Treating CRs as part of header values, for instance).

annevk commented 6 years ago

Yeah, it seems like CR CR could be treated as a network error at least. CR "after" reason phrase or individual header is less clear. But I agree that making CR part of the value isn't good.

mnot commented 4 years ago

Regarding at the end of the line, we [currently say](https://httpwg.org/http-core/draft-ietf-httpbis-messaging-latest.html#message.parsing:

Although the line terminator for the start-line and header fields is the sequence CRLF, a recipient may recognize a single LF as a line terminator and ignore any preceding CR.

[ ] That seems to make it pretty clear it's not part of the value. If you think it could be more clear (or said somewhere else), please say so.

Inside the value, it could be obs-fold, talked about here, provided that it has the appropriate whitespace before and after. That section currently requires:

A user agent that receives an obs-fold in a response message that is not within a message/http container must replace each received obs-fold with one or more SP octets prior to interpreting the field value.

[ ] We should probably consider whether CRLF in the ABNF (and prose) needs to consider the appearance of just one of those characters.

Finally, CR could appear in a header value without surrounding whitespace as obs-fold requires. It isn't allowed by field-content here, and we say:

A recipient should treat other octets in field content (obs‑text) as opaque data.

[ ] Is it worth specifically addressing CR here -- e.g., a warning, a recommendation, or a requirement regarding handling it in content?

annevk commented 4 years ago

OP's issue is specifically about how a H1 response/request is parsed into something more concrete and what we should consider an error there. https://github.com/httpwg/http-core/issues/31#issuecomment-429887526 goes into this best I think (prior comments from me were not informed by tests). One suggestion there is that sometimes CR can maybe be turned into an error.

Now as for what's valid in a value, I think that's best discussed and described separately from H1 parsing requirements as never H versions do not have the same parser limitations.

mnot commented 4 years ago

So, if I read that correctly, we could consider explicitly stating that CR on its own is opaque data as far as h1 message parsing is concerned.

wtarreau commented 4 years ago

We must not consider it as opaque data or it risks to introduce a new class of issues which is that depending where it appears on the line it will be passed as-is or silently dropped. It may even happen that some proxies finding it before a comma would keep it and others unfolding the value into multiple headers would drop it.

I'm really in favor of clearly stating that it's strictly forbidden anywhere in the header block, except if it precedes an LF character in which case both constitute a CRLF end of line marker. By the way I think it's really not widespread: after probably more than a decade with this isolated character rejected by haproxy we haven't had any single report of an abusive blocking, while we've got such occasional reports of rejects of forbidden characters in header field names for example. So I think we should keep this strict.

Another point is that the text "ignore any CR before LF" can make one think that multiple CR could be ignored (as the CRCRLF Anne suggested), which was apparently not the intent of the original text, and which apparently led to this by trying to simplify the wording for "LF or CRLF".

mnot commented 4 years ago

Reject the message or convert to SP?

mnot commented 4 years ago

E.g., something like:

One or more CR characters not succeeded by LF MUST NOT be generated; recipients MUST either reject a message containing them, or convert them to SP before processing or forwarding the message.

royfielding commented 4 years ago

We used to call this a bare CR (CR not immediately followed by LF).

+1 on that handling.

mnot commented 4 years ago

Next question is where to put this. I actually think it needs to be in two places:

Semantics 4.4 Field Values
Messaging 2.2 Message Parsing

Make sense?

httpwg / http-core

Newline parsing in H1 #31