Closed annevk closed 4 years ago
To clarify, this is about the CRLF in the HTTP-message production.
Covered here: http://httpwg.org/specs/rfc7230.html#message.robustness
... but I agree it could be more visible (e.g., in 3).
And clients also accept lone CR. It seems worth testing either that they reject (network error) or adding that.
@annevk see #68 - in situ, see https://httpwg.org/http-core/draft-ietf-httpbis-messaging-latest.html#message.parsing
I'm still not sure it's obvious enough to someone reading in isolation; thoughts?
@mnot that text does not treat CR without LF as a newline, unless I'm missing something. That text even seems to suggest that CRCRLF would be a single newline, whereas I'm pretty sure it ends up being treated as two newlines.
@annevk I'm pretty sure CR alone is not valid, because I remember such a discussion a few years ago by the time haproxy used to accept this, after which I removed support for this. CRCRLF could at best be considered as a CR at the end of a line, though normally it's not a valid character there. It may possibly be replaced by LWS though.
Just found it, its in RFC7230#3.5 : https://trac.tools.ietf.org/html/rfc7230#section-3.5
Although the line terminator for the start-line and header fields is the sequence CRLF, a recipient MAY recognize a single LF as a line terminator and ignore any preceding CR.
Right, I realize it's not valid, but I also know that clients nevertheless accept it, which is why I raised this.
All clients?
If they don't, they're not interoperable on the Web as we know it.
The Web as we know it does not contain any CR-only line-delimited protocol services other than those deliberately attempting response smuggling. The protocol specifically forbids bare CR as a line delimiter, since that is known to not be interoperable with the vast majority of protocol parsers that exclusively check for LF and ignore CR. This was a big deal in 1994, but even then it was obvious that CR-delimited messages wouldn't make it through proxies. Nobody has objected to it since OS X was released and the last of the native Mac servers disappeared.
If some browsers interpret CR as a line delimiter in HTTP protocol fields (other than payload content), they have security holes that should be fixed. No change to the RFC is possible here; we can't make changes that introduce new security holes to previously compliant parsers.
I wrote some tests for this: https://github.com/web-platform-tests/wpt/pull/13524. Suggestions for more tests appreciated. Firefox seems to generally take the CR and include it in the value of whatever preceded it (reason phrase or header value, despite those not allowing this character). Chrome and Safari treat it as LF in those places, but they do not allow two CRs to end the header block (network error in Safari, empty body in Chrome).
It does seem that a stricter stance is possible here, given the intersection of the results.
cc @ddragana @MattMenke2
If we can be stricter here and just fail on CRs in headers, that sounds good to me. That having been said, I think changing behavior to anything but just rejecting the response would be worse than the status quo (Treating CRs as part of header values, for instance).
Yeah, it seems like CR CR could be treated as a network error at least. CR "after" reason phrase or individual header is less clear. But I agree that making CR part of the value isn't good.
Regarding at the end of the line, we [currently say](https://httpwg.org/http-core/draft-ietf-httpbis-messaging-latest.html#message.parsing:
Although the line terminator for the start-line and header fields is the sequence CRLF, a recipient may recognize a single LF as a line terminator and ignore any preceding CR.
Inside the value, it could be obs-fold
, talked about here, provided that it has the appropriate whitespace before and after. That section currently requires:
A user agent that receives an obs-fold in a response message that is not within a message/http container must replace each received obs-fold with one or more SP octets prior to interpreting the field value.
CRLF
in the ABNF (and prose) needs to consider the appearance of just one of those characters.Finally, CR
could appear in a header value without surrounding whitespace as obs-fold
requires. It isn't allowed by field-content
here, and we say:
A recipient should treat other octets in field content (obs‑text) as opaque data.
CR
here -- e.g., a warning, a recommendation, or a requirement regarding handling it in content?OP's issue is specifically about how a H1 response/request is parsed into something more concrete and what we should consider an error there. https://github.com/httpwg/http-core/issues/31#issuecomment-429887526 goes into this best I think (prior comments from me were not informed by tests). One suggestion there is that sometimes CR can maybe be turned into an error.
Now as for what's valid in a value, I think that's best discussed and described separately from H1 parsing requirements as never H versions do not have the same parser limitations.
So, if I read that correctly, we could consider explicitly stating that CR
on its own is opaque data as far as h1 message parsing is concerned.
We must not consider it as opaque data or it risks to introduce a new class of issues which is that depending where it appears on the line it will be passed as-is or silently dropped. It may even happen that some proxies finding it before a comma would keep it and others unfolding the value into multiple headers would drop it.
I'm really in favor of clearly stating that it's strictly forbidden anywhere in the header block, except if it precedes an LF character in which case both constitute a CRLF end of line marker. By the way I think it's really not widespread: after probably more than a decade with this isolated character rejected by haproxy we haven't had any single report of an abusive blocking, while we've got such occasional reports of rejects of forbidden characters in header field names for example. So I think we should keep this strict.
Another point is that the text "ignore any CR before LF" can make one think that multiple CR could be ignored (as the CRCRLF Anne suggested), which was apparently not the intent of the original text, and which apparently led to this by trying to simplify the wording for "LF or CRLF".
Reject the message or convert to SP?
E.g., something like:
One or more CR characters not succeeded by LF MUST NOT be generated; recipients MUST either reject a message containing them, or convert them to SP before processing or forwarding the message.
We used to call this a bare CR (CR not immediately followed by LF).
+1 on that handling.
Next question is where to put this. I actually think it needs to be in two places:
Make sense?
Popular clients handle LF, CR, and CRLF as newlines. HTTP only defines CRLF.
Previously: https://lists.w3.org/Archives/Public/ietf-http-wg/2014JulSep/0123.html.