Closed Coder-256 closed 2 years ago
Hm, I'm surprised at the results, but then again, I shouldn't be, the HTTP1 spec has been ignored at every turn :D
Curious, do you know what other "libraries" do, like curl or golang?
I haven't tested this, but it looks like golang
errors on headers missing a colon but otherwise just parses it as a valid header
I also have confirmed that curl
actually parses it as a valid header:
$ curl "$URL" -s -o /dev/null -D -
HTTP/1.1 200 OK
Date: Tue, 10 Sep 2019 18:50:02 GMT
Server: Apache/2.4.xx (Debian)
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http: //www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
Transfer-Encoding: chunked
I took a quick look but its code is very dense. The header parsing logic is around Curl_http_readwrite_headers
if you want to look further. It should be noted that curl
doesn't actually do anything with headers except for important ones (ex. Transfer-Encoding
). In fact AFAIK the most it will do in terms of parsing arbitrary headers is give you the entire message header with -D
like above.
It was a good idea to check these libraries, I didn't realize that they would handle things differently. I guess there are 3 behaviors to choose from:
curl
and golang
)httparse
)I'll need to think a bit more about which would be better. What do you think?
Edit: After thinking about it a bit more, I think that especially since this project is used by apps like servo
, it should definitely mimic Chrome/Firefox. This is a real example of a page that wouldn't load even though it works in major browsers.
@seanmonstar Have you had a chance to look into this? I'll start working on a pull request, I just wanted to check if you agree that this is the correct behavior.
Oh my, I completely forgot about this issue, sorry!
I suppose it makes sense to do what browsers do. I have a nagging worry that if ignore invalid headers, is there a point where we're parsing things that really aren't HTTP and saying it is?
Well if the status line is intact, including HTTP version and all, I think it's definitely supposed to be HTTP. The headers are only parsed if the status line is valid.
I'll start making the PR.
FWIW this behaviour was completely killed from Squid last year.
https://github.com/squid-cache/squid/commit/b453677bc1de131d88bf865d01afdc69dcef37a2
@Coder-256 Did you ever make a patch for this?
Found another website affected by this.
curl -I https://dbegoals.azdot.gov/
00000000 58 e2 80 90 46 72 61 6d 65 e2 80 90 4f 70 74 69 |X...Frame...Opti|
00000010 6f 6e 73 3a 20 44 45 4e 59 |ons: DENY|
The hyphens aren't ASCII in the first X-Frame-Options
entry.
@nox I completely forgot! Also incredible catch, U+2010 strikes again. Anyway, after looking at this further, I am convinced that simply skipping invalid headers is the right option. In this case, Chromium and Firefox both ignore the invalid header while it looks like Safari actually treats the header name as a valid Latin 1-encoded header for whatever reason; none of them fail parsing entirely, as httparse does. I'll open a PR.
https://github.com/seanmonstar/httparse/pull/114 will let you ignore the invalid header.
@nox Thank you very much, lgtm! Sorry I didn’t get to this sooner.
@Coder-256 No worries.
I am trying to use
reqwest
to parse a response from a server I don't control.reqwest
useshyper
which useshttparse
for parsing HTTP/1.x headers. Anyway, this server has a weird bug where it consistently returns a single corrupted header line in an otherwise completely valid response (the header contains unescaped non-token characters). Specifically, for some reason it tries to send theDOCTYPE
as a header. The bug is unlikely to be fixed (this is old software), but it isn't really a problem because the page displays fine in all major browsers.It seems that all major browsers simply ignore invalid header lines. However,
httparse
returns an error that aborts the entire parsing process. IMO this is a problem and should be fixed.Here's a screenshot from Chrome that shows the invalid header being ignored:
In fact, Chrome's behavior is commented as: "skip malformed header".
Although technically changing this could be breaking, in this case, I can't imagine that any code would rely on response parsing to fail in this particular case.
Here are the relevant lines:
https://github.com/seanmonstar/httparse/blob/6f696f5d027f35e11a70181c839b574e20335a74/src/lib.rs#L594-L595
https://github.com/seanmonstar/httparse/blob/6f696f5d027f35e11a70181c839b574e20335a74/src/lib.rs#L613-L614
https://github.com/seanmonstar/httparse/blob/6f696f5d027f35e11a70181c839b574e20335a74/src/lib.rs#L672
https://github.com/seanmonstar/httparse/blob/6f696f5d027f35e11a70181c839b574e20335a74/src/lib.rs#L613-L614
I think all of these would be fixed by consuming
b
until the next newline thencontinue 'headers
. I would open a PR but I just want to check that you agree that this change should be made.