httparse should skip invalid headers

Coder-256 commented 5 years ago

I am trying to use reqwest to parse a response from a server I don't control. reqwest uses hyper which uses httparse for parsing HTTP/1.x headers. Anyway, this server has a weird bug where it consistently returns a single corrupted header line in an otherwise completely valid response (the header contains unescaped non-token characters). Specifically, for some reason it tries to send the DOCTYPE as a header. The bug is unlikely to be fixed (this is old software), but it isn't really a problem because the page displays fine in all major browsers.

It seems that all major browsers simply ignore invalid header lines. However, httparse returns an error that aborts the entire parsing process. IMO this is a problem and should be fixed.

Here's a screenshot from Chrome that shows the invalid header being ignored:

In fact, Chrome's behavior is commented as: "skip malformed header".

Although technically changing this could be breaking, in this case, I can't imagine that any code would rely on response parsing to fail in this particular case.

Here are the relevant lines:

https://github.com/seanmonstar/httparse/blob/6f696f5d027f35e11a70181c839b574e20335a74/src/lib.rs#L594-L595

https://github.com/seanmonstar/httparse/blob/6f696f5d027f35e11a70181c839b574e20335a74/src/lib.rs#L613-L614

https://github.com/seanmonstar/httparse/blob/6f696f5d027f35e11a70181c839b574e20335a74/src/lib.rs#L672

https://github.com/seanmonstar/httparse/blob/6f696f5d027f35e11a70181c839b574e20335a74/src/lib.rs#L613-L614

I think all of these would be fixed by consuming b until the next newline then continue 'headers. I would open a PR but I just want to check that you agree that this change should be made.

seanmonstar commented 4 years ago

Hm, I'm surprised at the results, but then again, I shouldn't be, the HTTP1 spec has been ignored at every turn :D

Curious, do you know what other "libraries" do, like curl or golang?

Coder-256 commented 4 years ago

I haven't tested this, but it looks like golang errors on headers missing a colon but otherwise just parses it as a valid header

I also have confirmed that curl actually parses it as a valid header:

$ curl "$URL" -s -o /dev/null -D -
HTTP/1.1 200 OK
Date: Tue, 10 Sep 2019 18:50:02 GMT
Server: Apache/2.4.xx (Debian)
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http: //www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
Transfer-Encoding: chunked

I took a quick look but its code is very dense. The header parsing logic is around Curl_http_readwrite_headers if you want to look further. It should be noted that curl doesn't actually do anything with headers except for important ones (ex. Transfer-Encoding). In fact AFAIK the most it will do in terms of parsing arbitrary headers is give you the entire message header with -D like above.

It was a good idea to check these libraries, I didn't realize that they would handle things differently. I guess there are 3 behaviors to choose from:

Ignore invalid headers (like Chrome, Firefox, etc.)
Pass on invalid headers (like curl and golang)
Error on invalid headers (like httparse)

I'll need to think a bit more about which would be better. What do you think?

Edit: After thinking about it a bit more, I think that especially since this project is used by apps like servo, it should definitely mimic Chrome/Firefox. This is a real example of a page that wouldn't load even though it works in major browsers.

Coder-256 commented 4 years ago

@seanmonstar Have you had a chance to look into this? I'll start working on a pull request, I just wanted to check if you agree that this is the correct behavior.

seanmonstar commented 4 years ago

Oh my, I completely forgot about this issue, sorry!

I suppose it makes sense to do what browsers do. I have a nagging worry that if ignore invalid headers, is there a point where we're parsing things that really aren't HTTP and saying it is?

Coder-256 commented 4 years ago

Well if the status line is intact, including HTTP version and all, I think it's definitely supposed to be HTTP. The headers are only parsed if the status line is valid.

I'll start making the PR.

nox commented 3 years ago

FWIW this behaviour was completely killed from Squid last year.

https://github.com/squid-cache/squid/commit/b453677bc1de131d88bf865d01afdc69dcef37a2

nox commented 2 years ago

@Coder-256 Did you ever make a patch for this?

Found another website affected by this.

curl -I https://dbegoals.azdot.gov/

00000000  58 e2 80 90 46 72 61 6d  65 e2 80 90 4f 70 74 69  |X...Frame...Opti|
00000010  6f 6e 73 3a 20 44 45 4e  59                       |ons: DENY|

The hyphens aren't ASCII in the first X-Frame-Options entry.

Coder-256 commented 2 years ago

@nox I completely forgot! Also incredible catch, U+2010 strikes again. Anyway, after looking at this further, I am convinced that simply skipping invalid headers is the right option. In this case, Chromium and Firefox both ignore the invalid header while it looks like Safari actually treats the header name as a valid Latin 1-encoded header for whatever reason; none of them fail parsing entirely, as httparse does. I'll open a PR.

nox commented 2 years ago

https://github.com/seanmonstar/httparse/pull/114 will let you ignore the invalid header.

Coder-256 commented 2 years ago

@nox Thank you very much, lgtm! Sorry I didn’t get to this sooner.

nox commented 2 years ago

@Coder-256 No worries.

seanmonstar / httparse

httparse should skip invalid headers #61