python-hyper / h11

A pure-Python, bring-your-own-I/O implementation of HTTP/1.1
https://h11.readthedocs.io/
MIT License

h11 fails on multiple targets where other HTTP clients work #95

Open fbexiga opened 4 years ago

fbexiga commented 4 years ago

Consider, for example, this snippet of code:

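import socket

import h11

def run(target):
    # Client-side h11 state machine; the socket I/O is done by hand.
    conn = h11.Connection(our_role=h11.CLIENT)
    sock = socket.create_connection((target, 80))
    request = h11.Request(method="GET", target="/", headers=[("Host", target)])
    data = conn.send(request)
    sock.sendall(data)
    # Read one chunk of the response and let h11 parse it.
    data = sock.recv(2048)
    conn.receive_data(data)
    return conn.next_event()
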
>>> run("100.33.56.173")
h11._util.RemoteProtocolError: multiple Content-Length headers
>>> run("220.181.136.243")
h11._util.RemoteProtocolError: Response status_code should be in range [200, 600), not 600

Other errors of the same kind that I've encountered include:

h11._util.RemoteProtocolError: malformed data
h11._util.RemoteProtocolError: Receive buffer too long

These are all basically misconfigured servers, some of them even violating the protocol specs, but they show up a lot in the wild. I think these responses should work nonetheless, as most HTTP clients don't impose these kinds of restrictions and they let users see the underlying data despite the misconfigurations.

njsmith commented 4 years ago

Can you file individual bugs for the different issues? "h11 should work" is too vague to figure out actual code changes :-). The problem is to figure out what exactly servers are doing that h11 needs to support.

Multiple content-lengths already has an issue here: #92
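
For reference, RFC 7230 (section 3.3.2) lets a recipient either reject a message with repeated Content-Length fields or, when every value is the same decimal number, replace them with a single field. The following is a minimal sketch of the second option, not h11's behaviour or a proposed patch; the function name `collapse_content_length` and the list-of-pairs header representation are assumptions made for illustration.

```python
def collapse_content_length(headers):
    """Merge duplicate Content-Length fields that all carry the same value.

    `headers` is a list of (name, value) string pairs. Conflicting or
    non-numeric values raise ValueError. The merged field is appended at
    the end, so the original header order is not preserved.
    """
    lengths = set()
    others = []
    for name, value in headers:
        if name.lower() == "content-length":
            # A single field may itself carry a list, e.g. "42, 42".
            for part in value.split(","):
                part = part.strip()
                if not (part.isascii() and part.isdigit()):
                    raise ValueError(f"invalid Content-Length: {value!r}")
                lengths.add(int(part))
        else:
            others.append((name, value))
    if len(lengths) > 1:
        raise ValueError(f"conflicting Content-Length values: {sorted(lengths)}")
    if lengths:
        others.append(("Content-Length", str(lengths.pop())))
    return others
```

With this sketch, two identical `Content-Length: 42` fields collapse into one, while `1` and `2` together are still rejected, which keeps the framing unambiguous.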

What on earth is a "600" response? I've never heard of that.

"Malformed data" means that one of h11's parsing regexps failed. Need more details to figure out which one needs to be loosened and how.

"Receive buffer too long" probably means that the headers were >16384 bytes, which is the default max_incomplete_event_size: https://h11.readthedocs.io/en/latest/api.html#the-connection-object This is already configurable, though we could potentially change the default if there's a good reason. The current value is pretty arbitrary; I based it on looking at some HTTP servers and picked something in the same ballpark:

https://github.com/python-hyper/h11/blob/68e32dbb475d5241f077afaa278b6ef248b1f9bd/h11/_connection.py#L23-L32

Probably it would make sense to look at clients too, though. Apparently curl has a hardcoded limit of 102400? https://curl.haxx.se/mail/lib-2019-09/0023.html

fbexiga commented 4 years ago

The 600 response doesn't actually exist in the standard; it's something that whoever configured the server made up. Nonetheless, that shouldn't be a reason to reject the response.

These were found using the httpx library. More examples and discussion here: https://github.com/encode/httpx/issues/767

njsmith commented 4 years ago

Here's an issue for one possible cause of the "malformed data" – not sure if it's the one you saw or not. (Or maybe you saw multiple, I dunno)

njsmith commented 4 years ago

Whoops, I meant this issue: https://github.com/python-hyper/h11/issues/97

cancan101 commented 3 years ago

LinkedIn is an example of where the status code >= 600 comes up in the wild. They are (in)famous for returning 999 status codes. See: https://stackoverflow.com/questions/27231113/999-error-code-on-head-request-to-linkedin or just run curl -I --url https://www.linkedin.com/company/linkedin
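
To make the 600 and 999 cases concrete, here is a minimal sketch of a lenient status-line parser that accepts any three-digit code and leaves it to the caller to decide what to do with it. This is not h11's parser; the `STATUS_LINE` pattern and `parse_status_line` function are hypothetical names, and the pattern is deliberately looser than the HTTP grammar.

```python
import re

# Hypothetical lenient pattern: any three-digit code, optional reason phrase.
STATUS_LINE = re.compile(rb"HTTP/(\d\.\d) (\d{3})(?: ([^\r\n]*))?")

def parse_status_line(line):
    """Return (version, status_code, reason) from a raw status line."""
    match = STATUS_LINE.fullmatch(line.rstrip(b"\r\n"))
    if match is None:
        raise ValueError(f"malformed status line: {line!r}")
    version, code, reason = match.groups()
    # The reason phrase may be missing entirely; treat that as empty.
    return version.decode("ascii"), int(code), (reason or b"").decode("latin-1")
```

For example, `parse_status_line(b"HTTP/1.1 999 Request denied\r\n")` yields the code 999 instead of raising, so a caller could surface the response to the user and flag the code as non-standard.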