python-hyper / h11

A pure-Python, bring-your-own-I/O implementation of HTTP/1.1
https://h11.readthedocs.io/
MIT License
490 stars 62 forks source link

Server wait forever on obviously incorrect but incomplete HTTP request #121

Closed touilleMan closed 3 years ago

touilleMan commented 3 years ago

I've encountered this behavior by initiating an HTTPS request with a client to a server expecting HTTP. So the server end up with a TLS "Client Hello" frame which doesn't contains the \r\n\r\n pattern that should mark the end of the header part of the HTTP request.

Hence h11 just consider the request is not over yet, hence the server starts waiting forever while the client is waiting for a TLS "Server Hello" answer :'(

https://github.com/python-hyper/h11/blob/522b00470b5776dda2dc7e98ff778eb1e418e6ad/h11/_readers.py#L70-L81

https://github.com/python-hyper/h11/blob/522b00470b5776dda2dc7e98ff778eb1e418e6ad/h11/_receivebuffer.py#L101-L112

A quick&dirty test on various websites (heroku.com using cowboy, yahoo.com using ATS and google.com) shows that parsing of the HTTP request start line occurs as soon as it is received (and so a 400 Invalid HTTP Request is returned instead of waiting indefinitely).

I didn't check if this behavior is specified in the RFC (I guess it's not ^^), however it seems a legit behavior... or at least something worth talking about ;-)

njsmith commented 3 years ago

shows that parsing of the HTTP request start line occurs as soon as it is received

The problem is that "it is received" isn't well-defined here, since the connection isn't sending an HTTP request start line of any sort. I think what might be happening is that those servers wait until the first time they see \n, and then try to parse whatever they saw as a request line? And you're getting lucky and the TLS ClientHello messages happen to contain a \n somewhere? There's a basic tradeoff here between parsing speed and detecting errors as quickly as possible: in principle one could write a parser in python that handles each byte fully incrementally as it arrives, and raises an error as soon as it sees the first byte that can't be part of an HTTP request. In practice, this would be ruinously slow...

And you already need a timeout in the server to reject other forms of bad requests that can't be detected even in principle. E.g. if a client connects and sends b"asdf" and then stops, there's no way an HTTP parser can possibly detect that that's a bad connection, because asdf is a valid HTTP method name, and if we never receive any more data after that then we never see anything that breaks the HTTP spec.

So I can see an argument for doing better here in cases where we can, but I don't think there's a slam-dunk obvious way to do it – the question is whether there's some heuristic that's cheap and reliable enough to be worth doing.

njsmith commented 3 years ago

Hmm. Though on some further research, it looks like TLS connections always start by the client sending the byte 0x16, which is not a printable ascii character and in particular is not allowed to be part of an HTTP method name. So I think we could use a heuristic that just looks at the first byte, and either validates that it's not 0x16 (super cheap), or that it's a byte that can legally appear in an HTTP token (slightly more thorough, slightly more complicated).

touilleMan commented 3 years ago

Thanks @njsmith for the quick answer ;-)

I think what might be happening is that those servers wait until the first time they see \n, and then try to parse whatever they saw as a request line?

Indedd my payload did contain some \r and \n (though no \r\n sequence), however I did a bit more tests and discovered something interesting:

import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM); s.connect(("heroku.com", 80))
# s.sendall(b'\x00')  # `s.recv` hangs
s.sendall(b'\n')  # `s.recv` returns the HTTP 400 response
print(s.recv(15))

So on heroku.com (using cowboy server), things works just like you say (the server wait until it sees the first \n, then start processing what should be the HTTP request start line)

import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM); s.connect(("google.com", 80))
s.sendall(b"\x21")  # This cause the `s.recv` to hang
#  s.sendall(b"\x20")  # `s.recv` returns the HTTP 400 response
# s.sendall(b"\x21\x20")  # `s.recv` also hangs !
print(s.recv(15))

Google server checks for the first byte (and only the first byte) to make sure it is in the range 0x21-0xFF (it didn't check for the upper bound).

import socket
# s = socket.socket(socket.AF_INET, socket.SOCK_STREAM); s.connect(("yahoo.com", 80))
s.sendall(b"\x01") # `s.recv` hangs
# s.sendall(b"\x00")  # `s.recv` returns the HTTP 400 response
# s.sendall(b"\x01\x00")  # `s.recv` returns the HTTP 400 response
print(s.recv(15))

Finally yahoo.com (using ATS) only abort early if the incoming buffer contains a 0x00 somewhere

So I think we could use a heuristic that just looks at the first byte, and either validates that it's not 0x16 (super cheap),

I was considering doing something like this as a workaround, but it seems like a super ad-hoc solution (0x16 is for protocol version, so it may change at anytime)

I thing we should instead rely on what Google is doing, given as you say checking for the first byte is super cheap and covers more error case than only checking for 0x16 ;-)

njsmith commented 3 years ago

Sure, that makes sense. Want to make a PR?

touilleMan commented 3 years ago

@njsmith #122 is here \o/