crusader-mike opened this issue 6 years ago
Thanks for reporting the bug. Let me explain (from what I remember, it's been a while) how reading a POST request works; it might help us understand what's going on and pinpoint the issue:
As long as the whole request fits in the request buffer, it doesn't matter much how many segments it takes to fill it. There's a limit on the number of calls to read() that can be performed, as a way to deter attacks such as slowloris, but it should be sufficient for most uses. When it is determined that it's a POST request, the function to read the request body is called again if the amount of data already in the buffer is insufficient according to the Content-Length in the request headers. The buffer for the POST request body is allocated dynamically, and if it's over a certain threshold (a megabyte? I don't remember), a temporary file is created instead to hold it and mapped into memory, so the rest of the code works as usual. There is a similar provision to limit the time allowed to fill this buffer.
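To make that concrete, here's a rough sketch of the large-body spill idea; the 1 MiB threshold, the names, and the error handling are all illustrative guesses, not Lwan's actual code:

```c
/* Illustrative sketch only -- not Lwan's code.  The threshold and all
 * names here are assumptions based on the description above. */
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define BODY_IN_MEMORY_THRESHOLD (1 << 20)    /* ~1 MiB, per the description */

/* Returns a writable buffer of `content_length` bytes: malloc'ed for small
 * bodies, backed by an unlinked temporary file and mmap'ed for large ones,
 * so the rest of the code can treat both cases as plain memory. */
static void *allocate_body_buffer(size_t content_length, int *out_fd)
{
    *out_fd = -1;

    if (content_length <= BODY_IN_MEMORY_THRESHOLD)
        return malloc(content_length);

    char path[] = "/tmp/body-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return NULL;
    unlink(path);                             /* file vanishes when fd closes */

    if (ftruncate(fd, (off_t)content_length) < 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, content_length, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return NULL;
    }

    *out_fd = fd;
    return p;
}
```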
Increasing the default buffer size might be a solution, since both your request headers and request body might then fit comfortably in that buffer. However, it's a palliative: the coroutine stack size must be increased as well (it's CORO_STACK_MIN, IIRC, inside lwan-coro.c), otherwise the server may crash. There's a static_assert() in the code, but that doesn't seem to be sufficient.
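I don't remember the exact invariant that static_assert() encodes, so the following is only a guess at the shape of such a compile-time check; the constants and the factor are placeholders, not the real relation in lwan-coro.c:

```c
/* Hypothetical sketch: the real check may use a different relation.  The
 * intent is just to fail the build when DEFAULT_BUFFER_SIZE grows without
 * CORO_STACK_MIN growing with it. */
#include <assert.h>

#define DEFAULT_BUFFER_SIZE 4096              /* placeholder value */
#define CORO_STACK_MIN      (4 * 4096)        /* placeholder value */

static_assert(CORO_STACK_MIN >= 4 * DEFAULT_BUFFER_SIZE,
              "coroutine stack too small for the request buffer");
```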
The connection may be closed if an unrecoverable error happens while reading from the request socket, both when reading the request headers and when reading the request body. Basically, every time a connection coroutine yields with CONN_CORO_ABORT, its associated socket is shut down.
Running Lwan under strace while it receives the POST request might shed some light on why the connection is being closed; read_from_request_socket() will only abort if read() returns 0, or if it returns -1 with an error other than EAGAIN or EINTR. Something else might be happening here that isn't being handled properly, causing the connection to be closed prematurely.
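For reference, the abort rule boils down to something like this (a simplified sketch with invented names, not the actual read_from_request_socket() internals):

```c
/* Illustrative sketch of the abort rule described above; not Lwan's code. */
#include <errno.h>
#include <sys/types.h>

enum read_outcome { READ_OK, READ_RETRY, READ_ABORT };

/* Classify the result of a single read() attempt: 0 bytes or an error other
 * than EAGAIN/EINTR means the connection coroutine should abort. */
static enum read_outcome classify_read(ssize_t n)
{
    if (n > 0)
        return READ_OK;
    if (n == 0)
        return READ_ABORT;                    /* peer closed the socket */
    if (errno == EAGAIN || errno == EINTR)
        return READ_RETRY;                    /* yield and try again later */
    return READ_ABORT;                        /* unrecoverable error */
}
```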
I know exactly where it decides to close the connection -- there are two "finalizers"; one of them returns a "buffer is full" error code when read_size == buffer_size (which later gets translated into a 413). I didn't spend enough time digging through the code to figure out what and why, but it seems the whole buffer-handling logic is flawed -- besides the points mentioned above, it should also fail when I use HTTP pipelining (i.e. submit multiple requests in a batch) and the combined length of those requests is >4kb (again, depending on how the network stack decides to fragment my data stream).
I'll make a guess -- there are (at least) two code paths that read from the socket into a buffer: one reads data when Content-Length is known, the other when it isn't. I bet the second one assumes there shouldn't be more than 4kb of data ready to be read (hence the difference in the 'finalizer' functions' behavior).
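To make the guess concrete, the two paths might look roughly like this (pure speculation, all names invented):

```c
/* Speculative sketch of the two code paths guessed at above; this is not
 * Lwan's actual code. */
#include <stddef.h>

enum finalizer_status { FINALIZER_DONE, FINALIZER_NEED_MORE, FINALIZER_TOO_LARGE };

/* Path used when Content-Length is known: a full buffer just means the rest
 * of the body will be read separately, so it's not an error. */
static enum finalizer_status
finalize_known_length(size_t read_size, size_t header_size, size_t content_length)
{
    size_t body_in_buffer = read_size - header_size;   /* assumes headers already parsed */

    if (body_in_buffer >= content_length)
        return FINALIZER_DONE;
    return FINALIZER_NEED_MORE;               /* keep reading the body elsewhere */
}

/* Path used when Content-Length is unknown: assumes the whole request fits in
 * one buffer, so a full buffer is treated as "request too large" (413), even
 * when it's really pipelined requests or one large TCP segment. */
static enum finalizer_status
finalize_unknown_length(size_t read_size, size_t buffer_size)
{
    if (read_size == buffer_size)
        return FINALIZER_TOO_LARGE;
    return FINALIZER_DONE;
}
```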
Indeed, the buffer handling is flawed. Pipelined requests fail when the combined requests are (or would be) over 4kb.
Your guess is almost there: one part reads the request (method, headers, etc.). If it's determined to be a POST request, it will honor Content-Length. If there's enough data in the buffer, nothing is read again; if there isn't, whatever is in the 4kb buffer gets copied to a new buffer, and the rest is read from the socket until Content-Length bytes of the body have been read.
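Roughly, that flow looks like this (a simplified sketch with invented names and error handling, not the actual implementation):

```c
/* Illustrative sketch of the flow described above; not Lwan's code. */
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy whatever part of the body is already in the request buffer, then keep
 * reading from the socket until Content-Length bytes are buffered.
 * Returns the body buffer (caller frees) or NULL on error. */
static char *read_post_body(int fd, const char *buf_after_headers,
                            size_t already_buffered, size_t content_length)
{
    if (already_buffered > content_length)
        already_buffered = content_length;    /* extra bytes belong to the next pipelined request */

    char *body = malloc(content_length);
    if (!body)
        return NULL;

    memcpy(body, buf_after_headers, already_buffered);

    size_t have = already_buffered;
    while (have < content_length) {
        ssize_t n = read(fd, body + have, content_length - have);

        if (n > 0) {
            have += (size_t)n;
        } else if (n == 0) {
            free(body);                       /* peer closed before sending the whole body */
            return NULL;
        } else if (errno != EAGAIN && errno != EINTR) {
            free(body);
            return NULL;
        }
        /* The real server yields the coroutine on EAGAIN instead of spinning. */
    }
    return body;
}
```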
I don't have much time to work on Lwan any longer, but I'll find some time to take a closer look at the buffer management. One of the things I've been thinking about is to just use a ring buffer, which would speed up pipelining and help with big requests (e.g. those with large cookies).
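The kind of ring buffer I have in mind is nothing fancy -- something along these lines (just a sketch of the idea, not a proposed patch):

```c
/* Minimal power-of-two ring buffer sketch, for illustration only. */
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 4096                        /* must be a power of two */

struct ring_buffer {
    uint8_t data[RING_SIZE];
    size_t head;                              /* total bytes written */
    size_t tail;                              /* total bytes consumed */
};

static size_t ring_used(const struct ring_buffer *rb) { return rb->head - rb->tail; }
static size_t ring_free(const struct ring_buffer *rb) { return RING_SIZE - ring_used(rb); }

/* Append up to `len` bytes, wrapping around the end of the storage. */
static size_t ring_write(struct ring_buffer *rb, const uint8_t *src, size_t len)
{
    if (len > ring_free(rb))
        len = ring_free(rb);

    for (size_t i = 0; i < len; i++)
        rb->data[(rb->head + i) & (RING_SIZE - 1)] = src[i];

    rb->head += len;
    return len;
}

/* Consume up to `len` bytes into `dst`. */
static size_t ring_read(struct ring_buffer *rb, uint8_t *dst, size_t len)
{
    if (len > ring_used(rb))
        len = ring_used(rb);

    for (size_t i = 0; i < len; i++)
        dst[i] = rb->data[(rb->tail + i) & (RING_SIZE - 1)];

    rb->tail += len;
    return len;
}
```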
I've created a trivial web server with Lwan that responds to each POST request with HTTP code 200 -- it's used for testing another system's performance.
I noticed that when a POST request reaches 4kb, there is a good chance Lwan suddenly closes the HTTP/1.1 connection without any good reason (it should stay open). I used Wireshark and found that if the request gets split into two TCP segments (frames), each less than 4kb, everything works fine. But if one of them is >4kb (or the entire request is transferred in one frame), my Lwan server responds with a 413 code and a Content-Length of ~1k, but there is no body -- the connection gets RST'ed before the body is sent. On the client end (since the body is never received) I never observe the 413 -- I observe "server suddenly dropped connection". Needless to say, this is not good behavior, as I have no control over TCP segmentation.
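For anyone trying to reproduce: a minimal client sketch that sends the whole request in a single write(), which on loopback usually arrives as one large TCP segment (the host, port, and path below are placeholders for my test setup):

```c
/* Minimal repro client sketch: sends a >4 KiB POST in one write().
 * Host, port, and request path are placeholders. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    enum { BODY_SIZE = 5000 };                /* comfortably over 4 KiB */

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(8080),              /* placeholder port */
    };
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    char body[BODY_SIZE];
    memset(body, 'x', sizeof(body));

    char request[BODY_SIZE + 256];
    int header_len = snprintf(request, sizeof(request),
                              "POST / HTTP/1.1\r\n"
                              "Host: 127.0.0.1\r\n"
                              "Content-Length: %d\r\n"
                              "Connection: keep-alive\r\n\r\n", BODY_SIZE);
    memcpy(request + header_len, body, sizeof(body));

    /* Single write so the request is not split by the application. */
    ssize_t sent = write(fd, request, (size_t)header_len + sizeof(body));
    (void)sent;

    char response[8192];
    ssize_t n = read(fd, response, sizeof(response) - 1);
    if (n > 0) {
        response[n] = '\0';
        printf("%s\n", response);
    } else {
        printf("connection dropped (read returned %zd)\n", n);
    }

    close(fd);
    return 0;
}
```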
After spending about an hour looking around the Lwan code, I tracked the problem to the DEFAULT_BUFFER_SIZE value. Alas, increasing it to 64k and recompiling the library and server (which built without problems) resulted in a crashing server. Setting it to 16k seems to work OK -- that's fine for my needs, but isn't ideal in the general case either.
Note: all this happened on CentOS 7 x64