seanmonstar / httparse

A push parser for the HTTP 1.x protocol in Rust.
https://docs.rs/httparse
Apache License 2.0

Parse from noncontiguous byte slices? #74

Open EkardNT opened 4 years ago

EkardNT commented 4 years ago

I am building an HTTP server on top of tokio that needs to perform minimal memory allocations per client TCP connection, regardless of how many HTTP requests are received through the connection. I have therefore chosen to use a fixed-size circular buffer to store the raw request data read from the wire, and now I am trying to use httparse to parse the request information. The problem I have run into is that the Request::parse function takes a single &[u8], but because I'm using a circular buffer I have two slices - one for the bytes in the remainder of the buffer, and one (optionally) for the bytes which wrapped around to the front of the buffer. This two-slice approach works very well with vectored IO reads, but so far not so well with httparse.

At first I was hoping that httparse's Request type would be persistent, so I could call parse in turn for each of the two slices. But that appears not to be how the API works - it expects the single slice to contain all the data, and when you call it a second time the same data must still be present, only with more added to the end.

Consequently, the only way I can find to use httparse today is to copy the data from the circular buffer into a secondary contiguous buffer. But the cost of such copying is potentially significant and I'd prefer to avoid it where possible. How feasible would it be to add some sort of parse_vectored function to httparse which takes a &[std::io::IoSlice]?
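
For reference, the copy-based workaround looks roughly like this (a simplified sketch; head and tail are just my names for the two circular-buffer regions, and a real version would reuse a persistent scratch buffer instead of allocating one per call):

```rust
// Copy the two ring-buffer regions into one contiguous buffer,
// because Request::parse only accepts a single &[u8].
fn parse_from_ring(head: &[u8], tail: &[u8]) -> httparse::Result<usize> {
    let mut scratch = Vec::with_capacity(head.len() + tail.len());
    scratch.extend_from_slice(head); // copy #1: bytes up to the end of storage
    scratch.extend_from_slice(tail); // copy #2: bytes that wrapped to the front
    let mut headers = [httparse::EMPTY_HEADER; 32];
    let mut req = httparse::Request::new(&mut headers);
    // The parsed fields borrow `scratch`, so they would have to be
    // inspected before this function returns.
    req.parse(&scratch)
}
```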

dbrgn commented 4 years ago

I ran into the same issue / misconception. I assumed I could just read from a socket into a buffer, pass this buffer into response.parse(...) and then reset the buffer before reading again.

I think that misconception comes from the way the docs for Response::parse are worded. From them, I assumed that the Response contains a growable buffer that will hold the entire response, and that I just feed it new bytes. However, that's not how it works.

If I understand the API correctly, the expectation is that the incoming bytes should be read into a buffer until \r\n\r\n is encountered (which indicates the end of the header section). This buffer is then passed to Response::parse. The Response borrows the buffer. After inspecting the header fields (and e.g. checking the content-length and/or transfer-encoding headers), the user can then drop the Response again (to free the borrow) and read the body bytes from the buffer / socket.
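
In code, I'd expect the flow to look roughly like this (untested sketch, error handling mostly omitted; read_header is just my name for it):

```rust
use std::io::Read;

// Sketch of the flow as I understand it: grow `buf` until httparse
// reports Complete, then read the body starting at the returned offset.
fn read_header<R: Read>(socket: &mut R, buf: &mut Vec<u8>) -> std::io::Result<usize> {
    let mut chunk = [0u8; 4096];
    loop {
        let n = socket.read(&mut chunk)?;
        if n == 0 {
            return Err(std::io::ErrorKind::UnexpectedEof.into());
        }
        buf.extend_from_slice(&chunk[..n]);

        let mut headers = [httparse::EMPTY_HEADER; 32];
        let mut res = httparse::Response::new(&mut headers);
        match res.parse(buf) {
            // Complete(end): the header section is buf[..end]. Inspect
            // res.code, content-length etc. right here, while `res`
            // and `headers` still borrow `buf`.
            Ok(httparse::Status::Complete(end)) => return Ok(end),
            // Partial: the final \r\n\r\n hasn't arrived yet. `headers`
            // drops at the end of this iteration, releasing its borrow
            // on `buf` so the next read can append to it.
            Ok(httparse::Status::Partial) => continue,
            Err(_) => return Err(std::io::ErrorKind::InvalidData.into()),
        }
    }
}
```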

Is this correct? If yes, maybe the docs could be clarified 🙂

Hawk777 commented 4 years ago

Actually, dropping the Response object doesn’t help, because calling Response::new followed by Response::parse connects the 'b and 'h lifetimes together, so that the buffer must outlive the array of Header objects. If you want to get your buffer back, you also have to drop the array of Headers. Or so I believe.
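
Concretely, something like this is rejected even though res itself is gone (read_response_bytes is just a stand-in for however the bytes arrive):

```rust
let mut buf: Vec<u8> = read_response_bytes(); // hypothetical source of bytes
let mut headers = [httparse::EMPTY_HEADER; 16];
{
    let mut res = httparse::Response::new(&mut headers);
    let _ = res.parse(&buf); // unifies 'b (buf) with 'h (headers)
} // `res` is dropped here...
buf.clear(); // ERROR: `buf` is still borrowed,
println!("{:?}", headers[0].name); // ...because `headers` is still in use
```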

marcos687 commented 2 years ago

That's how httparse achieves zero-copy. You have to have an array of bytes containing the full message; the Response object then holds slices pointing into that initial buffer.

By doing this, it doesn't have to keep any inner state, and it can be faster than a non-zero-copy parser, at the cost of you having to allocate that buffer up front.
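
You can even check it: the parsed fields point straight back into the input buffer (a small illustration I put together, not from the docs):

```rust
let buf = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n";
let mut headers = [httparse::EMPTY_HEADER; 4];
let mut req = httparse::Request::new(&mut headers);
assert!(req.parse(buf).unwrap().is_complete());

// `req.path` is a &str that borrows `buf` itself; nothing was copied.
let path = req.path.unwrap();
assert!(buf.as_ptr_range().contains(&path.as_ptr()));
```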

Hawk777 commented 2 years ago

For me, the annoyance is less that a successful Response::parse locks up the buffer—fine, it zero-copies the headers, so obviously it has to do that—and more that a failed Response::parse also locks up the buffer. Of course it’s always possible to just add, in my consumer, the “find two CRLFs” logic, but that’s kind of hacky and adds knowledge of low-level HTTP encoding details to an otherwise high-level consumer—not having to know/implement the details of HTTP message encoding is kind of why we want httparse in the first place. So, to me, the obvious first thing to try was “call Response::parse in a loop until it succeeds, receiving some more bytes each time it fails”. But that’s not possible because even a failed Response::parse locks up the buffer (by tying 'b and 'h together, so that you have to drop the array of Headers before you can touch the buffer again).
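
i.e. this shape, which the borrow checker rejects (sketch; read_more stands in for the real I/O):

```rust
let mut buf: Vec<u8> = Vec::new();
let mut headers = [httparse::EMPTY_HEADER; 32];
// `res` is created once, outside the loop, so the parsed result
// survives after the loop ends...
let mut res = httparse::Response::new(&mut headers);
loop {
    match res.parse(&buf) {
        Ok(httparse::Status::Complete(_)) => break,
        Ok(httparse::Status::Partial) => {
            // ...but then this does not compile: the failed parse
            // above has already tied `buf` to `headers`/`res`, so
            // `buf` can't be borrowed mutably to receive more bytes.
            read_more(&mut buf); // `read_more` = stand-in for real I/O
        }
        Err(_) => break,
    }
}
```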

jkarneges commented 2 years ago

I use a circular buffer too, and I always wished httparse could parse from multiple source slices. It might be possible to do this in a way that avoids copying, by making each parsed field a pair of slices in case it wraps (method: Option<(&str, &str)>). However, this would make for a complex API, and it would also make the code slower in the most common scenario (i.e. the first frame received over a connection, which wouldn't be wrapped). That may not be worthwhile.

What I do as a workaround is re-align the circular buffer contents in place before passing the buffer to httparse. This way any extra copying occurs at most once per expected frame, as opposed to repeatedly copying the circular buffer into a temporary buffer for each parse attempt.
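
If your ring were a std VecDeque, the same idea is a single call - make_contiguous rotates the backing storage in place (no allocation) so the readable bytes become one slice (a sketch, not my actual code):

```rust
use std::collections::VecDeque;

// Re-align the ring in place, then hand httparse a single slice.
// The rotation copies at most once per frame, and is a no-op
// whenever the data didn't happen to wrap.
fn parse_request(ring: &mut VecDeque<u8>) -> httparse::Result<usize> {
    let buf: &[u8] = ring.make_contiguous();
    let mut headers = [httparse::EMPTY_HEADER; 32];
    let mut req = httparse::Request::new(&mut headers);
    req.parse(buf)
}
```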