tokio-rs / tokio

A runtime for writing reliable asynchronous applications with Rust. Provides I/O, networking, scheduling, timers, ...
https://tokio.rs
MIT License

FramedRead / poll_read_buf don't keep track of initialized buffer parts #3417

Open Fabian-Gruenbichler opened 3 years ago

Fabian-Gruenbichler commented 3 years ago

Using FramedRead with a big frame size can lead to high overhead and low performance, because FramedRead does not keep track of already-initialized, reused parts of the buffer/ReadBuf. If the AsyncRead implementation at the bottom of the stack needs to initialize the ReadBuf but only fills it partially, a lot of time can be spent needlessly re-initializing the already-initialized, unfilled part of the buffer.

For example, this affected hyper + h2 + tokio-openssl until the latter worked around it in version 0.6.1 (this workaround is only possible for AsyncRead implementations that don't actually require initializing the ReadBuf, so it is not a general solution).

An example repo is available here: https://github.com/Fabian-Gruenbichler/hyper-regression
The related (now closed) hyper issue: https://github.com/hyperium/hyper/issues/2389

IMHO it might be advisable to add a warning to the ReadBuf::uninit docs about the potential performance implications of repeatedly calling it with (potentially) the same underlying buffer without keeping track of its initialization state, as tokio-util does. Ideally, tokio(-util) would find a way to actually keep track somewhere (and call assume_init accordingly).
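The cost being described can be simulated without tokio at all: on each poll, the reader zero-initializes the entire unfilled tail of the frame buffer, then fills only a small chunk of it. A std-only sketch that just counts the wasted zeroing (the buffer sizes here are illustrative, not taken from the issue):

```rust
// Simulate the per-poll behavior described above: nothing remembers that the
// unfilled tail was already initialized, so every poll re-zeroes all of it.
fn bytes_zeroed(frame_len: usize, chunk: usize) -> u64 {
    let mut filled = 0;
    let mut zeroed: u64 = 0;
    while filled < frame_len {
        zeroed += (frame_len - filled) as u64; // re-initialize the whole tail...
        filled += chunk.min(frame_len - filled); // ...then fill only `chunk` bytes
    }
    zeroed
}

fn main() {
    let frame = 64 * 1024; // frame size (illustrative)
    let chunk = 64; // bytes the underlying source yields per poll
    let zeroed = bytes_zeroed(frame, chunk);
    // With initialization tracking (what ReadBuf::assume_init enables),
    // the zeroing would instead be a one-time cost of `frame` bytes.
    println!(
        "read {} bytes, zeroed {} bytes ({}x overhead)",
        frame,
        zeroed,
        zeroed / frame as u64
    );
}
```

For a 64 KiB frame filled 64 bytes at a time, the naive scheme zeroes roughly 512 times more bytes than it reads, which is the overhead the issue is about.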

Darksonn commented 3 years ago

There is no sound way to do this, because we call decode_eof after every call to poll_read, and decode_eof may de-initialize any of the initialized bytes. Since the unsafe code in this case would live in tokio-util, it is on us to guard against such non-unsafe code causing undefined behavior.

Note that de-initializing the memory is somewhat tricky due to the return value of chunk_mut, but it can be done since decode_eof has an &mut BytesMut to the buffer, so it can overwrite it with another BytesMut.
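Concretely, a completely safe decode_eof body can swap the buffer wholesale, which invalidates any initialization bookkeeping kept for the old allocation. A std-only sketch with Vec<u8> standing in for BytesMut (decode_eof_swap is a hypothetical decoder body, not tokio-util API):

```rust
// Hypothetical decode_eof body: safe code that, while holding &mut to the
// read buffer, replaces it with a brand-new allocation. Any "bytes 0..N are
// initialized" record kept for the old allocation is now stale.
fn decode_eof_swap(buf: &mut Vec<u8>) {
    *buf = Vec::with_capacity(4096); // fresh capacity, nothing initialized
}

fn main() {
    let mut buf = vec![1u8; 4096]; // caller believes 4096 bytes are initialized
    decode_eof_swap(&mut buf);
    assert_eq!(buf.len(), 0); // the initialized contents are gone
    assert!(buf.capacity() >= 4096);
}
```

If tokio-util had called assume_init based on the old allocation, reading into the new one would be undefined behavior, which is why the safe-but-hostile case has to be guarded against.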

Fabian-Gruenbichler commented 3 years ago

Note that de-initializing the memory is somewhat tricky due to the return value of chunk_mut, but it can be done since decode_eof has an &mut BytesMut to the buffer, so it can overwrite it with another BytesMut.

So in FramedImpl's poll_next, instead of calling poll_read_buf to re-create a ReadBuf from the buffer via chunk_mut, we'd want to re-use a ReadBuf, provided that neither decode nor decode_eof pulled state.buffer out from under us and the ReadBuf still has remaining capacity > 0?

Darksonn commented 3 years ago

We can't check whether decode pulled state.buffer out from under us. They might deallocate the BytesMut we give them, then allocate a new BytesMut that happens to have the same address as the one they just deallocated. This is not unlikely either: the allocator is quite likely to give us the same memory back if we just deallocated it.
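This is easy to see with a plain Vec: after a free, an allocation of the same size may land at exactly the same address, so comparing buffer addresses before and after decode proves nothing. A std-only sketch (whether the address is actually reused is allocator-dependent, so the result is printed rather than asserted):

```rust
// Demonstrate why pointer identity is not a sound "same buffer" check:
// deallocate, reallocate the same size, and compare addresses.
fn probe(size: usize) -> (usize, usize) {
    let old = vec![0u8; size];
    let old_addr = old.as_ptr() as usize;
    drop(old); // deallocate
    let new = vec![0u8; size]; // the allocator may hand the same block back
    (old_addr, new.as_ptr() as usize)
}

fn main() {
    let (old_addr, new_addr) = probe(4096);
    // Equality here is common in practice, but even when it holds, the two
    // Vecs are distinct objects: same address does not imply the same
    // (still-initialized) buffer.
    println!(
        "old: {:#x}, new: {:#x}, reused: {}",
        old_addr,
        new_addr,
        old_addr == new_addr
    );
}
```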

Fabian-Gruenbichler commented 3 years ago

Hmm. That means the current implementation of Framed and poll_read_buf has a fairly high overhead for bigger frames when the AsyncRead source needs to initialize the passed ReadBuf. It might still be worth a mention in the docs somewhere, so that implementors (by not initializing ReadBufs unless they absolutely have to) and users (by using smaller frame sizes where needed) can avoid this? The ReadBuf docs make initialize_unfilled() sound cheap, and the Framed/poll_read_buf docs also don't warn that large frame sizes/buffers can be this problematic (again, this is not about initializing the big frame/buffer once, as would be expected, but over and over for each partial read!).

Darksonn commented 3 years ago

Yes, but there isn't really anything we can do about it. Documentation is always nice.

aidanhs commented 2 years ago

Having just spent a few days tracking this down, this definitely seems like a footgun worth fixing. Because I had a Compat wrapper, and Compat requires initialization of buffers (so it can pass a &mut [u8] to a futures::io::AsyncRead), I was seeing a 5x slowdown due to memset until I added buffering! The problem is more general, but this subtle interaction between Compat and Framed felt particularly bad.

i.e. roughly (with some simplifications):

Framed::poll_next
 -> calls FramedImpl::poll_next
 -> calls poll_read_buf
 -> calls Compat::poll_read (tokio::io::AsyncRead)
   -> uses buf.initialize_unfilled
   -> calls (inner)::poll_read (futures_io::AsyncRead::poll_read)

I've created a minimised reproduction at https://github.com/aidanhs/tokio-framed-slow/blob/master/src/main.rs that writes 1 million 2MB structs over a tokio duplex (used as an in-memory pipe), with an adapter on the read end of the pipe that limits each read to 32 bytes. There are two scenarios:

  1. framed wrappers around each end of the pipe
  2. framed wrapper around the write end of the pipe, a tokio::io::BufReader on the read end of the pipe https://github.com/aidanhs/tokio-framed-slow/blob/master/src/main.rs#L143, then framed wrapper around that BufReader

The output of this is:

t1 14.395206912s
t2 8.960023102s

i.e. it's about 60% slower unless you know to add a buffer when sending large frames with Compat. Like I say though, this is a small example; I was seeing 5x slowdowns in my larger project.

On the solution: not sure what was possible when this issue was raised, but looking at the code now I think there's a better approach than docs, though it requires some API changes. In short: replace the usage of BytesMut and BufMut in FramedRead and poll_read_buf with tokio::io::ReadBuf. Because a ReadBuf does keep track of its initialized portion, it can avoid the re-initialization.

Darksonn commented 2 years ago

On the solution: not sure what was possible when this issue was raised, but looking at the code now I think there's a better approach than docs, though it requires some API changes

Well, as you say, it requires some API changes, which would be a breaking change. I suppose we could do that, but we really do try to avoid breaking changes. As for your suggestion, that particular suggestion isn't going to work: a ReadBuf doesn't have the APIs necessary to replace BytesMut.

aidanhs commented 2 years ago

To further the discussion and try out what I was thinking of, I created a very hacky PoC of changes to the tokio decoder API at https://github.com/tokio-rs/tokio/compare/master...aidanhs:aphs-faster-codec-api. This is buildable with zero changes in end-user applications that use Framed (though it does require changes in decoder implementations themselves). For demo purposes it just uses the convenient VecWithInitialized type, since that already has built-in conversion to ReadBuf. It also comments out large portions of tokio just to get a running demo.

The improvement (t1 is without buf, t2 is with):

$ git clone https://github.com/aidanhs/tokio-framed-slow.git
$ cd tokio-framed-slow
$ cargo run --release # before changes
[...]
t1 14.497464619s
t2 8.699344611s
$ git checkout aphs-tokio-codec-hacks
$ cargo run --release # after changes
[...]
t1 8.635227421s
t2 9.413384816s

i.e. using an appropriate type that can generate ReadBufs gets unbuffered reads to the same order of magnitude as buffered reads, as we'd hope, given that Framed has an internal buffer. As for the slowdown in the buffered case: it's probably due to an inefficient memcpy-per-frame that got added (and it would be removed in a more complete implementation by not using VecWithInitialized).

As for your suggestion, that particular suggestion isn't going to work - a ReadBuf doesn't have the APIs necessary to replace BytesMut.

I didn't understand this comment until I got to implementing it myself, so to elaborate for other readers: Decoder implementations are expected to do some amount of buffer management (e.g. reserving new space), but a ReadBuf sits on top of a slice, so it has no owned memory to resize.
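The mismatch can be sketched in std-only terms: a ReadBuf-style view borrows fixed-size storage and can only track a fill cursor, while a decoder needs an owned, growable buffer. SliceBuf below is a hypothetical stand-in for tokio::io::ReadBuf, not its real API:

```rust
/// Hypothetical stand-in for tokio::io::ReadBuf: a view over borrowed
/// storage plus a fill cursor. It can track how much is filled and
/// initialized, but it cannot grow.
struct SliceBuf<'a> {
    storage: &'a mut [u8],
    filled: usize,
}

impl<'a> SliceBuf<'a> {
    fn remaining(&self) -> usize {
        self.storage.len() - self.filled
    }
    // Note: there is no `reserve` here. The backing slice is borrowed,
    // fixed-size memory, so there is nothing to reallocate.
}

fn main() {
    // A decoder that needs a bigger frame can grow a BytesMut/Vec it owns...
    let mut owned: Vec<u8> = Vec::new();
    owned.reserve(8192);
    assert!(owned.capacity() >= 8192);

    // ...but a slice-backed buffer is stuck with what it was handed.
    let mut backing = [0u8; 64];
    let buf = SliceBuf { storage: &mut backing, filled: 0 };
    assert_eq!(buf.remaining(), 64); // fixed; a decoder cannot reserve more
}
```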

Well, as you say, it requires some API changes, which would be a breaking change. I supposed we could do that, but we really do try to avoid them.

I really do sympathise, but right now the API design makes it extremely hard to avoid unpredictable performance degradation when using FramedRead and Compat. You can try to use a tokio::io::BufReader, but if its internal buffer is smaller than the read being issued, the BufReader gets bypassed and you're back to the memset overhead. So a user would actually need a hand-rolled BufReader that is never bypassed, just to act as a mediator.
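The "never bypassed" mediator idea can be sketched in synchronous terms with std::io::Read (an async version would implement tokio's AsyncRead the same way; AlwaysBuffered is a hypothetical name, not an existing type): every read is served from a small internal buffer that is initialized exactly once, so the inner reader never sees the caller's large buffer, no matter how big the read request is.

```rust
use std::io::{self, Read};

/// Hypothetical wrapper: unlike a standard BufReader, it has no bypass path.
/// Every read is routed through the small internal buffer, which is
/// initialized once at construction and reused forever.
struct AlwaysBuffered<R> {
    inner: R,
    buf: Vec<u8>, // initialized once, reused for every refill
    pos: usize,
    len: usize,
}

impl<R: Read> AlwaysBuffered<R> {
    fn new(inner: R, cap: usize) -> Self {
        Self { inner, buf: vec![0; cap], pos: 0, len: 0 }
    }
}

impl<R: Read> Read for AlwaysBuffered<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        if self.pos == self.len {
            // Refill from the inner reader into the internal buffer only,
            // regardless of how large `out` is (this is the "no bypass" part).
            self.len = self.inner.read(&mut self.buf)?;
            self.pos = 0;
        }
        let n = out.len().min(self.len - self.pos);
        out[..n].copy_from_slice(&self.buf[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}

fn main() -> io::Result<()> {
    let data = vec![7u8; 100_000];
    let mut r = AlwaysBuffered::new(&data[..], 8192);
    let mut out = vec![0u8; 100_000]; // large caller buffer: still no bypass
    let mut total = 0;
    while total < out.len() {
        let n = r.read(&mut out[total..])?;
        if n == 0 {
            break;
        }
        total += n;
    }
    assert_eq!(total, 100_000);
    assert_eq!(out, data);
    Ok(())
}
```

The trade-off is an extra memcpy per read, which is exactly the cost being weighed in this thread against re-initializing the whole unfilled region on every poll.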