Fabian-Gruenbichler opened this issue 3 years ago (status: Open)
There is no sound way to do this, because we call `decode_eof` after every call to `poll_read`, but `decode_eof` may de-initialize any of the initialized bytes. Since the unsafe code in this case would be in tokio-util, it is up to us to guard against such non-unsafe code causing undefined behavior.
Note that de-initializing the memory is somewhat tricky due to the return value of `chunk_mut`, but it can be done since `decode_eof` has an `&mut BytesMut` to the buffer, so it can overwrite it with another `BytesMut`.
> Note that de-initializing the memory is somewhat tricky due to the return value of `chunk_mut`, but it can be done since `decode_eof` has an `&mut BytesMut` to the buffer, so it can overwrite it with another `BytesMut`.
So in `FramedImpl`'s `poll_next`, instead of calling `poll_read_buf` to re-create the `ReadBuf` from the buffer via `chunk_mut`, we'd want to re-use a `ReadBuf`, provided that neither `decode` nor `decode_eof` pulled `state.buffer` out from under us, and the `ReadBuf` still has remaining capacity > 0?
We can't check whether `decode` pulled the `state.buffer` out from under us. They might deallocate the `BytesMut` we give them, then allocate a new `BytesMut` and be lucky that it happens to have the same address as the one they just deallocated. This is not unlikely either: the allocator is probably quite likely to give us the same memory back if we just deallocated it.
Hmm. That means that the current implementation of `Framed` and `poll_read_buf` has a fairly high overhead for bigger frames when the `AsyncRead` source needs to initialize the passed `ReadBuf`. It might still be worth a mention in the docs somewhere, so that implementors (by not initializing `ReadBuf`s unless they absolutely have to) and users (by using smaller frame sizes when needed) can avoid this? The `ReadBuf` docs make `initialize_unfilled()` sound cheap, and the `Framed`/`poll_read_buf` implementations also don't warn that large frame sizes / buffers can be this problematic (again, this is not about initializing the big frame/buffer once, as would be expected, but over and over for each partial read!).
Yes, but there isn't really anything we can do about it. Documentation is always nice.
Having just spent a few days tracking this down, this definitely seems like a footgun worth fixing. Because I had a `Compat` wrapper, and `Compat` requires initialization of buffers (so it can pass a `&mut [u8]` to a `futures::io::AsyncRead`), I was seeing a 5x slowdown due to `memset` until I added buffering! The problem is more general, but this subtle interaction between `Compat` and `Framed` felt particularly bad.
i.e. roughly (with some simplifications):

```
Framed::poll_next
 -> calls FramedImpl::poll_next
  -> calls poll_read_buf
   -> calls Compat::poll_read (tokio::io::AsyncRead)
    -> uses buf.initialize_unfilled
    -> calls (inner)::poll_read (futures_io::AsyncRead::poll_read)
```
I've created a minimised reproduction at https://github.com/aidanhs/tokio-framed-slow/blob/master/src/main.rs that writes 1 million 2MB structs over a tokio duplex (used as an in-memory pipe), with an adapter on the read end of the pipe that limits each read to 32 bytes. There are two scenarios:

1. `Framed` directly on the read end of the pipe
2. a `tokio::io::BufReader` on the read end of the pipe (https://github.com/aidanhs/tokio-framed-slow/blob/master/src/main.rs#L143), then a `Framed` wrapper around that `BufReader`

The output of this is:
```
t1 14.395206912s
t2 8.960023102s
```
i.e. it's 50% slower unless you know to add a buffer when sending large frames with `Compat`. Like I say though, this is a small example; I was seeing 5x slowdowns in my larger project.
On the solution: not sure what was possible when this issue was raised, but looking at the code now I think there's a better approach than docs, though it requires some API changes. In short, replace the usage of `BytesMut` and `BufMut` in `FramedRead` and `poll_read_buf` with `tokio::ReadBuf`. Because a `tokio::ReadBuf` does keep track of initialised parts, it can avoid the re-initialization.
> On the solution: not sure what was possible when this issue was raised, but looking at the code now I think there's a better approach than docs, though it requires some API changes
Well, as you say, it requires some API changes, which would be a breaking change. I suppose we could do that, but we really do try to avoid them. As for your suggestion, that particular suggestion isn't going to work: a `ReadBuf` doesn't have the APIs necessary to replace `BytesMut`.
To further the discussion and try out what I was thinking of, I created a very hacky PoC of changes to the tokio decoder API at https://github.com/tokio-rs/tokio/compare/master...aidanhs:aphs-faster-codec-api. This is buildable with zero changes in end-user applications that use `Framed` (though it does require changes in decoder implementations themselves). For demo purposes it just uses the convenient `VecWithInitialized` type, since that already has built-in conversion to `ReadBuf`. It also comments out large portions of tokio just to get a running demo.
The improvement (`t1` is without buf, `t2` is with):

```
$ git clone https://github.com/aidanhs/tokio-framed-slow.git
$ cd tokio-framed-slow
$ cargo run --release # before changes
[...]
t1 14.497464619s
t2 8.699344611s
$ git checkout aphs-tokio-codec-hacks
$ cargo run --release # after changes
[...]
t1 8.635227421s
t2 9.413384816s
```
i.e. using an appropriate type that can generate `ReadBuf`s gets unbuffered reads to the same order of magnitude as buffered, as we'd hope, given `Framed` has an internal buffer. As for the slowdown in buffered operation: it's probably due to an inefficient memcpy-per-frame that got added (and would be removed in a more complete implementation by not using `VecWithInitialized`).
> As for your suggestion, that particular suggestion isn't going to work: a `ReadBuf` doesn't have the APIs necessary to replace `BytesMut`.
I didn't understand this comment until I got to implementing it myself, so to elaborate for other readers: `Decoder` implementations are expected to do some amount of buffer management (e.g. reserving new space), but a `ReadBuf` sits on top of a slice, so it has no owned memory to resize.
> Well, as you say, it requires some API changes, which would be a breaking change. I suppose we could do that, but we really do try to avoid them.
I really do sympathise, but right now the API design makes it extremely hard to avoid unpredictable performance degradation when using `FramedRead` and `Compat`. You can try to use a `tokio::io::BufReader`, but if your buffer is too small then the `BufReader` gets bypassed and you're back to the `memset` overhead. So a user would actually need a hand-rolled bufreader that will never be bypassed, just to act as a mediator.
Using `FramedRead` with a big frame size can lead to high overhead / low performance, because it does not keep track of already-initialized, reused buffer/`ReadBuf` parts. If the `AsyncRead` implementation at the end of the stack needs to initialize the `ReadBuf` but only fills it partially, a lot of time can be spent needlessly re-initializing the already-initialized, unfilled part of the buffer.

For example, this affected hyper + h2 + tokio-openssl until the latter worked around it in version 0.6.1 (this workaround is only possible for `AsyncRead` implementations that don't actually require initializing the `ReadBuf`, so it's not a solution in general).

An example repo is available here: https://github.com/Fabian-Gruenbichler/hyper-regression (now closed). hyper issue: https://github.com/hyperium/hyper/issues/2389
IMHO it might be advisable to add a warning to the `ReadBuf::uninit` docs about the potential performance implications of repeatedly calling it with (potentially) the same underlying buffer without keeping track of its initialization state, like tokio-util does. Ideally tokio(-util) would find a way to actually keep track somewhere (and call `assume_init` accordingly).