xiaoyawei closed this issue 1 year ago
It's a heuristic to reduce the amount of copying when the data frame content is large. If the IO transport can support vectored writes, h2 will use writev to send both pieces without copying.
@seanmonstar
If the IO transport can support vectored writes, h2 will use writev to send both pieces without copying.
I checked the Linux manual (https://linux.die.net/man/2/writev) and it says "the data transfers performed by readv() and writev() are atomic: the data written by writev() is written as a single block that is not intermingled with output from writes in other processes"; however, it looks like writev over a TCP socket on Linux (Ubuntu 20.04 in my case) does not really aggregate the two pieces before sending them.
Would you be open to improving the heuristic? My proposal is to increase CHAIN_THRESHOLD a little bit, to something like 1024, copy the first X bytes of the payload into the buffer, and chain the rest of the payload; here X is CHAIN_THRESHOLD - buf.remaining().
In any case, the current performance is not ideal for scenarios where large amounts of small payloads (around 1 KB) are being transmitted.
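To make the proposal concrete, here is a rough sketch of the copy-then-chain step. CHAIN_THRESHOLD is the real constant name from framed_write.rs, but the BytesMut buffer and the buffer_then_chain helper are simplified stand-ins for illustration, not h2's actual code:

```rust
use bytes::{Buf, BufMut, Bytes, BytesMut};

// h2's current value is 256; the proposal raises it to something like 1024.
const CHAIN_THRESHOLD: usize = 1024;

/// Hypothetical helper: after the 9-byte frame head has been written to `buf`,
/// copy the first `CHAIN_THRESHOLD - buf.remaining()` bytes of the payload into
/// `buf` as well, and only chain whatever is left over. Returns the chained
/// remainder, if any.
fn buffer_then_chain(buf: &mut BytesMut, mut payload: Bytes) -> Option<Bytes> {
    let x = CHAIN_THRESHOLD.saturating_sub(buf.remaining());
    if payload.len() <= x {
        // Small payload: copy everything; nothing is chained.
        buf.put_slice(&payload);
        None
    } else {
        // Large payload: copy the first `x` bytes so the frame head never
        // travels alone, then chain the (still zero-copy) remainder.
        buf.put_slice(&payload.split_to(x));
        Some(payload)
    }
}
```

With a non-vectored transport this keeps the head and at least the start of the payload in a single write, while large payloads still avoid a full copy.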
If you feel comfortable with my proposal or can suggest an alternative, I will be happy to implement it, submit a pull request, etc.
writev on a TCP socket will result in the data going together, unless the amount of data is too large for a single segment, of course. However, if using TLS on top, some of the transports don't implement the is_write_vectored() method.
I'm open to improving the throughput, certainly! I think that if vectored writes are supported, the threshold should stay small. But if it's not supported, it could definitely make sense to buffer up to a certain size.
It'd help if there were some benchmarks that could show the improvement. hyper has some end-to-end benchmarks for different sizes and amounts of requests; that could be a good place to measure (or to add a case that can then be measured).
writev on a TCP socket will result in the data going together, unless the amount of data is too large for a single segment of course. However, if using TLS on top, some of the transports don't implement the is_write_vectored() method.
I didn't find many documents about vectored writes. In my environment, the OS is Ubuntu 20.04.3 LTS; 2 packets are written at once, one of 9 bytes, the other of around 1000 bytes; TLS is not used.
It's possible that I have some improper configs, but after some research I didn't find which Linux settings would control the behavior of writev over a TCP socket. Also keep in mind that whether TCP_NODELAY is set affects the behavior as well; in my case TCP_NODELAY is on.
I think that if vectored writes are supported, the threshold should stay small. But if it's not supported, it could definitely make sense to buffer up to a certain size.
Good suggestion; but it might not be easy to tell whether vectored IO is supported or not. For example, in the implementation of TcpStream in tokio (https://github.com/tokio-rs/tokio/blob/37bb47c4a2aff8913e536767645772f15650e6cd/tokio/src/net/tcp/stream.rs#L1348-L1350), TcpStream::is_write_vectored() always returns true no matter what.
It'd help if there were some benchmarks that could show the improvement.
I will certainly measure perf changes through benchmarks.
For more information: I tested tokio's TcpStream::write_vectored() in my test environment, and the write is indeed atomic (the whole payload is concatenated); so it looks like the extra TCP segment observed when using tonic for gRPC might be due to something else. I will keep this thread posted.
@seanmonstar
I think I have found (or at least gotten closer to) the root cause of my issue: tonic does not really implement the vectored IO part for its BoxIo, which is a wrapper of the underlying TCP stream (https://github.com/hyperium/tonic/blob/2325e3293b8a54f3412a8c9a5fcac064fa82db56/tonic/src/transport/service/io.rs#L52-L52).
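The shape of the fix is a wrapper that forwards the vectored-write methods to the stream it wraps. A minimal sketch, assuming tokio's AsyncWrite trait; Wrapper is an illustrative stand-in rather than tonic's actual BoxIo:

```rust
use std::io::IoSlice;
use std::pin::Pin;
use std::task::{Context, Poll};
use tokio::io::AsyncWrite;

struct Wrapper<T> {
    inner: T,
}

impl<T: AsyncWrite + Unpin> AsyncWrite for Wrapper<T> {
    fn poll_write(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &[u8],
    ) -> Poll<std::io::Result<usize>> {
        Pin::new(&mut self.inner).poll_write(cx, buf)
    }

    // Without these two overrides, the wrapper silently falls back to the
    // trait's defaults: is_write_vectored() returns false and
    // poll_write_vectored() writes only the first non-empty buffer.
    fn poll_write_vectored(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        bufs: &[IoSlice<'_>],
    ) -> Poll<std::io::Result<usize>> {
        Pin::new(&mut self.inner).poll_write_vectored(cx, bufs)
    }

    fn is_write_vectored(&self) -> bool {
        // Report what the inner transport actually supports.
        self.inner.is_write_vectored()
    }

    fn poll_flush(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<std::io::Result<()>> {
        Pin::new(&mut self.inner).poll_flush(cx)
    }

    fn poll_shutdown(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<std::io::Result<()>> {
        Pin::new(&mut self.inner).poll_shutdown(cx)
    }
}
```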
So it looks like I can help with these things:
1. the vectored IO implementation for tonic's BoxIo
2. Tcp::is_write_vectored() in tokio
3. the buffering heuristic in h2 for the non-vectored case

Let me know what you think. :)
I think 1 and 3 would be great. Didn't you say you've already tested that 2 works?
@seanmonstar
For 2, it looks like is_write_vectored() always returns true (https://github.com/tokio-rs/tokio/blob/37bb47c4a2aff8913e536767645772f15650e6cd/tokio/src/net/tcp/stream.rs#L1348-L1350), but a correct implementation should depend on the underlying TcpStream.
Without 2, even if 1 and 3 are fixed, it won't help with the non-vectored case, since h2 will assume vectored IO is available and use the default implementation of write_vectored(), which actually sends each buffer one by one.
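For reference, the default poll_write_vectored provided by tokio's AsyncWrite trait behaves roughly like the helper below (paraphrased, not the verbatim source; the function name is made up): it writes only the first non-empty buffer per call, so the frame head and the payload go out in separate writes.

```rust
use std::io::IoSlice;
use std::pin::Pin;
use std::task::{Context, Poll};
use tokio::io::AsyncWrite;

// Paraphrase of the trait's default: a vectored write degenerates into a
// plain write of the first non-empty buffer.
fn default_poll_write_vectored<W: AsyncWrite + ?Sized>(
    writer: Pin<&mut W>,
    cx: &mut Context<'_>,
    bufs: &[IoSlice<'_>],
) -> Poll<std::io::Result<usize>> {
    let buf = bufs
        .iter()
        .find(|b| !b.is_empty())
        .map_or(&[][..], |b| &**b);
    writer.poll_write(cx, buf)
}
```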
I believe it always returns true because it has a specialized poll_write_vectored, which uses writev on the socket.
1 and 3 seem higher priority.
@seanmonstar
Cool, I will do 1 and 3 first.
However, for 2, I looked through the codebase again, and it looks like its poll_write_vectored is implemented as follows (https://github.com/tokio-rs/tokio/blob/37bb47c4a2aff8913e536767645772f15650e6cd/tokio/src/io/poll_evented.rs#L218-L230):
```rust
#[cfg(any(feature = "net", feature = "process"))]
pub(crate) fn poll_write_vectored<'a>(
    &'a self,
    cx: &mut Context<'_>,
    bufs: &[io::IoSlice<'_>],
) -> Poll<io::Result<usize>>
where
    &'a E: io::Write + 'a,
{
    use std::io::Write;

    // Forwards to the inner registered IO object's std::io::Write::write_vectored.
    self.registration
        .poll_write_io(cx, || self.io.as_ref().unwrap().write_vectored(bufs))
}
```
Here self.io is actually of type mio::net::TcpStream, whose write_vectored implementation is still based on the TcpStream from Rust's std.
At the end of the day, if an OS or device does not support vectored IO, I feel tokio's TcpStream cannot do much to enable vectored writes without shuffling a bulk of memory around. So I feel is_write_vectored still needs to be changed in tokio.
Let me know if I misunderstood anything, thanks! ;)
What is the issue?
I was using tonic (which depends on h2 for HTTP/2) for a bidi-streaming gRPC service. With some packet sniffing, I noticed that when one side is sending data of around 1 KB to the other side through the gRPC stream (which is based on an underlying HTTP/2 stream), the data is always transmitted as 2 TCP segments: the first one carries the HTTP/2 data frame's head, which is 9 bytes, while the second one carries the real payload.

Since tonic (and many other libraries) disables the Nagle algorithm by default, the above issue occurs in a relatively wide range of scenarios. For us, our application keeps transmitting data around 1 KB, and this issue leads to a lot of TCP segments being transmitted, which is a big overhead and affects our application's performance.

What is the root cause?
Yes, h2 is agnostic to the underlying transport protocol, but I assume that most use cases are built on top of TCP; the problematic code is here:
https://github.com/hyperium/h2/blob/da38b1c49c39ccf7e2e0dca51ebf8505d6905597/src/codec/framed_write.rs#L243-L251

When sending a data frame with a payload size over CHAIN_THRESHOLD (which is 256 now), the frame head (9 bytes) is written to the buffer, while the entire payload is left in self.next; then, when data is being transmitted over the transport,
https://github.com/hyperium/h2/blob/da38b1c49c39ccf7e2e0dca51ebf8505d6905597/src/codec/framed_write.rs#L125-L135
the buffer is chained with the next token, which is finally written into the transport via
https://github.com/hyperium/h2/blob/da38b1c49c39ccf7e2e0dca51ebf8505d6905597/src/codec/framed_write.rs#L186-L186
Here buf is of type bytes::buf::Chain from the bytes crate, whose chunk() method is implemented as follows.
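Paraphrasing that implementation (the Buf impl on bytes::buf::Chain<T, U>; not a verbatim copy of the crate's source):

```rust
// chunk() exposes only the first buffer until it is fully consumed, then the second.
fn chunk(&self) -> &[u8] {
    if self.a.has_remaining() {
        self.a.chunk()
    } else {
        self.b.chunk()
    }
}
```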
Therefore, the buffer, which only contains the data frame's head, is first written to the transport (which is TCP in most use cases), and the payload is then written separately. In scenarios where the Nagle algorithm is disabled, this leads to the data frame's head and payload being transmitted in separate TCP segments, which causes some overhead.
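A quick standalone way to observe that chunk() behavior (illustrative sizes; depends only on the bytes crate):

```rust
use bytes::{Buf, Bytes};

fn main() {
    // A 9-byte "frame head" chained with a ~1 KB "payload".
    let head = Bytes::from_static(&[0u8; 9]);
    let payload = Bytes::from(vec![0xabu8; 1000]);
    let chained = head.chain(payload);

    // Only the head is visible as the next chunk, so a non-vectored write
    // sends it on its own before the payload follows.
    assert_eq!(chained.chunk().len(), 9);
}
```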