hyperium / h2

HTTP 2.0 client & server implementation for Rust.

Sending a data frame results in transmission of more TCP segments than needed #711

Closed · xiaoyawei closed this issue 1 year ago

xiaoyawei commented 1 year ago

What is the issue?

I was using tonic (which depends on h2 for HTTP/2) for a bidi-streaming gRPC service. With some packet sniffing, I noticed that when one side sends around 1 KB of data to the other side through the gRPC stream (which is backed by an underlying HTTP/2 stream), the data is always transmitted as 2 TCP segments: the first carries the HTTP/2 data frame's head, which is 9 bytes, while the second carries the actual payload.

Since tonic (and many other libraries) disables Nagle's algorithm by default, this issue shows up in a relatively wide range of use cases. Our application keeps transmitting payloads of around 1 KB, so this behavior produces a lot of extra TCP segments, which is significant overhead and hurts our application's performance.

What is the root cause?

Yes, h2 is agnostic to the underlying transport protocol, but I assume most use cases are built on top of TCP; the problematic code is here:

https://github.com/hyperium/h2/blob/da38b1c49c39ccf7e2e0dca51ebf8505d6905597/src/codec/framed_write.rs#L243-L251

When sending a data frame with a payload larger than CHAIN_THRESHOLD (currently 256), the frame head (9 bytes) is written to the buffer while the entire payload is left in self.next; then, when data is being transmitted over the transport,

https://github.com/hyperium/h2/blob/da38b1c49c39ccf7e2e0dca51ebf8505d6905597/src/codec/framed_write.rs#L125-L135

the buffer is chained with the pending payload in self.next, and the result is finally written into the transport via

https://github.com/hyperium/h2/blob/da38b1c49c39ccf7e2e0dca51ebf8505d6905597/src/codec/framed_write.rs#L186-L186

Here buf is of type bytes::buf::Chain from the bytes crate, whose chunk() method is implemented as follows:

    fn chunk(&self) -> &[u8] {
        // Expose only the first buffer until it is fully consumed; the
        // chained payload never shows up in the same chunk as the head.
        if self.a.has_remaining() {
            self.a.chunk()
        } else {
            self.b.chunk()
        }
    }

Therefore the buffer, which contains only the data frame's head, is written to the transport (TCP in most use cases) first, and the payload is written separately afterwards. In scenarios where Nagle's algorithm is disabled, this leads to the data frame's head and payload being transmitted in separate TCP segments, which causes some overhead.
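
To make the effect concrete, here is a minimal, self-contained demonstration using only the bytes crate (the 9-byte head and ~1 KB payload mirror the scenario above):

    use bytes::{Buf, Bytes};

    fn main() {
        let head = Bytes::from_static(&[0u8; 9]); // stand-in for the frame head
        let payload = Bytes::from(vec![0u8; 1024]); // stand-in for the payload
        let chained = head.chain(payload);

        // A non-vectored write loop calls chunk() and sees only the head
        // first, so the transport receives a separate 9-byte write().
        assert_eq!(chained.chunk().len(), 9);
    }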

seanmonstar commented 1 year ago

It's a heuristic to reduce the amount of copying when the data frame content is large. If the IO transport supports vectored writes, h2 will use writev to send both pieces without copying.
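
For reference, a small sketch (again using only the bytes crate; sizes are illustrative) of how the chained buffer looks to a vectored writer:

    use bytes::{Buf, Bytes};
    use std::io::IoSlice;

    fn main() {
        let head = Bytes::from_static(&[0u8; 9]);
        let payload = Bytes::from(vec![0u8; 1024]);
        let chained = head.chain(payload);

        // chunks_vectored() exposes both pieces at once, so a single
        // writev syscall can submit the head and the payload together.
        let mut slices = [IoSlice::new(&[]); 4];
        let n = chained.chunks_vectored(&mut slices);
        assert_eq!(n, 2); // one IoSlice for the head, one for the payload
    }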

xiaoyawei commented 1 year ago

@seanmonstar

If the IO transport supports vectored writes, h2 will use writev to send both pieces without copying.

I checked the Linux manual (https://linux.die.net/man/2/writev) and it says "the data transfers performed by readv() and writev() are atomic: the data written by writev() is written as a single block that is not intermingled with output from writes in other processes"; however, it looks like writev over a TCP socket on Linux (Ubuntu 20.04 in my case) does not actually aggregate the 2 pieces before sending.

Would you be open to improving the heuristic? My proposal is:

  1. Increase CHAIN_THRESHOLD a little, to something like 1024.
  2. When chaining a payload, still copy the first X bytes into the buffer and chain the rest of the payload, where X is CHAIN_THRESHOLD - buf.remaining() (see the sketch below).

Either way, the current performance is not ideal for scenarios where large numbers of small payloads (around 1 KB) are being transmitted.
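
A rough sketch of proposal 2 (the function name and surrounding types are illustrative, not the actual h2 internals):

    use bytes::{Buf, Bytes, BytesMut};

    const CHAIN_THRESHOLD: usize = 1024; // proposal 1: raised from 256

    // Proposal 2: copy the first X bytes of the payload into the write
    // buffer so the frame head never travels alone; only the remainder
    // is chained. Returns the part that would be left in `next`, if any.
    fn split_for_chaining(buf: &mut BytesMut, mut payload: Bytes) -> Option<Bytes> {
        let copy_len = CHAIN_THRESHOLD
            .saturating_sub(buf.remaining())
            .min(payload.len());
        buf.extend_from_slice(&payload.split_to(copy_len));
        if payload.is_empty() {
            None
        } else {
            Some(payload)
        }
    }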

xiaoyawei commented 1 year ago

If you feel comfortable with my proposal, or can suggest an alternative, I'd be happy to implement it and submit a pull request.

seanmonstar commented 1 year ago

writev on a TCP socket will result in the data going together, unless the amount of data is too large for a single segment, of course. However, if using TLS on top, some of the transports don't implement the is_write_vectored() method.

I'm open to improving the throughput, certainly! I think that if vectored writes are supported, the threshold should stay small. But if they're not supported, it could definitely make sense to buffer up to a certain size.
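
Something like the following sketch, assuming the transport is a tokio AsyncWrite (the constants are illustrative):

    use tokio::io::AsyncWrite;

    // Pick the chaining threshold based on whether the transport can
    // actually submit multiple buffers in one syscall.
    fn chain_threshold<T: AsyncWrite>(io: &T) -> usize {
        if io.is_write_vectored() {
            256 // chaining is cheap: writev sends head + payload together
        } else {
            1024 // no real writev: copy small payloads to avoid split segments
        }
    }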

It'd help if there were some benchmarks that could show the improvement. hyper has some end-to-end benchmarks for different sizes and numbers of requests; that could be a good place to measure (or to add a case that can then be measured).

xiaoyawei commented 1 year ago

writev on a TCP socket will result in the data going together, unless the amount of data is too large for a single segment of course. However, if using TLS on top, some of the transports don't implement the is_write_vectored() method.

I didn't find much documentation about vectored writes. In my environment, the OS is Ubuntu 20.04.3 LTS; two pieces are written at once, one of 9 bytes and the other of around 1000 bytes; TLS is not used.

It's possible that I have some improper configuration, but after some research I couldn't find which Linux settings would control the behavior of writev over a TCP socket. Also keep in mind that whether TCP_NODELAY is set affects the behavior; in my case TCP_NODELAY is on.

I think that if vectored writes are supported, the threshold should stay small. But if they're not supported, it could definitely make sense to buffer up to a certain size.

Good suggestion, but it might not be easy to tell whether vectored IO is supported. For example, in tokio's TcpStream implementation (https://github.com/tokio-rs/tokio/blob/37bb47c4a2aff8913e536767645772f15650e6cd/tokio/src/net/tcp/stream.rs#L1348-L1350), TcpStream::is_write_vectored() always returns true no matter what.

It'd help if there were some benchmarks that could show the improvement.

I will certainly use benchmarks to measure the perf changes.

xiaoyawei commented 1 year ago

For more information: I tested TcpStream::write_vectored() from tokio in my test environment, and the write is indeed atomic (the whole payload is concatenated). So it looks like, when using tonic for gRPC, the observed extra TCP segment might be due to something else; I will keep this thread posted.
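
A minimal sketch of such a test (the peer address and sizes are illustrative; the traffic itself was inspected with a packet sniffer):

    use std::io::IoSlice;
    use tokio::io::AsyncWriteExt;
    use tokio::net::TcpStream;

    #[tokio::main]
    async fn main() -> std::io::Result<()> {
        let mut stream = TcpStream::connect("127.0.0.1:8080").await?; // hypothetical peer
        stream.set_nodelay(true)?; // matches the TCP_NODELAY-on scenario

        let head = [0u8; 9];
        let payload = [0u8; 1024];
        // A single vectored write hands both pieces to the kernel
        // together, so they can go out as one TCP segment.
        let bufs = [IoSlice::new(&head), IoSlice::new(&payload)];
        let n = stream.write_vectored(&bufs).await?;
        println!("wrote {n} bytes in one call");
        Ok(())
    }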

xiaoyawei commented 1 year ago

@seanmonstar

I think I've found (or at least gotten closer to) the root cause of my issue: tonic does not actually implement the vectored IO methods for its BoxIo, which is a wrapper of the underlying TCP stream (https://github.com/hyperium/tonic/blob/2325e3293b8a54f3412a8c9a5fcac064fa82db56/tonic/src/transport/service/io.rs#L52-L52)
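
For context, an AsyncWrite wrapper swallows vectored writes unless it explicitly forwards them. A minimal sketch of the forwarding such a wrapper needs (the wrapper type here is illustrative, not tonic's actual code):

    use std::io::{self, IoSlice};
    use std::pin::Pin;
    use std::task::{Context, Poll};
    use tokio::io::AsyncWrite;

    // Illustrative wrapper; tonic's real type wraps a boxed trait object.
    struct WrappedIo<T> {
        inner: T,
    }

    impl<T: AsyncWrite + Unpin> AsyncWrite for WrappedIo<T> {
        fn poll_write(
            mut self: Pin<&mut Self>,
            cx: &mut Context<'_>,
            buf: &[u8],
        ) -> Poll<io::Result<usize>> {
            Pin::new(&mut self.inner).poll_write(cx, buf)
        }

        fn poll_flush(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
            Pin::new(&mut self.inner).poll_flush(cx)
        }

        fn poll_shutdown(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
            Pin::new(&mut self.inner).poll_shutdown(cx)
        }

        // Without these two overrides, the defaults kick in and vectored
        // writes silently degrade to one buffer at a time.
        fn poll_write_vectored(
            mut self: Pin<&mut Self>,
            cx: &mut Context<'_>,
            bufs: &[IoSlice<'_>],
        ) -> Poll<io::Result<usize>> {
            Pin::new(&mut self.inner).poll_write_vectored(cx, bufs)
        }

        fn is_write_vectored(&self) -> bool {
            self.inner.is_write_vectored()
        }
    }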

So it looks like I can help with the following:

  1. Fix the vectored IO support in tonic
  2. Fix TcpStream::is_write_vectored() in tokio
  3. Improve the heuristics in h2 when vectored IO is not supported

Let me know what you think. :)

seanmonstar commented 1 year ago

I think 1 and 3 would be great. Didn't you say you've already tested that 2 works?

xiaoyawei commented 1 year ago

@seanmonstar

For 2, it looks like it always returns true (https://github.com/tokio-rs/tokio/blob/37bb47c4a2aff8913e536767645772f15650e6cd/tokio/src/net/tcp/stream.rs#L1348-L1350), but a correct implementation should depend on the underlying TcpStream.

Without 2, even if 1 and 3 are fixed, it won't help the non-vectored case, since h2 will assume vectored IO is available and use the default implementation of write_vectored(), which actually sends the buffers one at a time.
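
To the best of my reading, the default implementation in tokio's AsyncWrite (paraphrased below) forwards only the first non-empty buffer to poll_write, which is exactly why the head and the payload end up in separate writes:

    // Paraphrase of the provided method on tokio::io::AsyncWrite:
    fn poll_write_vectored(
        self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        bufs: &[IoSlice<'_>],
    ) -> Poll<io::Result<usize>> {
        // Only the first non-empty buffer is written per call, so each
        // piece reaches the socket in its own write().
        let buf = bufs.iter().find(|b| !b.is_empty()).map_or(&[][..], |b| &**b);
        self.poll_write(cx, buf)
    }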

seanmonstar commented 1 year ago

I believe it always returns true because it has a specialized poll_write_vectored, which uses writev on the socket.

1 and 3 seem higher priority.

xiaoyawei commented 1 year ago

@seanmonstar

Cool, I will do 1 and 3 first.

However, for 2, I looked through the codebase again, and it looks like its poll_write_vectored is implemented as follows (https://github.com/tokio-rs/tokio/blob/37bb47c4a2aff8913e536767645772f15650e6cd/tokio/src/io/poll_evented.rs#L218-L230):

    #[cfg(any(feature = "net", feature = "process"))]
    pub(crate) fn poll_write_vectored<'a>(
        &'a self,
        cx: &mut Context<'_>,
        bufs: &[io::IoSlice<'_>],
    ) -> Poll<io::Result<usize>>
    where
        &'a E: io::Write + 'a,
    {
        use std::io::Write;
        self.registration
            .poll_write_io(cx, || self.io.as_ref().unwrap().write_vectored(bufs))
    }

Here self.io is actually of type mio::net::TcpStream, whose write_vectored implementation is still based on the TcpStream from the Rust standard library.

At the end of the day, if an OS / device does not support vectored IO, I don't think tokio's TcpStream can do much to enable vectored writes without copying a lot of memory around. So I feel is_write_vectored still needs to be changed in tokio.

Let me know if I misunderstood anything, thanks! ;)