hyperium / tonic

A native gRPC client & server implementation with async/await support.
https://docs.rs/tonic
MIT License

Implement Non-contiguous Memory Buffer to Optimize Data Transfer #1558

Open xiaoyawei opened 1 year ago

xiaoyawei commented 1 year ago

Feature Request: Implement Non-contiguous Memory Buffer to Optimize Data Transfer

Crates

Motivation

Many well-adopted gRPC use cases, such as Arrow Flight, utilize gRPC to transfer large data chunks. In essence, Arrow Flight employs the gRPC protocol for network data transfer with the following protobuf definition:

message FlightData {
  FlightDescriptor flight_descriptor = 1;
  bytes data_header = 2;
  bytes app_metadata = 3;
  bytes data_body = 1000;
}

Here, FlightDescriptor is another protobuf message defined by Arrow Flight, which is not detailed here for brevity. The data_body field typically holds large payloads, in contrast with the other three fields, which are relatively small. Arrow Flight is recognized for its high performance in transferring large data chunks, with specialized and optimized implementations in languages like C++ and Java.

In Rust, when using tonic for Arrow Flight, I observed a potential improvement area related to memory copying. The default codec in tonic utilizes bytes::BytesMut, which is essentially a contiguous memory chunk. Let's delve into the specifics:

Encoder

The default implementation copies the data_body directly into the buffer, leading to a memcpy over a large memory segment and potentially triggering memory reallocation.

Decoder

Upon polling a data chunk from the gRPC response body, the decoder directly transfers this data to the buffer, which again causes memory copying and reallocation.

Proposal

Non-contiguous memory buffer

Rather than relying on bytes::BytesMut in the codec, we should introduce a specialized buffer backed by non-contiguous memory. The C++ gRPC implementation has a SliceBuffer structure composed of a set of slices (contiguous memory segments). We can take inspiration from this and create a Rust counterpart:

pub struct SliceBuffer {
    // Contiguous tail used for small writes (tags, varints, small fields).
    active: BytesMut,
    // Total number of bytes across `active` and `slices`.
    len: usize,
    // Completed, immutable segments appended without copying.
    slices: VecDeque<bytes::Bytes>,
}

Here, SliceBuffer represents its contents as the concatenation of all slices followed by the active buffer. Additionally, we'd implement the necessary traits, including bytes::Buf and bytes::BufMut, to ensure seamless read/write operations.
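To make those semantics concrete, here is a minimal, self-contained sketch using only std types (Vec<u8> standing in for BytesMut/Bytes). The method names put_slice, append, and segments are illustrative, not part of any existing tonic API; the real buffer would implement bytes::Buf/bytes::BufMut instead of exposing ad hoc methods:

```rust
use std::collections::VecDeque;

/// Std-only sketch of the proposed non-contiguous buffer.
pub struct SliceBuffer {
    active: Vec<u8>,           // contiguous tail for small writes
    len: usize,                // total bytes across `active` + `slices`
    slices: VecDeque<Vec<u8>>, // large chunks appended without copying
}

impl SliceBuffer {
    pub fn new() -> Self {
        SliceBuffer { active: Vec::new(), len: 0, slices: VecDeque::new() }
    }

    /// Copy small data (tags, varints, small fields) into the active chunk.
    pub fn put_slice(&mut self, src: &[u8]) {
        self.active.extend_from_slice(src);
        self.len += src.len();
    }

    /// Move a large chunk in without copying its contents; the current
    /// `active` chunk is sealed off first so byte ordering is preserved.
    pub fn append(&mut self, chunk: Vec<u8>) {
        if !self.active.is_empty() {
            self.slices.push_back(std::mem::take(&mut self.active));
        }
        self.len += chunk.len();
        self.slices.push_back(chunk);
    }

    pub fn len(&self) -> usize {
        self.len
    }

    /// Number of contiguous segments (what `chunks_vectored` would expose).
    pub fn segments(&self) -> usize {
        self.slices.len() + if self.active.is_empty() { 0 } else { 1 }
    }
}
```

The key property is in append: the chunk is moved into the deque rather than copied, so a multi-megabyte data_body costs a pointer move instead of a memcpy.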

Encoder

The default protobuf encoder logic remains unchanged: bytes are read/written to the active field of the SliceBuffer. For specific scenarios like Arrow Flight, users would implement custom codecs. Using Arrow Flight as an example, the custom codec would directly append the data_body bytes of FlightData to the slices field of SliceBuffer, thus eliminating memory copying.

Furthermore, by implementing the chunks_vectored method for SliceBuffer, the underlying transport can directly send the SliceBuffer, capitalizing on vectored I/O and preventing data_body memory copying.
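As a sketch of the vectored-I/O point (the helper name write_segments is hypothetical), the standard library's IoSlice and write_vectored already express this pattern; on a TcpStream the call below would map to a single writev over all segments instead of one write per copy:

```rust
use std::collections::VecDeque;
use std::io::{IoSlice, Write};

/// Write a sequence of discontiguous segments with one vectored call, the
/// way `chunks_vectored` would let the transport hand `SliceBuffer`
/// segments to the OS without concatenating them first.
fn write_segments<W: Write>(
    sink: &mut W,
    segments: &VecDeque<Vec<u8>>,
) -> std::io::Result<usize> {
    let bufs: Vec<IoSlice<'_>> =
        segments.iter().map(|s| IoSlice::new(s)).collect();
    // Vec<u8> (used as the sink in the test below) has a vectored fast
    // path; a real transport would be a TcpStream issuing one writev.
    sink.write_vectored(&bufs)
}
```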

Decoder

When the decoder polls a data chunk from the gRPC response body, it appends it directly to the slices of SliceBuffer. Avoiding memory copying in situations with large data chunks can significantly boost performance.

Drawback

For use cases with smaller data volumes, memory copying isn't the primary performance concern. However, the proposed solution might introduce slight overhead in such scenarios compared to the direct usage of bytes::BytesMut. Even so, this minor overhead from SliceBuffer would likely not emerge as a new bottleneck.

Alternatives

One alternative is to place the SliceBuffer behind a feature flag, retaining tonic's default behavior. Based on the proposal, the encoder's default behavior remains unchanged unless a custom codec is specified, but the default decoder changes to avoid memory copying by inserting the slice directly into the SliceBuffer.

My benchmarks show decoder performance improvements in many scenarios, especially those mirroring real-world use cases. Adding a feature guard might confuse users and detract from codebase readability, so I recommend against it.

hzuo commented 11 months ago

Relevant precedent - in hyper 1.0, instead of collecting large response bodies into one contiguous BufMut, hyper uses BufList from http-body-util, which is a list of discontiguous buffers.

This is almost identical to the SliceBuffer proposed in this issue - BufList also impls Buf and BufMut, etc.

JonathanWilbur commented 8 months ago

@hzuo but do large fields in the message itself end up as separate buffers? In other words, if I have a field in a message containing 4MB of data, does it end up in one of these buffers, or does it still get copied / concatenated somewhere?

hzuo commented 8 months ago

@JonathanWilbur yes, at least at the hyper level the body is an opaque binary stream, so there's no way for them to preserve "field boundaries" - if you have a single big binary field it will be fragmented across multiple bufs in the BufList.

I think this is the right approach even if you made the body collector "protobuf-aware" such that it "knows" about protobuf-level field boundaries. Ultimately the data coming in from the network arrives in distinct packets/frames with fixed sizes, so a large binary blob like bytes data_body = 1000; will fundamentally be split across multiple chunks over the network.

I think the key point raised by this issue is that we want to combine the data fragments into one big allocation exactly once, after we've received the full field/message, rather than re-allocating a contiguous buffer over and over again due to growth.
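That "allocate exactly once" point can be sketched with a hypothetical helper (assemble is not a real tonic/hyper API): sum the chunk lengths after the full message has arrived, then reserve the final buffer in one shot:

```rust
/// Assemble received network chunks into one contiguous buffer, allocating
/// exactly once after the full message has arrived, instead of growing a
/// buffer (and re-copying its contents) as each chunk lands.
fn assemble(chunks: &[Vec<u8>]) -> Vec<u8> {
    let total: usize = chunks.iter().map(|c| c.len()).sum();
    let mut out = Vec::with_capacity(total); // single, exactly-sized allocation
    for chunk in chunks {
        out.extend_from_slice(chunk); // copies each chunk once, never reallocates
    }
    out
}
```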

Of course, the other desired property is that this single combined allocation should be zero-copy - it should be reusable without ever being copied again (i.e. during parsing, prost shouldn't copy the slice corresponding to data_body into a Vec<u8> in the FlightData struct). This should already be guaranteed as of prost 0.8.0, if prost is configured to use Bytes rather than Vec<u8>: https://github.com/tokio-rs/prost/pull/449 https://github.com/tokio-rs/prost/releases/tag/v0.8.0
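For reference, a build.rs sketch of that prost configuration (the proto paths are assumptions, and the exact tonic-build method name varies by version):

```rust
// build.rs — map every protobuf `bytes` field to `bytes::Bytes` so prost
// can slice field contents out of the input buffer without copying.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut config = prost_build::Config::new();
    config.bytes(["."]); // "." matches all bytes fields in all messages
    tonic_build::configure()
        .compile_with_config(config, &["proto/flight.proto"], &["proto/"])?;
    Ok(())
}
```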

alamb commented 3 days ago

BTW we got a report downstream in arrow-rs where the C++ implementation is significantly faster than the Rust implementation due to the copies in the tonic stack: https://github.com/apache/arrow-rs/issues/6670

We have some ways to improve things but to get parity we would likely need this change as well.