Open - xiaoyawei opened this issue 1 year ago
Relevant precedent: in hyper 1.0, instead of collecting large response bodies into one contiguous BufMut, it uses BufList from http-body-util, which is a list of discontiguous buffers. This is almost identical to the SliceBuffer proposed in this issue; BufList also impls Buf and BufMut, etc.
@hzuo but do large fields in the message itself end up as separate buffers? In other words, if I have a field in a message containing 4MB of data, does it end up in one of these buffers, or does it still get copied / concatenated somewhere?
@JonathanWilbur yes, at least at the hyper level the body is an opaque binary stream, so there's no way for them to preserve "field boundaries" - if you have a single big binary field it will be fragmented across multiple bufs in the BufList.
I think this is the right approach even if you made the body collector "protobuf-aware" s.t. it "knows" about protobuf-level field boundaries. Ultimately the data coming in from the network arrives in distinct packets/frames that have fixed sizes, so transferring a large binary blob like bytes data_body = 1000;
will fundamentally be split across multiple chunks over the network.
I think the key point raised by this issue is that we want to combine the data fragments into one big allocation exactly once, after we've received the full field/message, rather than re-allocating a contiguous buffer over and over again due to growth.
Ofc the other desired property is that this single combined allocation should be zero-copy - it should be able to be reused without ever being copied again (i.e. during parsing, prost shouldn't copy the slice corresponding to data_body into a Vec<u8> in the FlightData struct). This should already be guaranteed as of prost 0.8.0, if prost is configured to use Bytes rather than Vec<u8>:
https://github.com/tokio-rs/prost/pull/449
https://github.com/tokio-rs/prost/releases/tag/v0.8.0
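For reference, enabling that looks roughly like the following build.rs (driving prost-build directly; the proto paths below are placeholders, and IIRC tonic-build can also be handed a custom prost_build::Config):

```rust
// build.rs -- illustrative sketch; adjust the proto paths for your project.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut config = prost_build::Config::new();
    // Generate `bytes::Bytes` instead of `Vec<u8>` for all `bytes` fields
    // ("." matches every message), so prost >= 0.8 can slice the input
    // buffer instead of copying it during decode.
    config.bytes(["."]);
    config.compile_protos(&["proto/flight.proto"], &["proto"])?;
    Ok(())
}
```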
BTW we got a report downstream in arrow-rs where the C++ implementation is significantly faster than the Rust implementation due to the copies in the tonic stack: https://github.com/apache/arrow-rs/issues/6670
We have some ways to improve things but to get parity we would likely need this change as well.
Feature Request: Implement Non-contiguous Memory Buffer to Optimize Data Transfer
Crates
tonic
tonic-build
Motivation
Many well-adopted gRPC use cases, such as Arrow Flight, utilize gRPC to transfer large data chunks. In essence, Arrow Flight employs the gRPC protocol for network data transfer with the following protobuf definition:
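```proto
// FlightData, as defined in Arrow Flight's Flight.proto:
message FlightData {
  FlightDescriptor flight_descriptor = 1;
  bytes data_header = 2;
  bytes app_metadata = 3;
  bytes data_body = 1000;
}
```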
Here, FlightDescriptor is another protobuf message defined by Arrow Flight, not detailed here for brevity. The data_body field typically holds large payloads, while the other three fields are relatively small. Arrow Flight is recognized for its high performance in transferring large data chunks, with specialized and optimized implementations in languages like C++ and Java.
In Rust, when using tonic for Arrow Flight, I observed a potential improvement area related to memory copying. The default codec in tonic uses bytes::BytesMut, essentially a contiguous memory chunk. Let's delve into the specifics:
Encoder
The default implementation directly copies the data_body into the buffer, leading to a memcpy over a large memory segment and triggering memory reallocation.
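As a rough sketch of what that contiguous path amounts to (not the actual tonic code; the helper name here is made up):

```rust
use bytes::BytesMut;
use prost::Message;

// The whole message, including a multi-megabyte `data_body`, is serialized
// into one contiguous `BytesMut`, so the payload is copied at least once more.
fn encode_contiguous<M: Message>(msg: &M) -> BytesMut {
    let mut buf = BytesMut::with_capacity(msg.encoded_len());
    msg.encode(&mut buf)
        .expect("BytesMut grows on demand, so encoding cannot run out of space");
    buf
}
```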
Decoder
Upon polling a data chunk from the gRPC response body, the decoder directly transfers this data to the buffer, which again causes memory copying and reallocation.
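Again only as an illustration (invented helper name), the accumulation step behaves roughly like:

```rust
use bytes::{BufMut, Bytes, BytesMut};

// Every frame polled from the HTTP/2 body is copied into a single growing
// `BytesMut`; whenever capacity runs out, `BytesMut` reallocates and copies
// everything it already holds.
fn accumulate_contiguous(buf: &mut BytesMut, frame: Bytes) {
    buf.put(frame);
}
```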
Proposal
Non-contiguous memory buffer
Rather than relying on bytes::BytesMut in the codec, we should introduce a specialized buffer backed by non-contiguous memory. The C++ gRPC implementation has a SliceBuffer structure comprising a set of slices (contiguous memory segments). We can take inspiration from this and create a Rust counterpart:
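A minimal sketch of such a type (the field names slices and active are referenced below; the concrete containers are just one possible choice):

```rust
use bytes::{Bytes, BytesMut};
use std::collections::VecDeque;

/// A buffer whose logical contents are the concatenation of `slices`
/// followed by whatever has been written into `active` so far.
pub struct SliceBuffer {
    /// Completed, immutable slices, in order.
    slices: VecDeque<Bytes>,
    /// The slice currently being written to by the default encoder path.
    active: BytesMut,
}
```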
Here, SliceBuffer presents its bytes as the concatenation of all slices followed by the active slice. Additionally, we'd implement the necessary traits, including bytes::Buf and bytes::BufMut, to ensure seamless read/write operations.
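For example, the read side could be sketched like this (building on the struct above and assuming empty slices are never stored):

```rust
use bytes::Buf;

impl Buf for SliceBuffer {
    fn remaining(&self) -> usize {
        self.slices.iter().map(|s| s.len()).sum::<usize>() + self.active.len()
    }

    // Completed slices are read first, then the active slice.
    fn chunk(&self) -> &[u8] {
        match self.slices.front() {
            Some(front) => front,
            None => &self.active,
        }
    }

    fn advance(&mut self, mut cnt: usize) {
        while cnt > 0 {
            match self.slices.front_mut() {
                Some(front) if front.len() > cnt => {
                    front.advance(cnt); // `Bytes::advance` is O(1), no copy
                    return;
                }
                Some(front) => {
                    cnt -= front.len();
                    self.slices.pop_front();
                }
                None => {
                    self.active.advance(cnt);
                    return;
                }
            }
        }
    }
}
```

A bytes::BufMut impl would mirror this by writing into active.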
Encoder
The default protobuf encoder logic remains unchanged: bytes are read/written to the active field of the SliceBuffer. For specific scenarios like Arrow Flight, users would implement custom codecs. Using Arrow Flight as an example, the custom codec would append the data_body bytes of FlightData directly to the slices field of SliceBuffer, thus eliminating memory copying.
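A rough sketch of that idea (this is not a full tonic Codec; it assumes data_body is generated as bytes::Bytes via the prost option mentioned above, and that SliceBuffer looks like the earlier sketch):

```rust
use prost::encoding::{encode_key, encode_varint, WireType};
use prost::Message;

// Serialize the small fields with prost as usual, then attach the large
// payload as its own slice instead of copying it into the contiguous buffer.
fn encode_flight_data(mut msg: FlightData, out: &mut SliceBuffer) {
    // Detach the payload so prost skips it (empty `bytes` fields are omitted).
    let body = std::mem::take(&mut msg.data_body);

    // The remaining small fields go into the active slice.
    msg.encode(&mut out.active).expect("BytesMut grows on demand");

    // Write the tag and length prefix for `data_body` (field 1000) by hand.
    encode_key(1000, WireType::LengthDelimited, &mut out.active);
    encode_varint(body.len() as u64, &mut out.active);

    // Seal the slice written so far, then append the payload itself:
    // a refcount bump on `Bytes`, not a memcpy.
    let prefix = out.active.split().freeze();
    out.slices.push_back(prefix);
    out.slices.push_back(body);
}
```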
Furthermore, by implementing the chunks_vectored method for SliceBuffer, the underlying transport can send the SliceBuffer directly, capitalizing on vectored I/O and preventing data_body memory copying.
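Continuing the impl Buf sketch from above, the override could look roughly like this:

```rust
use std::io::IoSlice;

impl Buf for SliceBuffer {
    // ...`remaining`, `chunk`, and `advance` as sketched earlier...

    // Hand the transport every slice at once, so it can issue a single
    // vectored write instead of flattening the buffer into one allocation.
    fn chunks_vectored<'a>(&'a self, dst: &mut [IoSlice<'a>]) -> usize {
        let mut n = 0;
        for chunk in self
            .slices
            .iter()
            .map(|s| &s[..])
            .chain(std::iter::once(&self.active[..]))
        {
            if n == dst.len() {
                break;
            }
            if !chunk.is_empty() {
                dst[n] = IoSlice::new(chunk);
                n += 1;
            }
        }
        n
    }
}
```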
Decoder
When the decoder polls a data chunk from the gRPC response body, it appends it directly to the slices of the SliceBuffer. Avoiding memory copying in situations with large data chunks can significantly boost performance.
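Sketching the decode side under the same assumptions (data_body generated as bytes::Bytes, and SliceBuffer implementing Buf as above):

```rust
use bytes::{Bytes, BytesMut};
use prost::Message;
use std::collections::VecDeque;

// Each polled frame becomes its own slice: O(1), no per-frame memcpy and no
// grow-and-copy of a contiguous accumulation buffer.
fn push_frame(buf: &mut SliceBuffer, frame: Bytes) {
    buf.slices.push_back(frame);
}

// Once the full message has arrived, prost decodes straight from the
// non-contiguous buffer through its `Buf` impl.
fn decode_message(frames: VecDeque<Bytes>) -> Result<FlightData, prost::DecodeError> {
    let buf = SliceBuffer {
        slices: frames,
        active: BytesMut::new(),
    };
    FlightData::decode(buf)
}
```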
Drawback
For use cases with smaller data volumes, memory copying isn't the primary performance concern. However, the proposed solution might introduce slight overhead in such scenarios compared to the direct use of bytes::BytesMut. Even so, this minor overhead from SliceBuffer would likely not emerge as a new bottleneck.
Alternatives
One alternative is to place the SliceBuffer behind a feature flag, retaining tonic's default behavior. Based on the proposal, the encoder's default behavior remains unchanged unless a custom codec is specified, but the default decoder changes to avoid memory copying by inserting the slice directly into the SliceBuffer.
My benchmarks show decoder performance improvements in many scenarios, especially those mirroring real-world use cases. Adding a feature guard might confuse users and detract from codebase readability, so I recommend against it.