Closed: mzabaluev closed this issue 5 years ago.
Thanks for writing it up. This is a pretty good summary.
I don't have a philosophical opinion about the internal representation as long as it is optimal for the common use cases.
I will say that the current benchmark suite is not close to covering the common use cases, so the suite will need to be improved.
One use case is deserializing a payload into many different small values (HTTP headers, protobuf messages, etc.), modifying them, then serializing back.
Thank you, I did not consider proxy or other modifier scenarios.
Are there many examples of modifying the field values in place as `BytesMut`, rather than replacing them as the result of more complex processing involving iterators/`collect`, string formatting, intermediate `Vec`/`String`, etc.?
In master now (from #298), some of these things have been done:

- `Bytes` API (just the position fields when making slices).
- `BytesMut` only has `KIND_VEC` and `KIND_ARC`, only promoting to `KIND_ARC` if you use `BytesMut::split`.
This is an RFC-like redesign proposal for the API and internal representation of `Bytes` and `BytesMut`, aiming to eliminate performance costs not warranted by actual usage and to enable performance optimizations that better fit the intended usage of the crate.

## Motivation
The internal design choices and the API of `bytes` as of 0.4 appear to accommodate too many use cases in the most flexible way possible, to the detriment of its primary usage: providing efficient memory buffers for multithreaded I/O stacks such as Tokio.

Of particular concern are the dynamic optimizations that switch between different internal representations to accommodate, among other reasons, specific conversions from common data types. The alternative representations add a lot of complexity and put extra pressure on the CPU's branch predictor and instruction cache. Furthermore, the need to migrate between representation variants causes interior mutability where the public API does not require it, resulting in excessive use of atomic pointers.
It seems feasible to reduce the internal complexity and improve performance for the primary use cases by cutting down on some non-essential convenience APIs, externalizing some dynamic representation variants, and optimizing the internal layout for data and pointers. Alternatives to removed or de-optimized interfaces should be provided so as not to hurt usability too much.
## Target use cases for optimization
Before starting off with breaking changes, it makes sense to determine the usage scenarios for which optimization is being performed. Other usages may present conflicting optimization requirements, but these should be considered of lesser priority.
The main purpose of `bytes` is to provide efficient, shallow-copied memory buffers to Tokio and similar network I/O stacks. Therefore the most important use patterns are found in performing input, output, and short-circuiting between the two.

### Decoding input
A pattern commonly seen when processing input is to read data incrementally into a `BytesMut` with a pre-reserved capacity on the order of a few memory pages, extended further if necessary, until enough is received to be taken out of the buffer for processing, often on other tasks or threads. Parts of the buffer may be split off into `Bytes` views representing byte array fields, strings, and the like.

Below is a simplified example demonstrating this usage scenario (in the 0.4 API):
If processing is performed asynchronously from the reading loop and may be slower than the throughput of the reading channel, some back pressure scheme should be implemented by the application to prevent the buffer from growing indefinitely.
### Encoding output

A common pattern in network protocol implementations is to repeatedly populate buffers with encoded messages using interfaces provided by `BytesMut`. The frozen buffers are then passed to an output sink for sending.

An obvious route for optimization here, and one that is up to the user of the crate, is to avoid frequent allocations by reusing the `BytesMut` instance and flushing the output sink before encoding the next message, so that the `Bytes` reference(s) split off the mutable buffer are dropped and the capacity can be reused. In buffering scenarios where an upper bound on the capacity of the sink, in terms of `Bytes` instances, can be estimated, a more complicated scheme can be employed: rotate through a larger number of `BytesMut` instances and rely on back pressure of the sink to free them up for reuse.

It is worth noting that memory management with `Bytes` and `BytesMut` resembles a custom allocation layer with a few added operations, and perhaps abstracting parts of it in a generic API built on allocators could allow for more flexibility and performance optimizations, such as enabling single-threaded usage without atomic reference counting. Any such drastic change, though, is left out of scope for this proposal.
### Literals and static constant data

Sometimes entire `Bytes` instances are constructed from literal values or `'static` const data. As network protocols tend to be used to send dynamically variable messages, the importance of optimizing for such use, versus composing larger contiguous buffers with, e.g., `write!(&mut buf, ...)`, needs to be ascertained.

### Conversion from standard library owned containers
Users may elect to populate a `Vec` or a `String` and then transform it into `Bytes`. This may be done for the convenience of the richer API available on the standard types, but may also be dictated by the data types used in interfaces with other parts of the program.
### Zero copy pass-through

It should be possible to splice a stream of blocks of data received from an input source, e.g. an `AsyncRead` transport connection, to an output sink such as one implementing `AsyncWrite`, without copying large amounts of data between intermediate buffers. This is achieved in the current design, and this proposal does not intend to change it.

### Modify and forward
Another important use case is deserializing a payload into many different small values (HTTP headers, protobuf messages, etc.), modifying them, then serializing back or forwarding the compound structure for further processing. The modification of individual field values is likely done by replacing them with newly constructed values rather than modifying the buffers in place via `Bytes` or `BytesMut` handles, which would force an internal reallocation anyway if the value needs to be expanded.

## Proposed changes
### Remove mutation API from `Bytes`
The `extend*` APIs on `Bytes` are redundant with the same APIs on `BytesMut` coupled with the zero-copy conversions: `BytesMut::freeze` or the `From` impl. Removing the mutating methods and the notion of capacity from `Bytes` will reduce the size of the structure to 3 words instead of the current 4.
### Cut back on alternate internal representations

The alternate representations tagged by `KIND_STATIC` and `KIND_VEC` lazily optimize conversions from certain foreign data types, namely built-in slices of `'static` lifetime and `Vec`. The branching to check for these special cases, however, has a runtime cost that all users of the library have to pay, whether they use these conversions or not. In the spirit of zero-cost abstractions, the prevailing practice, seen e.g. in the Rust standard library, is to make dynamic optimization variants explicit with sum types like `Cow`.

For applications that want to benefit from non-copying use of possible alternate byte slice containers, the following enumeration type should be provided:
The dynamic complexity currently hidden in the implementation of `Bytes` can be replicated explicitly with the type `Input<Cow<'a, [u8]>>`.

In the long term, special-case optimizations of this kind would be better addressed at compile time, by abstracting over `Buf` or `IntoBuf` and specializing for `Bytes`. This is already a design consideration made by some crate authors.

Some of the convenience available with `Vec`/`String` can be replicated with `BytesMut` in a slightly less convenient way: for example, users who do `Bytes::from(format!(...))` could be instructed to allocate a `BytesMut` instance with sufficient capacity, do `write!(&mut buf, ...).unwrap();`, and call `.take()` or `.freeze()` on the written buffer.
### KIND_VEC as a "lightweight" representation

Besides optimizing conversions, `KIND_VEC` serves as a lightweight representation of a buffer that has not yet been shared, avoiding an extra allocation for the `Shared` descriptor. Unfortunately, this causes interior mutability at the point when a shallow clone is made, and as a consequence the pointer to `Shared` needs to be accessed atomically almost everywhere.

As most of the primary use scenarios split off view handles from the buffer, it seems better to start with the shareable representation up front and let the compiler make use of normal mutability semantics and the pointer optimizations available for `ptr::NonNull` (tokio-rs/bytes#241).

### In-place initialization with literals
An interesting special case is initializing instances of `Bytes` or `BytesMut` with literal byte arrays or strings. To get the optimization back, in-place construction macros `bytes![]` and `bytes_mut![]` can be provided; as an added benefit, the macros might be able to select the optimal internal representation (among those remaining) at compile time.
### ARC as the only representation of BytesMut

Given the primary usage scenarios outlined above, it does not seem reasonable to expect that many applications populating `BytesMut` instances would consistently keep within buffer sizes fitting the inline layout (31 bytes on 64-bit targets). Applications that fit it only occasionally may incur branch misprediction penalties, detracting from the performance advantage.

By streamlining the `BytesMut` code paths to use only the shared ARC representation, and letting the user optimize by reusing buffers if necessary, optimal performance for the primary use cases can be achieved. Inlining of `Bytes` contents for small sizes can still be taken advantage of (albeit with smaller size limits due to the reduced size of the structure) with the API changes proposed in the next section.

### The splitting methods on BytesMut to return Bytes
The methods `BytesMut::split_to` and `BytesMut::take` return `BytesMut`, although cases of modifying a severed head of the buffer once it has been read are unlikely to be common, and the head part would need to be reallocated if its content is extended. As seen in the input example, a `.freeze()` method call is used to convert the split-off part to `Bytes` for further processing. This is just a type-shifting no-op in the current implementation, but with the internal simplification of `BytesMut` proposed above, it would be better for the splitting methods to construct `Bytes` instances directly, possibly taking advantage of the inline layout in the latter, without the temporary `BytesMut` instance and the reference count operations that would otherwise be necessary for `buf.split_to(n).freeze()`.

For applications that do need splitting into mutable buffers, methods `BytesMut::split_to_mut` and `BytesMut::take_mut` can be provided as a replacement.
## Benchmarks

The existing benchmark suite will be used to provide end proof of improvement. The benchmarks should be preserved, or updated to equivalent new APIs, as much as possible. More benchmarks should be written to showcase improvement in the primary use cases, if necessary.

Strive to uphold the principle of Zero Cost Abstractions!