Closed: mzabaluev closed this issue 5 years ago.
Thanks for writing it up. This is a pretty good summary.
I don't have a philosophical opinion about the internal representation as long as it is optimal for the common use cases.
I will say that the current benchmark suite is not close to covering the common use cases, so the suite will need to be improved.
One use case is deserializing a payload into many different small values (HTTP headers, protobuf messages, etc.), modifying them, then serializing back.
Thank you, I did not consider proxy or other modifier scenarios.
Are there many examples of modifying the field values in place as `BytesMut`, rather than replacing them as the result of more complex processing involving iterators/`collect`, string formatting, intermediate `Vec`/`String`, etc.?
In master now (from #298), some of these things have been done:

- `Bytes` API (just the position fields when making slices).
- `BytesMut` only has `KIND_VEC` and `KIND_ARC`, only promoting to `KIND_ARC` if you use `BytesMut::split`.
This is an RFC-like redesign proposal for the API and internal representation of `Bytes` and `BytesMut`, aiming to eliminate performance costs not warranted by actual usage and to enable performance optimizations that better fit the intended usage of the crate.

## Motivation
The internal design choices and the API of `bytes` as of 0.4 appear to accommodate too many use cases in the most flexible way possible, to the detriment of its primary usage: providing efficient memory buffers for multithreaded I/O stacks such as Tokio.

Of particular concern are the dynamic optimizations that switch between different internal representations to accommodate, among other reasons, specific conversions from common data types. The alternative representations add a lot of complexity and put extra pressure on the CPU's branch predictor and instruction cache. Furthermore, the need to migrate between representation variants causes interior mutability where the public API does not require it, resulting in excessive use of atomic pointers.
It seems feasible to reduce the internal complexity and improve performance for the primary use cases by cutting down on some non-essential convenience APIs, externalizing some dynamic representation variants, and optimizing the internal layout for data and pointers. Alternatives to removed or de-optimized interfaces should be provided so as not to hurt usability too much.
## Target use cases for optimization
Before starting off with breaking changes, it makes sense to determine the usage scenarios for which optimization is being performed. Other usages may present conflicting optimization requirements, but these should be considered of lesser priority.
The main purpose of `bytes` is to provide efficient, shallow-copied memory buffers to Tokio and similar network I/O stacks. Therefore the most important use patterns are found in performing input, output, and short-circuiting between the two.

### Decoding input
A pattern commonly seen when processing input is to read data incrementally into a `BytesMut` with a pre-reserved capacity on the order of a few memory pages, extended further if necessary, until enough is received to be taken out of the buffer for processing, often on other tasks or threads. Parts of the buffer may be split off into `Bytes` views representing byte array fields, strings, and the like.

Below is a simplified example demonstrating this usage scenario (in the 0.4 API):
If processing is performed asynchronously from the reading loop and may be slower than the throughput of the reading channel, some back pressure scheme should be implemented by the application to prevent the buffer from growing indefinitely.
### Encoding output

A common pattern in network protocol implementations is to repeatedly populate buffers with encoded messages using interfaces provided by `BytesMut`. The frozen buffers are then passed to an output sink for sending.

An obvious route for optimization here, and one that is up to the user of the crate, is to avoid frequent allocations by reusing the `BytesMut` instance and flushing the output sink before encoding the next message, so that the `Bytes` reference(s) split off the mutable buffer are dropped and the capacity can be reused. In buffering scenarios where an upper bound on the capacity of the sink, in terms of `Bytes` instances, can be estimated, a more complicated scheme can be employed: rotate through a larger number of `BytesMut` instances and rely on back pressure of the sink to free them up for reuse.

It is worth noting that memory management with `Bytes` and `BytesMut` resembles a custom allocation layer with a few added operations, and perhaps abstracting parts of it in a generic API built on allocators could allow for more flexibility and performance optimizations, such as enabling single-threaded usage without atomic reference counting. Any such drastic change, though, is left out of scope for this proposal.
### Literals and static constant data

Sometimes entire `Bytes` instances are constructed from literal values or `'static` const data. As network protocols tend to be used to send dynamically variable messages, the importance of optimizing for such use, versus composing larger contiguous buffers with, e.g., `write!(&mut buf, ...)`, needs to be ascertained.

### Conversion from standard library owned containers
Users may elect to populate a `Vec` or a `String` and then transform it into `Bytes`. This may be done for the convenience of the richer API available on the standard types, but may also be dictated by the data types used in interfaces with other parts of the program.
### Zero copy pass-through

It should be possible to splice a stream of blocks of data received from an input source, e.g. an `AsyncRead` transport connection, to an output sink such as one implementing `AsyncWrite`, without copying large amounts of data between intermediate buffers. This is achieved in the current design, and this proposal does not intend to change it.

### Modify and forward
Another important use case is deserializing a payload into many different small values (HTTP headers, protobuf messages, etc.), modifying them, then serializing back or forwarding the compound structure for further processing. The modification of individual field values is likely done by replacing them with newly constructed values rather than modifying the buffers in place via `Bytes` or `BytesMut` handles, which would force an internal reallocation anyway if the value needs to be expanded.

## Proposed changes
### Remove mutation API from `Bytes`
The `extend*` APIs on `Bytes` are redundant with the same APIs on `BytesMut` coupled with the zero-copy conversions: `BytesMut::freeze` or the `From` impl. Removing the mutating methods and the notion of capacity from `Bytes` will reduce the size of the structure to 3 words instead of the current 4.
### Cut back on alternate internal representations

The alternate representations tagged by `KIND_STATIC` and `KIND_VEC` lazily optimize conversions from certain foreign data types, namely built-in slices of `'static` lifetime and `Vec`. The branching to check for these special cases, however, has a runtime cost that all users of the library have to pay, whether they use these conversions or not. In the spirit of zero-cost abstractions, the prevailing practice, seen e.g. in the Rust standard library, is to make dynamic optimization variants explicit with sum types like `Cow`.

For applications that want to benefit from non-copying use of possible alternate byte slice containers, the following enumeration type should be provided:
The dynamic complexity currently hidden in the implementation of `Bytes` can be replicated explicitly with the type `Input<Cow<'a, [u8]>>`.

In the long term, special-case optimizations of this kind would be better addressed at compile time, by abstracting over `Buf` or `IntoBuf` and specializing for `Bytes`. This is already a design consideration made by some crate authors.

Some of the convenience available with `Vec`/`String` can be replicated with `BytesMut` in a slightly less convenient way: for example, users who do `Bytes::from(format!(...))` could be instructed to allocate a `BytesMut` instance with sufficient capacity, do `write!(&mut buf, ...).unwrap();`, and call `.take()` or `.freeze()` on the written buffer.
### KIND_VEC as a "lightweight" representation

Besides optimizing conversions, `KIND_VEC` serves as a lightweight representation of a buffer that has not yet been shared, avoiding an extra allocation for the `Shared` descriptor. Unfortunately, this causes interior mutability at the point when a shallow clone is made, and as a consequence the pointer to `Shared` needs to be accessed atomically almost everywhere.

As most of the primary use scenarios split off view handles from the buffer, it seems better to start with the shareable representation up front and let the compiler make use of normal mutability semantics and the pointer optimizations available for `ptr::NonNull` (tokio-rs/bytes#241).

### In-place initialization with literals
An interesting special case is initializing instances of `Bytes` or `BytesMut` with literal byte arrays or strings. To get the optimization back, in-place construction macros `bytes![]` and `bytes_mut![]` can be provided; as an added benefit, the macros might be able to select the optimal internal representation (among those remaining) at compile time.
### ARC as the only representation of BytesMut

Given the primary usage scenarios outlined above, it does not seem reasonable to expect that many applications populating `BytesMut` instances would consistently keep within buffer sizes fitting the inline layout (31 bytes on 64-bit targets). Applications that fit it only occasionally may incur branch misprediction penalties, detracting from the performance advantage.

By streamlining the `BytesMut` code paths to use only the shared ARC representation, and letting the user optimize by reusing buffers if necessary, optimal performance for the primary use cases can be achieved. Inlining of `Bytes` contents for small sizes can still be taken advantage of (albeit with smaller size limits due to the reduced size of the structure) with the API changes proposed in the next section.

### The splitting methods on BytesMut to return Bytes
The methods `BytesMut::split_to` and `BytesMut::take` return `BytesMut`, although cases of modifying a severed head of the buffer once it has been read are unlikely to be common, and the head part would need to be reallocated if its content is extended. As seen in the input example, a `.freeze()` method call is used to convert the split-off part to `Bytes` for further processing. This is just a type-shifting no-op in the current implementation, but with the internal simplification of `BytesMut` proposed above, it would be better for the splitting methods to construct `Bytes` instances directly, possibly taking advantage of the inline layout in the latter, without the temporary `BytesMut` instance and the reference count operations that would otherwise be necessary for `buf.split_to(n).freeze()`.

For applications that do need splitting into mutable buffers, methods `BytesMut::split_to_mut` and `BytesMut::take_mut` can be provided as a replacement.
## Benchmarks

The existing benchmark suite will be used to provide end proof of improvement. The benchmarks should be preserved, or updated to equivalent new APIs, as much as possible. More benchmarks should be written to showcase improvement in the primary use cases, if necessary.

Strive to uphold the principle of Zero Cost Abstractions!