rust-lang / libs-team

The home of the library team

Un-specialize impl ToString #363

Open HyeonuPark opened 3 months ago

HyeonuPark commented 3 months ago

Proposal

Problem statement

"&str".to_string() is such a fundamental part of the Rust language. It's the second function introduced by "the book" to construct String right after the String::new(). And still is one of the most preferable methods to construct owned String for many experienced Rust developers. But the method is based on the Display trait, which implies std::fmt infrastructures with unavoidable dynamic dispatches and optimization blocker.. which is not. Since it's such a common function we've especially optimized them to ideal code with specialization.

But specialization is not a silver bullet. Since it specializes on specific types, it doesn't compose naturally. For example, (&"&str").to_string() doesn't take the specialized route and performs dynamic dispatch. The same applies to Arc<str>, arrayvec::ArrayString, etc., and to the also-specialized primitive types like bool and i32 when reached through some indirection.

Additionally, if we had a general optimization infrastructure, it could be reused by other functions that consume impl Display, like fn write_to(target: &mut String, content: impl Display).

Motivating examples or use cases

These .to_string() calls invoke the std::fmt infrastructure with dynamic dispatch. Specialization optimizes a few chosen cases, but the conditions are subtle, which can be surprising.

fn cond(s: String) -> bool {...}
let b = vec!["foo", "bar"].iter().any(|s| cond(s.to_string()));

let s: Arc<str> = Arc::from("foo");
s.to_string();

// external crate defined types

let s: arrayvec::ArrayString = "foo".into();
s.to_string();

let s = FourCC::new(r"mpeg").unwrap();
s.to_string();

// and more types without touching specialization

42_i32.to_string();

std::net::Ipv4Addr::LOCALHOST.to_string();

Solution sketch

The goal of this ACP is to un-specialize impl ToString without affecting performance, by adding a new method to the trait Display. Currently a handful of types are specialized via the ToString trait, and they can be divided into 3 groups: types that already contain a str and can simply return it, types whose formatted output fits in a small statically sized buffer, and types that only know a capacity hint at runtime (like fmt::Arguments<'_>).

To address all those cases I propose to add another method to trait Display:

pub trait Display {
    ...
    fn try_to_str<'a>( // <- New!
        &'a self,
        temp_buf: &'a mut TempBuffer,
    ) -> Result<&'a str, TryToStrError> {
        Err(TryToStrError::new())
    }
}

const TEMP_BUF_CAPACITY: usize = 64;

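// Note: STACK_ALIGN stands in for the target's ABI stack alignment; it is not an existing constant.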
#[repr(align(STACK_ALIGN))]
pub struct TempBuffer {
    buffer: [MaybeUninit<u8>; TEMP_BUF_CAPACITY],
}

impl TempBuffer {
    fn new() -> Self {...}
    fn buffer<const N: usize>(&mut self) -> Result<&mut [u8; N], TryToStrError> {...}
    fn uninit_buffer<const N: usize>(&mut self) -> Result<&mut [MaybeUninit<u8>; N], TryToStrError> {...}
}

pub struct TryToStrError {...}

impl TryToStrError {
    fn new() -> Self {...}
    fn with_reserve_hint(self, reserve_hint: usize) -> Self {...}
    fn reserve_hint(&self) -> usize {...}
}

// And replace existing specialization-based `impl ToString`s
impl<T: Display> ToString for T {
    fn to_string(&self) -> String {
        let mut buf = match self.try_to_str(&mut TempBuffer::new()) {
            Ok(s) => return s.to_owned(),
            Err(err) => String::with_capacity(err.reserve_hint()),
        };

        let mut formatter = core::fmt::Formatter::new(&mut buf);
        // Bypass format_args!() to avoid write_str with zero-length strs
        fmt::Display::fmt(self, &mut formatter)
            .expect("a Display implementation returned an error unexpectedly");
        buf
    }
}

// Example `impl Display` code

impl fmt::Display for str {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {...}

    fn try_to_str<'a>(
        &'a self,
        _temp_buf: &'a mut TempBuffer,
    ) -> Result<&'a str, TryToStrError> {
        Ok(self)
    }
}

impl fmt::Display for u8 {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {...}

    fn try_to_str<'a>(
        &'a self,
        temp_buf: &'a mut TempBuffer,
    ) -> Result<&'a str, TryToStrError> {
        const REQUIRED_BUF_CAPACITY: usize = 3;
        let buf = temp_buf.buffer::<REQUIRED_BUF_CAPACITY>()?;

        // write to temp_buf
        let mut n = *self;
        let mut offset = 0;
        if n >= 10 {
            if n >= 100 {
                buf[offset] = b'0' + n / 100;
                offset += 1;
                n = n % 100;
            }
            buf[offset] = b'0' + n / 10;
            offset += 1;
            n = n % 10;
        }
        buf[offset] = b'0' + n;
        offset += 1;

        Ok(unsafe {
            std::str::from_utf8_unchecked(&buf[..offset])
        })
    }
}
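
The remaining case is a type that only knows its output size as a runtime hint. A rough sketch in the spirit of fmt::Arguments<'_> could look like the following; as_str() is the existing method on Arguments, while estimated_capacity() stands in for whatever internal length estimate the implementation has available:

impl fmt::Display for fmt::Arguments<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {...}

    fn try_to_str<'a>(
        &'a self,
        _temp_buf: &'a mut TempBuffer,
    ) -> Result<&'a str, TryToStrError> {
        match self.as_str() {
            // A single literal piece can be borrowed directly.
            Some(s) => Ok(s),
            // Otherwise fall back to fmt(), telling the caller how much to reserve.
            None => Err(TryToStrError::new().with_reserve_hint(self.estimated_capacity())),
        }
    }
}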

Types that delegate to str should return a reference to their internal str value. Types that need a small buffer for formatting can use the provided temp_buf argument. Types that don't know their buffer size statically but can compute a capacity hint dynamically, like fmt::Arguments<'_> (as sketched above), can return an error carrying that hint. For all other types this change is a no-op.
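
On the consumer side, the same hook could back the write_to-style helper mentioned in the problem statement. This is only a sketch under the assumption that try_to_str exists as proposed; write_to itself is a hypothetical function, not an existing API:

fn write_to(target: &mut String, content: impl fmt::Display) {
    use std::fmt::Write as _;

    let mut temp_buf = TempBuffer::new();
    match content.try_to_str(&mut temp_buf) {
        // Fast path: append the borrowed str without going through std::fmt.
        Ok(s) => target.push_str(s),
        // Slow path: reserve up front, then fall back to the formatting machinery.
        Err(err) => {
            target.reserve(err.reserve_hint());
            write!(target, "{content}")
                .expect("a Display implementation returned an error unexpectedly");
        }
    }
}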

Buffer capacity

To make this change truly zero cost, the check that the buffer provides enough space (the branch in TempBuffer::buffer()) compares two constants (TEMP_BUF_CAPACITY and REQUIRED_BUF_CAPACITY), so the branch is expected to be optimized out.
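
A minimal sketch of what TempBuffer::buffer() could look like under that assumption; the zero-initialization here is only the simplest way to hand out a safe &mut [u8; N], and uninit_buffer() would be the variant that skips it:

impl TempBuffer {
    fn buffer<const N: usize>(&mut self) -> Result<&mut [u8; N], TryToStrError> {
        // Both operands are compile-time constants, so this branch should
        // fold away after monomorphization.
        if N <= TEMP_BUF_CAPACITY {
            let prefix = &mut self.buffer[..N];
            for byte in prefix.iter_mut() {
                byte.write(0);
            }
            // SAFETY: every byte of `prefix` was initialized just above.
            let init = unsafe { &mut *(prefix as *mut [MaybeUninit<u8>] as *mut [u8]) };
            Ok(init.try_into().expect("prefix has length N"))
        } else {
            Err(TryToStrError::new())
        }
    }
}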

The callee knows its required buffer capacity, but the caller needs to decide on a capacity that is large enough for practical cases. For example, f64 may take up to 17 bytes to print, and a SocketAddr may take up to 58 bytes. I suggest 64 as the base value, since it's small enough for a stack buffer and still large enough for most leaf types.

Composite types

While the fmt method is designed to compose, it's generally not recommended for composite types to override the try_to_str method. This method is designed for directly consuming leaf types without any additional formatting arguments, as .to_string() does. Wrapper types (like Arc<T>) may override this method to delegate to their inner type, as sketched below.
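
For illustration, a delegating wrapper impl might look roughly like this (sketched against the trait method proposed above):

impl<T: fmt::Display + ?Sized> fmt::Display for Arc<T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        fmt::Display::fmt(&**self, f)
    }

    fn try_to_str<'a>(
        &'a self,
        temp_buf: &'a mut TempBuffer,
    ) -> Result<&'a str, TryToStrError> {
        // Forward to the inner value, so Arc<str> reaches str's trivial impl.
        (**self).try_to_str(temp_buf)
    }
}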

Alternatives

We can keep the current specialization-based structure. It has proven good enough for most use cases, and we can still add more specialized impls when needed.

Technically this behavior could be implemented in a third-party crate with its own ToString and DisplayExt traits. But that crate's .to_string() method couldn't be applied to another crate's types which only implement the standard Display trait and not the third-party DisplayExt trait.

As noted earlier, there are two main reasons to avoid the default .to_string() impl in performance-critical code: 1. it's not an optimal impl for leaf types, and 2. the dyn traits involved kill optimizations in the surrounding code. The second symptom might be considered a bug and may eventually be resolved; after that we may find the default impl is good enough for every type.

Links and related work

Edits

RustyYato commented 3 months ago

I think we should use core::io::BorrowedCursor instead of &mut [u8] (or maybe something similar that only allows writing str) so we can skip initializing the buffer.

HyeonuPark commented 3 months ago

I'm afraid BorrowedCursor's additional context would negatively impact optimization. This is not an IO buffer; it should be declared and filled within a single function with (hopefully inlined) non-IO code. It might be better to pass &mut [MaybeUninit<u8>] instead, though.

CAD97 commented 3 months ago

Using MaybeUninit<u8> instead of u8 for the buffer is desirable to keep the optimization path (mostly) "zero cost" even for dyn Display by avoiding redundant zeroing of the stack memory[^3]. Additionally, I think making the temp stack buffer size a known constant to try_to_str (i.e. use an array instead of slice, even if not as an API exposed constant) instead of a slice for this for-optimization API surface would be more "proper" and assist with inlining of the method (e.g. MIR inlining currently really doesn't like branches).

[^3]: Even with access to a "freeze" operation for the stack bytes, exposing contents of "freed" stack memory to arbitrary downstream safe code, while not unsound, is still undesirable.

I personally think it's justifiable to define a specific fmt::TempBuffer struct for usage here. It's a specific application that would benefit from a specific more targeted type API than just &mut [MaybeUninit<u8>; fmt::TEMP_BUF_SIZE]. io::BorrowedCursor has the extra context/overhead of tracking the init/writ subslice, which is specifically not necessary (nor desired) in this case, and could even mislead someone into thinking what's written into the temp space is relevant, rather than exclusively just the returned &str. Specifically, it controls against giving a slice of the output buffer to try_to_str, which is explicitly an incorrect usage (this is not a write). Additionally, it would permit marking the buffer as known aligned to the ABI stack alignment[^2], which could improve codegen some more.

[^2]: I believe this is 4 bytes on most 32-bit targets and 16 (not 8, because SIMD) on most 64-bit targets.

also cf. char::encode_utf8(self, &mut [u8]) -> &mut str.

Minor naming thing: TryToStrError::capacity_hint might also be better named as reserve_hint, as it's not only useful for a new output with_capacity, but also potentially for a reserve when writing into an existing buffer (e.g. imagine a clone_from like analog for to_string). Additionally, if Iterator grows the oft-discussed single-usize alternative to size_hint[^1], this method should mirror the name/semantics of that one.

[^1]: TL;DR: 99.9% of size_hint consumers only use the lower bound and ignore the upper bound entirely; AIUI the conservatively correct upper bound just isn't that useful of a bound. The useful "reserve_hint" for iterators AIUI is a cheap and conservative size prediction independent of the size bounds.

For example, (&"&str").to_string() doesn't take specialized route and perform dynamic dispatch.

For explicitness, a delegating specialization impl<T: ?Sized> ToString for &'_ T where T: ToString would unfortunately be unsound, as ToString is not a specialization-safe trait. Namely, downstream could implement ToString both for &'_ Ty<'_> and Ty<'static> (so long as Ty is not Display), which would make impl selection for &Ty depend on a lifetime, i.e. be unsound (or at a bare minimum, extremely surprising[^4]).

A type bound for "captures no lifetimes" (i.e. safe to specialize) would solve a single level of additional indirection, but not additional levels. Bounding T by Display instead might maintain soundness (by strictly and universally subsetting the existing blanket impl), but I'm not 100% confident in asserting so. (It's times like these I wish std had had access to #[sealed] for extension style traits pre-1.0 and had used it liberally... alas.)

[^4]: Specifically, in theory lifetime-dependent specialization "could" "just" only look at immediately evident bounds in lifetime-generic code and not actually specialize, but this mostly just defeats the purpose of specialization (cutting through layers of generality) and only even really makes any sense for strictly LSP-preserving specialization (which std limits itself to, but can't be guaranteed for downstream code, and in practice people do want to perform LSP-violating specialization).

HyeonuPark commented 3 months ago

Proposal updated. Thanks for the feedback!