rust-lang / libs-team


Add `write_utf8` to `io::Write` #282

Closed ChrisDenton closed 10 months ago

ChrisDenton commented 10 months ago

Proposal

Problem statement

When writing through io::Write we need to convert to bytes, thus losing encoding information. When writing to the Windows console, we recover this information by doing a UTF-8 check on the bytes. This should be redundant in many cases.
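As a rough illustration of the check described above (this is not std's actual console code, and `valid_utf8_prefix` is a hypothetical helper), a writer that only receives bytes has to re-validate them to find out how much of the buffer is text:

```rust
use std::str;

/// Hypothetical helper: returns the longest prefix of `buf` that is valid UTF-8.
/// A byte-oriented writer must run a check like this on every call, even when
/// the caller started from a `&str` and already knew the data was valid.
fn valid_utf8_prefix(buf: &[u8]) -> &str {
    match str::from_utf8(buf) {
        Ok(s) => s,
        Err(e) => {
            // `valid_up_to()` is the length of the longest valid prefix.
            str::from_utf8(&buf[..e.valid_up_to()]).expect("prefix was just validated")
        }
    }
}
```

With a `&str`-taking method, this validation pass could be skipped because the type already carries the guarantee.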

Requiring bytes also means that users often have to pepper their code with as_bytes() calls when writing strings, unless they use write_fmt. Admittedly, this is more of a minor annoyance than a serious issue.

Motivating examples or use cases

out.write("foo".as_bytes());
out.write("bar".as_bytes());
out.write("baz".as_bytes());
// etc, etc, etc

Solution sketch

Add write_utf8 to io::Write. It is so named to avoid conflicting with any write_str function that may already be implemented on a type.

pub trait Write {
    fn write_utf8(&mut self, buf: &str) -> io::Result<usize>;
}

Admittedly, write_utf8 raises the question of what to do when a partial write ends in the middle of a code point rather than on a boundary. This could be addressed in an implementation-defined manner or by always using write_all semantics. The only difference between the two is that the first option allows short writes that happen to fall on a boundary.
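The two options could be sketched roughly as follows. This is only an illustration under assumptions: `WriteUtf8Ext` and `write_utf8_short` are hypothetical names, and an extension trait is used so the sketch compiles today, whereas the proposal would add the method to io::Write itself. The second method shows just one possible implementation-defined behavior (finish the split code point, then stop).

```rust
use std::io::{self, Write};

/// Hypothetical extension trait standing in for the proposed method.
trait WriteUtf8Ext: Write {
    /// `write_all` semantics: the whole string is written or an error is
    /// returned, so a write can never stop inside a code point.
    fn write_utf8(&mut self, buf: &str) -> io::Result<usize> {
        self.write_all(buf.as_bytes())?;
        Ok(buf.len())
    }

    /// Short writes allowed: do a single `write`, and if it stopped inside a
    /// code point, write the remaining bytes of that code point before returning.
    fn write_utf8_short(&mut self, buf: &str) -> io::Result<usize> {
        let bytes = buf.as_bytes();
        // Clamp in case a misbehaving writer reports more than it was given.
        let written = self.write(bytes)?.min(bytes.len());
        let mut end = written;
        while !buf.is_char_boundary(end) {
            end += 1;
        }
        self.write_all(&bytes[written..end])?;
        Ok(end)
    }
}

// Blanket impl so every existing writer gets both methods.
impl<W: Write + ?Sized> WriteUtf8Ext for W {}
```

In both variants the returned count always lands on a char boundary, so a caller can safely continue from `&buf[n..]`.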

Alternatives

Incomplete UTF-8 writes

pub trait Write {
    // [u8] buffer is assumed to be UTF-8.
    // However, it may start with a partial UTF-8 sequence if it completes a previously written incomplete sequence.
    // Otherwise it's an error.
    unsafe fn write_utf8(&mut self, buf: &[u8]) -> io::Result<usize>;
}

This doesn't fully solve the issue (it still needs .as_bytes()!) but allows the implementation to do whatever it likes under the assumption that the bytes really are a str that is in the process of being written.
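To make "incomplete sequence" concrete, here is a small sketch of how a chunk that was cut mid code point can be told apart from genuinely invalid bytes. The helper name `ends_with_incomplete_sequence` is hypothetical; it relies only on `Utf8Error::error_len`, which returns `None` when the input ended unexpectedly.

```rust
use std::str;

/// Hypothetical helper: true if `chunk` is valid UTF-8 up to a trailing
/// sequence that was cut short (and could be completed by the next chunk).
fn ends_with_incomplete_sequence(chunk: &[u8]) -> bool {
    match str::from_utf8(chunk) {
        Ok(_) => false,
        // `error_len() == None` means the end of input was reached in the
        // middle of a sequence, as opposed to hitting an invalid byte.
        Err(e) => e.error_len().is_none(),
    }
}
```

Note that the chunk carrying the rest of the sequence starts with a continuation byte and fails validation on its own, which is why an implementation of this alternative would have to remember state between calls.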


EDIT: Removed references to ASCII.

ChrisDenton commented 10 months ago

After thinking about this some more, I also opened rust-lang/rust#116871 for not erroring if given invalid Unicode (instead it's lossy). I think although the two issues are related, either would be useful whether or not the other is accepted.

ChrisDenton commented 10 months ago

This was discussed in the libs-api meeting. An important point raised there is that, to be most useful, the "is valid UTF-8" property would need to be preserved by intermediaries (e.g. buffer types). This means that all existing types (in std and the wider crate ecosystem) would need updating; otherwise it would be of very limited use. To be honest, I agree that this is a strong argument against this proposal.
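The intermediary problem can be shown with a toy model. Everything here is hypothetical (`Utf8Sink`, `Console`, `OldWrapper` are illustration-only names, and a standalone trait is used instead of io::Write): a wrapper written before write_utf8 existed only forwards bytes, so the default method quietly drops the guarantee before it reaches the inner writer.

```rust
use std::io;

/// Hypothetical stand-in for `io::Write` plus the proposed method.
trait Utf8Sink {
    fn write_bytes(&mut self, buf: &[u8]) -> io::Result<usize>;

    /// Default: forget that the data was a `str` and forward plain bytes.
    fn write_utf8(&mut self, buf: &str) -> io::Result<usize> {
        self.write_bytes(buf.as_bytes())
    }
}

/// Toy sink that counts which path each write arrived on.
#[derive(Default)]
struct Console {
    known_utf8: usize,
    needs_check: usize,
}

impl Utf8Sink for Console {
    fn write_bytes(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.needs_check += 1; // would have to validate here
        Ok(buf.len())
    }
    fn write_utf8(&mut self, buf: &str) -> io::Result<usize> {
        self.known_utf8 += 1; // validation can be skipped
        Ok(buf.len())
    }
}

/// A pass-through wrapper that predates `write_utf8` and only forwards bytes.
struct OldWrapper<W>(W);

impl<W: Utf8Sink> Utf8Sink for OldWrapper<W> {
    fn write_bytes(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.0.write_bytes(buf)
    }
}
```

Calling write_utf8 on `OldWrapper(Console::default())` ends up in the console's byte path, so the benefit only materializes once every layer in the stack overrides the new method.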

That aside, I can look at how this would affect performance as that would provide an argument for it. Though for the above reason I'm minded to close.

ChrisDenton commented 10 months ago

Closing this as per the above. I'm now convinced this is too much churn and complexity for implementers of the Write trait. And when writing a largish buffer (which people who care about perf will do), this doesn't really help, as the time taken to print dominates the performance.