rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org

Tracking Issue for explicit-endian String::from_utf16 #116258

Open CAD97 opened 1 year ago

CAD97 commented 1 year ago

Feature gate: #![feature(str_from_utf16_endian)]

This is a tracking issue for versions of String::from_utf16 which take &[u8] and use a specific endianness.

Public API

impl String {
    fn from_utf16le(v: &[u8]) -> Result<String, FromUtf16Error>;
    fn from_utf16le_lossy(v: &[u8]) -> String;
    fn from_utf16be(v: &[u8]) -> Result<String, FromUtf16Error>;
    fn from_utf16be_lossy(v: &[u8]) -> String;
}
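For readers on stable Rust, the behaviour described above can be approximated with existing standard-library pieces. This is only a sketch: the function name `utf16le_to_string` and the unit error type are illustrative assumptions, not the actual implementation behind the feature gate.

```rust
// Hypothetical stable-Rust stand-in for the unstable `String::from_utf16le`.
// The name and the `()` error type are assumptions made for illustration.
fn utf16le_to_string(v: &[u8]) -> Result<String, ()> {
    // An odd number of bytes cannot be split into whole 16-bit code units.
    if v.len() % 2 != 0 {
        return Err(());
    }
    // Read each byte pair as one little-endian code unit.
    let units = v
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]));
    // `char::decode_utf16` yields an error for each lone surrogate;
    // collecting into `Result` stops at the first one.
    char::decode_utf16(units)
        .collect::<Result<String, _>>()
        .map_err(|_| ())
}
```

Calling `utf16le_to_string(&[0x68, 0x00, 0x69, 0x00])` gives `Ok("hi")`, while an odd-length slice and a lone surrogate both collapse into the same `Err(())`. A big-endian variant would swap in `u16::from_be_bytes`.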

Steps / History

Unresolved Questions

zachs18 commented 9 months ago

Perhaps as an unresolved question: with these added, FromUtf16Error's Display impl is no longer always accurate. It says "invalid utf-16: lone surrogate found", but these functions introduce a new failure case: a &[u8] of odd length. Making FromUtf16Error record which kind of error occurred would mean it is no longer a ZST, which could degrade performance, since Result<String, FromUtf16Error> is currently null-pointer-optimized (though this is not guaranteed) to be the same size as String. (see below)

Alternately, they could return some new FromUtf16BytesError type which can represent both errors, so that String::from_utf16 can still return the null-pointer-optimized Result<String, FromUtf16Error>.

(Alternately, FromUtf16Error's Display impl could be updated to say something like "invalid utf-16: lone surrogate found, or odd length byte string passed".)
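To make the separate-error-type alternative concrete, here is a rough sketch of what such a type could look like. Everything in it is hypothetical: the name `Utf16BytesError`, its variants, the odd-length message, and the helper `utf16be_to_string` are assumptions for illustration. Only the lone-surrogate message is taken from the existing Display output quoted above.

```rust
use std::fmt;

// Hypothetical error type distinguishing the two failure cases.
// Not part of std; the name and variants are illustrative assumptions.
#[derive(Debug, PartialEq, Eq)]
enum Utf16BytesError {
    OddLength,
    LoneSurrogate,
}

impl fmt::Display for Utf16BytesError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Assumed wording for the new failure case.
            Utf16BytesError::OddLength => f.write_str("invalid utf-16: odd-length byte slice"),
            // Wording of the current FromUtf16Error message.
            Utf16BytesError::LoneSurrogate => f.write_str("invalid utf-16: lone surrogate found"),
        }
    }
}

impl std::error::Error for Utf16BytesError {}

// Hypothetical big-endian decoder that reports which case occurred.
fn utf16be_to_string(v: &[u8]) -> Result<String, Utf16BytesError> {
    if v.len() % 2 != 0 {
        return Err(Utf16BytesError::OddLength);
    }
    let units = v
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]));
    char::decode_utf16(units)
        .collect::<Result<String, _>>()
        .map_err(|_| Utf16BytesError::LoneSurrogate)
}
```

With this shape, callers can match on the variant or simply print the error, and the message is accurate in both cases.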

CAD97 commented 9 months ago

To note, Result<String, enum { L, R }> is still niched. The data pointer is null and the other 2×usize are available to carry the Err payload. The only performance hit would be constructing or inspecting the error payload.
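The size claim can be checked empirically with `std::mem::size_of`. The sketch below uses two stand-in error types (`ZstError` and `KindedError`, both hypothetical) rather than the real `FromUtf16Error`. Rust does not guarantee these layouts, so the result reflects what a particular compiler happens to do.

```rust
use std::mem::size_of;

// Stand-in for a zero-sized error type (an assumption, not std's type).
#[allow(dead_code)]
struct ZstError(());

// Stand-in for an error type that records which failure occurred.
#[allow(dead_code)]
enum KindedError {
    OddLength,
    LoneSurrogate,
}

// Returns the sizes of `String`, `Result<String, ZstError>` and
// `Result<String, KindedError>`, in that order.
fn result_sizes() -> (usize, usize, usize) {
    (
        size_of::<String>(),
        size_of::<Result<String, ZstError>>(),
        size_of::<Result<String, KindedError>>(),
    )
}
```

On recent compilers all three values are expected to be equal, which would match the observation that the Err payload fits into the unused space of the niche-optimized Result.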

But that said, I also think just rendering the error as invalid utf-16 would be sufficient. Adding a new variant to the existing enum is also fine, but I don't think making a new error type is particularly helpful.

An alternative would be to panic if given an odd-length slice, since that's trivial to precheck, but that's not a particularly good alternative.