Document encode_utf16() endianness; maybe add endianness and BOM options.

rust-lang / rust

Document encode_utf16() endianness; maybe add endianness and BOM options. #83102

Open BartMassey opened 3 years ago

BartMassey commented 3 years ago

The documentation does not specify the endianness of str::encode_utf16() and char::encode_utf16(). From the source they look big-endian (UTF-16BE), but I may be misreading it, and they could be little-endian (UTF-16LE) or native-endian.

This may be a deliberate design decision: if so I think it should be reconsidered, as the encoding is useless for some purposes if you don't know its endianness.

It would also be nice to document whether str::encode_utf16() inserts a byte-order mark (BOM). From the source I am fairly sure it does not, which is fine.
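To make the BOM point concrete, here is a minimal sketch. The helper `encode_utf16_with_bom` is hypothetical and not part of std; it only illustrates that a caller who wants a BOM would have to prepend U+FEFF, on the assumption that `str::encode_utf16()` emits none.

```rust
/// Hypothetical helper (not part of std): encode `s` as UTF-16 code units
/// with a leading byte-order mark, U+FEFF.
fn encode_utf16_with_bom(s: &str) -> Vec<u16> {
    const BOM: u16 = 0xFEFF;
    // Prepend the BOM, then append the code units std produces for `s`.
    std::iter::once(BOM).chain(s.encode_utf16()).collect()
}
```

For "hi", plain `encode_utf16()` yields just the two code units for 'h' and 'i', while the helper yields three units starting with 0xFEFF.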

It is probably too late to rename these functions or to add equivalents of opposite endianness at this point, which is too bad. It's an odd API given that the corresponding decode functions have little-endian and big-endian variants.

ChrisDenton commented 3 years ago

encode_utf16 uses the platform's native endianness. This is clear from the source, where a u32 is cast directly to a u16 without any byte-order conversion. I agree that it would be good to document this explicitly.

https://github.com/rust-lang/rust/blob/0ab7c1d56f92ebc3c456a0c7c502ba1593e76f8c/library/core/src/char/methods.rs#L1641-L1646
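As a rough illustration of the point above, the sketch below collects the code units that `char::encode_utf16` writes. The function name `utf16_units` is made up for this example. The units come back as ordinary u16 values, so a byte order only appears once they are turned into bytes, for instance with `to_be_bytes` or `to_le_bytes`.

```rust
/// Illustrative helper (name is an assumption): encode one char and
/// return its UTF-16 code units as plain u16 values.
fn utf16_units(c: char) -> Vec<u16> {
    // Two units are enough for any char (one unit, or a surrogate pair).
    let mut buf = [0u16; 2];
    // encode_utf16 returns the sub-slice of `buf` that it filled.
    c.encode_utf16(&mut buf).to_vec()
}
```

With this, 'A' gives the single unit 0x0041, and U+1D11E gives the surrogate pair 0xD834, 0xDD1E. Those values are the same on every platform; only their in-memory byte layout differs.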

The decode functions also assume native-endian UTF-16.

This makes sense as a default. If necessary, byte-order conversion can be done before decoding or after encoding by mapping each element of the &[u16] slice to the required endianness.
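A minimal sketch of that conversion, in both directions, might look like the following. The function names are hypothetical and not std APIs; the sketch relies only on `str::encode_utf16`, `std::char::decode_utf16`, and the `u16` byte-order methods.

```rust
/// Hypothetical helper: serialize `s` as UTF-16BE bytes (no BOM).
fn to_utf16be_bytes(s: &str) -> Vec<u8> {
    s.encode_utf16().flat_map(|unit| unit.to_be_bytes()).collect()
}

/// Hypothetical helper: serialize `s` as UTF-16LE bytes (no BOM).
fn to_utf16le_bytes(s: &str) -> Vec<u8> {
    s.encode_utf16().flat_map(|unit| unit.to_le_bytes()).collect()
}

/// Hypothetical helper: decode UTF-16BE bytes. Returns None for an odd
/// byte count or an unpaired surrogate.
fn from_utf16be_bytes(bytes: &[u8]) -> Option<String> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    // Rebuild each code unit from its big-endian byte pair before decoding.
    let units = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]));
    std::char::decode_utf16(units)
        .collect::<Result<String, _>>()
        .ok()
}
```

Because the byte order is fixed at the point of serialization, these helpers should behave the same regardless of the host platform's endianness.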

BartMassey commented 3 years ago

> encode_utf16 uses the platform's native endianness. This is clear from the source, where a u32 is cast directly to a u16 without any byte-order conversion.

Thanks. My read was too quick.

> I agree that it would be good to document this explicitly.

I can submit a PR if folks like.

> The decode functions also assume native-endian UTF-16.

I am now thoroughly confused, as usual. I swear I saw something with endianness somewhere in std, but I can't find it now.

Anyhow, I can add documentation about the endianness and the lack of a BOM in the appropriate spots. Let me know what you think about me putting a PR together.

Thanks!