hsivonen / encoding_rs

A Gecko-oriented implementation of the Encoding Standard in Rust
https://docs.rs/encoding_rs/

Can encode_from_utf16 store pending high surrogate? #82

Closed saschanaz closed 2 years ago

saschanaz commented 2 years ago
use encoding_rs;

fn utf16_to_utf8() {
    let mut encoder = encoding_rs::UTF_8.new_encoder();

    let src = [0xD83Du16]; // lone high surrogate
    let mut dst = [0u8;4];
    encoder.encode_from_utf16(&src, &mut dst, false);
    println!("{:?}", dst);

    let src = [0xDC99u16]; // low surrogate that would complete the pair
    let mut dst = [0u8;4];
    encoder.encode_from_utf16(&src, &mut dst, true);
    println!("{:?}", dst);
}

fn utf16_to_utf8_2() {
    let mut decoder = encoding_rs::UTF_16LE.new_decoder();

    let src = [0x3Du8, 0xD8u8]; // 0xD83D (high surrogate) as little-endian bytes
    let mut dst = [0u8;4];
    decoder.decode_to_utf8(&src, &mut dst, false);
    println!("{:?}", dst);

    let src = [0x99u8, 0xDCu8]; // 0xDC99 (low surrogate) as little-endian bytes
    let mut dst = [0u8;4];
    decoder.decode_to_utf8(&src, &mut dst, true);
    println!("{:?}", dst);
}

fn main() {
    utf16_to_utf8();
    utf16_to_utf8_2();
}

Per this sample code, it seems that only the decoder stores the pending high surrogate, while the encoder does not. This is counterintuitive to me. Is this intentional, is there a way to get equivalent behavior from the encoder, or is it a bug?
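
For reference, the difference should also be visible in the return tuples rather than only in the output buffers; this variant of the sample just prints the documented (CoderResult, read, written, flag) return values (the flag means replacements for the decoder and unmappables for the encoder):

use encoding_rs;

fn inspect_return_values() {
    let mut dst = [0u8; 8];

    // Encoder half: lone high surrogate first, then the low surrogate.
    let mut encoder = encoding_rs::UTF_8.new_encoder();
    println!("{:?}", encoder.encode_from_utf16(&[0xD83Du16], &mut dst, false));
    println!("{:?}", encoder.encode_from_utf16(&[0xDC99u16], &mut dst, true));

    // Decoder half: the same pair as little-endian UTF-16 bytes, split the same way.
    let mut decoder = encoding_rs::UTF_16LE.new_decoder();
    println!("{:?}", decoder.decode_to_utf8(&[0x3Du8, 0xD8u8], &mut dst, false));
    println!("{:?}", decoder.decode_to_utf8(&[0x99u8, 0xDCu8], &mut dst, true));
}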

hsivonen commented 2 years ago

This is intentional and documented.

The thinking behind this is that the decoder receives data from the wild, so stuff getting split across I/O buffers is normal and errors in the data are not programming errors in the application using the library.

In contrast, the encoder receives application-internal Unicode representations. In this case, the caller is expected to keep each of its internal buffers valid on a per-buffer basis. This is conceptually similar to the case of receiving application-internal UTF-8 and encoding it into output. However, in that case, UTF-8 validity is enforced on the type system level.
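
As a sketch of how the UTF-8 path looks in comparison (the target encoding below is picked arbitrarily for illustration): because the input type is &str, half of a surrogate pair cannot reach the encoder no matter how the caller chunks its text.

use encoding_rs;

// Encodes a sequence of &str chunks. &str guarantees valid UTF-8, so there is
// no notion of a pending surrogate between calls.
fn encode_str_chunks(chunks: &[&str]) -> Vec<u8> {
    let mut encoder = encoding_rs::SHIFT_JIS.new_encoder();
    let mut out = Vec::new();
    let mut buf = [0u8; 1024];
    for (i, chunk) in chunks.iter().enumerate() {
        let last = i + 1 == chunks.len();
        let mut remaining = *chunk;
        loop {
            let (result, read, written, _had_unmappables) =
                encoder.encode_from_utf8(remaining, &mut buf, last);
            out.extend_from_slice(&buf[..written]);
            remaining = &remaining[read..];
            match result {
                encoding_rs::CoderResult::InputEmpty => break,
                encoding_rs::CoderResult::OutputFull => continue,
            }
        }
    }
    out
}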

This design thinking doesn't fit as well when the encoder receives data from a JavaScript engine: there, the JavaScript program comes from the wild, and the encoder input is a sequence of DOMStrings as opposed to a sequence of USVStrings.

Sorry about this design decision not being a great fit for your use case. However, I'm reluctant to change the encoding_rs-level design here.
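
That said, a caller that needs the decoder-like behavior can get it without any encoding_rs changes by buffering the pending high surrogate itself. A rough sketch (the struct and its names are illustrative, not an encoding_rs API):

use encoding_rs::Encoder;

// Carries a pending high surrogate across encode_from_utf16 calls so that a
// pair split between two chunks is re-joined before the encoder sees it.
struct Utf16ChunkEncoder {
    encoder: Encoder,
    pending_high: Option<u16>,
}

impl Utf16ChunkEncoder {
    fn new(encoder: Encoder) -> Self {
        Self { encoder, pending_high: None }
    }

    fn encode_chunk(&mut self, chunk: &[u16], last: bool, out: &mut Vec<u8>) {
        // Re-attach a high surrogate held back from the previous chunk.
        let mut src: Vec<u16> = Vec::with_capacity(chunk.len() + 1);
        if let Some(high) = self.pending_high.take() {
            src.push(high);
        }
        src.extend_from_slice(chunk);

        // If a non-final chunk ends with a lone high surrogate, hold it back
        // instead of letting the encoder replace it with U+FFFD.
        if !last {
            if let Some(&unit) = src.last() {
                if (0xD800u16..0xDC00u16).contains(&unit) {
                    self.pending_high = Some(unit);
                    src.pop();
                }
            }
        }

        let mut buf = [0u8; 1024];
        let mut remaining = &src[..];
        loop {
            let (result, read, written, _) =
                self.encoder.encode_from_utf16(remaining, &mut buf, last);
            out.extend_from_slice(&buf[..written]);
            remaining = &remaining[read..];
            match result {
                encoding_rs::CoderResult::InputEmpty => break,
                encoding_rs::CoderResult::OutputFull => continue,
            }
        }
    }
}

With something like this in front of encode_from_utf16, a pair split across two chunks (as in the sample at the top) should come out as one four-byte UTF-8 sequence rather than two replacement characters.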