hsivonen / encoding_rs

A Gecko-oriented implementation of the Encoding Standard in Rust
https://docs.rs/encoding_rs/

Can encode_from_utf16 store pending high surrogate? #82

Closed saschanaz closed 2 years ago

saschanaz commented 2 years ago
use encoding_rs;

fn utf16_to_utf8() {
    let mut encoder = encoding_rs::UTF_8.new_encoder();

    let src = [0xD83Du16]; // lone high surrogate
    let mut dst = [0u8;4];
    encoder.encode_from_utf16(&src, &mut dst, false);
    println!("{:?}", dst);

    let src = [0xDC99u16]; // low surrogate that would complete the pair
    let mut dst = [0u8;4];
    encoder.encode_from_utf16(&src, &mut dst, true);
    println!("{:?}", dst);
}

fn utf16_to_utf8_2() {
    let mut decoder = encoding_rs::UTF_16LE.new_decoder();

    let src = [0x3Du8, 0xD8u8]; // 0xD83D (high surrogate) as little-endian bytes
    let mut dst = [0u8;4];
    decoder.decode_to_utf8(&src, &mut dst, false);
    println!("{:?}", dst);

    let src = [0x99u8, 0xDCu8]; // 0xDC99 (low surrogate) as little-endian bytes
    let mut dst = [0u8;4];
    decoder.decode_to_utf8(&src, &mut dst, true);
    println!("{:?}", dst);
}

fn main() {
    utf16_to_utf8();
    utf16_to_utf8_2();
}

Per this sample code, it seems that only the decoder stores the pending high surrogate, while the encoder does not. This is counterintuitive to me. Is this intentional, is there a way to get equivalent behavior from the encoder, or is it a bug?
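
For reference, the difference should also be visible in the return tuples rather than only in the output buffers; this variant of the sample just prints the documented (CoderResult, read, written, flag) return values (the flag means replacements for the decoder and unmappables for the encoder):

use encoding_rs;

fn inspect_return_values() {
    let mut dst = [0u8; 8];

    // Encoder half: lone high surrogate first, then the low surrogate.
    let mut encoder = encoding_rs::UTF_8.new_encoder();
    println!("{:?}", encoder.encode_from_utf16(&[0xD83Du16], &mut dst, false));
    println!("{:?}", encoder.encode_from_utf16(&[0xDC99u16], &mut dst, true));

    // Decoder half: the same pair as little-endian UTF-16 bytes, split the same way.
    let mut decoder = encoding_rs::UTF_16LE.new_decoder();
    println!("{:?}", decoder.decode_to_utf8(&[0x3Du8, 0xD8u8], &mut dst, false));
    println!("{:?}", decoder.decode_to_utf8(&[0x99u8, 0xDCu8], &mut dst, true));
}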

hsivonen commented 2 years ago

This is intentional and documented.

The thinking behind this is that the decoder receives data from the wild, so stuff getting split across I/O buffers is normal and errors in the data are not programming errors in the application using the library.

In contrast, the encoder receives application-internal Unicode representations. In this case, the caller is expected to keep each of its internal buffers valid on a per-buffer basis. This is conceptually similar to the case of receiving application-internal UTF-8 and encoding it into output. However, in that case, UTF-8 validity is enforced on the type system level.
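
As a sketch of how the UTF-8 path looks in comparison (the target encoding below is picked arbitrarily for illustration): because the input type is &str, half of a surrogate pair cannot reach the encoder no matter how the caller chunks its text.

use encoding_rs;

// Encodes a sequence of &str chunks. &str guarantees valid UTF-8, so there is
// no notion of a pending surrogate between calls.
fn encode_str_chunks(chunks: &[&str]) -> Vec<u8> {
    let mut encoder = encoding_rs::SHIFT_JIS.new_encoder();
    let mut out = Vec::new();
    let mut buf = [0u8; 1024];
    for (i, chunk) in chunks.iter().enumerate() {
        let last = i + 1 == chunks.len();
        let mut remaining = *chunk;
        loop {
            let (result, read, written, _had_unmappables) =
                encoder.encode_from_utf8(remaining, &mut buf, last);
            out.extend_from_slice(&buf[..written]);
            remaining = &remaining[read..];
            match result {
                encoding_rs::CoderResult::InputEmpty => break,
                encoding_rs::CoderResult::OutputFull => continue,
            }
        }
    }
    out
}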

This design thinking doesn't fit as well when the encoder receives data from a JavaScript engine: there, the JavaScript program comes from the wild, and the encoder input is a sequence of DOMStrings as opposed to a sequence of USVStrings.

Sorry about this design decision not being a great fit for your use case. However, I'm reluctant to change the encoding_rs-level design here.
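
That said, a caller that needs the decoder-like behavior can get it without any encoding_rs changes by buffering the pending high surrogate itself. A rough sketch (the struct and its names are illustrative, not an encoding_rs API):

use encoding_rs::Encoder;

// Carries a pending high surrogate across encode_from_utf16 calls so that a
// pair split between two chunks is re-joined before the encoder sees it.
struct Utf16ChunkEncoder {
    encoder: Encoder,
    pending_high: Option<u16>,
}

impl Utf16ChunkEncoder {
    fn new(encoder: Encoder) -> Self {
        Self { encoder, pending_high: None }
    }

    fn encode_chunk(&mut self, chunk: &[u16], last: bool, out: &mut Vec<u8>) {
        // Re-attach a high surrogate held back from the previous chunk.
        let mut src: Vec<u16> = Vec::with_capacity(chunk.len() + 1);
        if let Some(high) = self.pending_high.take() {
            src.push(high);
        }
        src.extend_from_slice(chunk);

        // If a non-final chunk ends with a lone high surrogate, hold it back
        // instead of letting the encoder replace it with U+FFFD.
        if !last {
            if let Some(&unit) = src.last() {
                if (0xD800u16..0xDC00u16).contains(&unit) {
                    self.pending_high = Some(unit);
                    src.pop();
                }
            }
        }

        let mut buf = [0u8; 1024];
        let mut remaining = &src[..];
        loop {
            let (result, read, written, _) =
                self.encoder.encode_from_utf16(remaining, &mut buf, last);
            out.extend_from_slice(&buf[..written]);
            remaining = &remaining[read..];
            match result {
                encoding_rs::CoderResult::InputEmpty => break,
                encoding_rs::CoderResult::OutputFull => continue,
            }
        }
    }
}

With something like this in front of encode_from_utf16, a pair split across two chunks (as in the sample at the top) should come out as one four-byte UTF-8 sequence rather than two replacement characters.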