hsivonen / encoding_rs

A Gecko-oriented implementation of the Encoding Standard in Rust
https://docs.rs/encoding_rs/
Other
384 stars 55 forks source link

Non-streaming decode() appears to remove the BOM? #88

Closed dralley closed 2 years ago

dralley commented 2 years ago

https://github.com/hsivonen/encoding_rs/blob/d4d7d2a99aac266ecf6938c3832aefaaf8c1e52b/src/lib.rs#L2974-L2980

Functionally, decode() and decode_with_bom_removal() seem pretty much the same? That doesn't seem correct? If there's a variant called "decode_with_bom_removal" then I would expect the standard variant not to remove the BOM.

Compare to:

https://github.com/hsivonen/encoding_rs/blob/d4d7d2a99aac266ecf6938c3832aefaaf8c1e52b/src/lib.rs#L3019-L3030

It's totally valid to decode the BOM, the BOM is a unicode character like any other character. Decoding a UTF-16 document with a BOM should yield a UTF-8 document with a BOM. Otherwise, you would just use the BOM-removing version...

use encoding_rs::*;

fn main() {
    // Two characters, '1' and then BOM character
    println!("{:?}", UTF_16LE.decode(&[0x31, 0x00, 0xFF, 0xFE]).0.as_bytes()); 
    // Nothing - BOM removed
    println!("{:?}", UTF_16LE.decode(&[0xFF, 0xFE]).0.as_bytes());
}
[49, 239, 187, 191]
[]
dralley commented 2 years ago

The same thing happens with the streaming API. Am I missing something? Why does the decoder produced by new_decoder() strip the BOM when there is a separate method named new_decoder_with_bom_removal() that I am explicitly not using?

use encoding_rs::*; 

fn main() {

    let mut buf = [0; 100];

    dbg!(UTF_16LE.new_decoder().decode_to_utf8(
        &[0x31, 0x00, 0xFF, 0xFE],
        &mut buf,
        true,
    ));

    println!("{:?}", &buf);

    let mut buf = [0; 100];

    dbg!(UTF_16LE.new_decoder().decode_to_utf8(
        &[0xFF, 0xFE],
        &mut buf,
        true,
    ));

    println!("{:?}", &buf);
}
[src/main.rs:11] UTF_16LE.new_decoder().decode_to_utf8(&[0x31, 0x00, 0xFF, 0xFE], &mut buf,
    true) = (
    InputEmpty,
    4,
    4,
    false,
)
[src/main.rs:21] UTF_16LE.new_decoder().decode_to_utf8(&[0xFF, 0xFE], &mut buf, true) = (
    InputEmpty,
    2,
    0,
    false,
)

[49, 239, 187, 191, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
dralley commented 2 years ago

@hsivonen Is there a detail that I'm missing?

dralley commented 2 years ago

I could try to put up a PR, I just want to verify that there's not a reason behind this

hsivonen commented 2 years ago

AFAICT, these are working as documented.

dralley commented 2 years ago

@hsivonen If what you say is true then the name of decode_with_bom_removal() doesn't indicate the actual difference between decode_with_bom_removal() and decode(). At minimum that is highly confusing.

It sounds like there is no way to respect the BOM (use the BOM-indicated encoding) without removing it, which is problematic - it's valid unicode, and sometimes you do want to keep it. For instance, if the data came from UTF-16 LE, and you want to perform some processing on it as UTF-8, before re-encoding it as UTF-16 LE. You could write the BOM separately, but that is sometimes less convenient than simply leaving it there to begin with.

dralley commented 2 years ago

Furthermore I don't see any documentation indicating that decode() removes the BOM, could you point out where it is documented?

dralley commented 2 years ago

So ignoring the documentation issue, I can understand why this is the case due to the way the WHATWG spec defines how decode should be done (though the name "_with_bom_removal()" is still misleading and doesn't quite match what the spec does).

Would you be open to merging function which did not do this?

hsivonen commented 2 years ago

Furthermore I don't see any documentation indicating that decode() removes the BOM, could you point out where it is documented?

Good point. It says "BOM sniffing" without saying explicitly that BOM sniffing means BOM removal.

What's the use case for deciding the encoding from BOM but still letting the BOM show through to output?

dralley commented 2 years ago

I'm working on improving the encoding support of quick-xml. One idea I had is that a user may want to get either:

The latter would potentially simplify making edits to existing documents, given that some software wants to see a UTF-8 BOM even though the spec recommends against it.

In order to pull that off, the BOM would need to not be stripped automatically. Whether it gets removed or not (alongside other things) would be handled by the Reader.

I'm open to a counterargument that it's not a good idea, though. One potential issue with the use case I just described is that there would be no way to automatically handle indentation for new additions into the document, because automatic indentation would need to be turned off to "exactly reproduce" everything else. But I think we would still need to know whether the BOM was present, at least.

hsivonen commented 2 years ago

The BOM isn't part of the XML information set, so an XML use case hasn't arisen before. Does you exact reproduction mode also retain whitespace on either side of the equals sign for attributes, etc.?

Since this issue is filed against the non-streaming API, you can easily find out if there was a BOM by inspecting the buffer with Encoding::for_bom().

dralley commented 2 years ago

Well, the streaming API does the same thing as mentioned in the first comment after the OP, I just first encountered this while doing some experimentation. The streaming API is what we would be using in practice.

The BOM isn't part of the XML information set

It's not, but neither is inter-element whitespace or any data prior to the XML declaration, hence why I figured it might make sense to treat them similarly. The BOM is relevant to some (mostly Windows) software even though, for UTF-8, it shouldn't be.

Does you exact reproduction mode also retain whitespace on either side of the equals sign for attributes, etc.?

At the moment this is just a concept, the actual encoding support is more important and is the main focus. You're right that it's non-trivial and would have to encompass a lot more than just whitespace between events. But I think that would be possible. Whether it's actually worthwhile, I'm not sure yet.

hsivonen commented 2 years ago

Well, the streaming API does the same thing as mentioned in the first comment after the OP, I just first encountered this while doing some experimentation. The streaming API is what we would be using in practice.

I see. Still, I think at least for now, I'm going to continue to treat this as out of scope, since the BOM with BOM semantics isn't supposed to be part of the logical content of the stream.

dralley commented 2 years ago

the BOM with BOM semantics isn't supposed to be part of the logical content of the stream.

This is the part I'm unsure on. Nothing about the BOM is particularly special, it's just a unicode character, etc. Isn't it just part of the document, then? Unicode says it should present as an invisible zero-width non-breaking space, which implies that it is expected to make it through to the presentation layer sometimes, and not just be stripped off in all cases.

Various documents I've read don't spell it out explicitly but they make it seem like part of the data stream itself, more akin to the #! shebang line at the beginning of a Python/Perl/Bash script than magic bytes at the beginning of a zip file. In one sense, the shebang line isn't "part of the script", but in another sense, it's just as much a part of the script as any other comment in the script.

https://www.w3.org/International/questions/qa-byte-order-mark https://unicode.org/faq/utf_bom.html#bom1

hsivonen commented 2 years ago

See The Unicode Standard 15.0 D98 second bullet last sentence on page 131 (PDF page 157). See also the note in The Encoding Standard that explains that the naming doesn't match the naming in The Unicode Standard.

Note that the XML spec says of the BOM: "This is an encoding signature, not part of either the markup or the character data of the XML document." Also, the BNF in the XML spec doesn't consider the BOM part of the textual syntax.

dralley commented 2 years ago

That appears to be correct for UTF-16. Regarding UTF-8, the bullet points under D95 on the previous page are frustratingly unclear - it has no such language and actually seems to imply the opposite.

While there is obviously no need for a byte order signature when using UTF-8, there are occasions when processes convert UTF-16 or UTF-32 data contain- ing a byte order mark into UTF-8. When represented in UTF-8, the byte order mark turns into the byte sequence . Its usage at the beginning of a UTF-8 data stream is neither required nor recommended by the Unicode Stan- dard, but its presence does not affect conformance to the UTF-8 encoding scheme. Identification of the byte sequence at the beginning of a data stream can, however, be taken as a near-certain indication that the data stream is using the UTF-8 encoding scheme.

But considering I really don't feel like lawyering the spec to that degree, point taken, we should probably just accept that the BOM will be stripped and force the user to write it themselves on the output side, if they want it.

I will file a new issue about documentation.