lifthrasiir / rust-encoding

Character encoding support for Rust
MIT License

BOM-aware Unicode encodings #17

Open lifthrasiir opened 11 years ago

lifthrasiir commented 11 years ago

This issue was spotted during the removal of TextEncoder and TextDecoder (#4). TextDecoder can automatically strip a leading BOM (U+FEFF) from the input string, if present. We need to emulate this in a separate encoding, perhaps BOMAwareUTF8Encoding (whose whatwg_name() is still utf-8)? This use case itself can be handled better by decoders with a fallback encoding (#19), but we may still need BOM-attached Unicode encodings from time to time: many applications of UTF-16 require a BOM, for example.
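For illustration, the BOM-stripping behaviour described above could look roughly like the following for UTF-8. This is only a sketch using the standard library; `decode_utf8_bom_aware` is a hypothetical name and not part of the rust-encoding API.

```rust
use std::str::Utf8Error;

/// Hypothetical helper: decode UTF-8 strictly, then drop one leading
/// U+FEFF (encoded as EF BB BF) if the input starts with it.
pub fn decode_utf8_bom_aware(input: &[u8]) -> Result<&str, Utf8Error> {
    let s = std::str::from_utf8(input)?;
    // Only a single leading BOM is removed; any later U+FEFF is kept as data.
    Ok(s.strip_prefix('\u{FEFF}').unwrap_or(s))
}
```

Input without a BOM passes through unchanged, so the helper is safe to apply unconditionally.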

SimonSapin commented 11 years ago

I think that BOMAwareUTF8Encoding is the wrong approach. Rather, what’s needed is what the spec calls decode.

It could be a BOMDecoder (or other name) that takes a "fallback encoding" parameter. When the input starts with a BOM, the BOM is stripped and the corresponding encoding is used. Otherwise, the fallback encoding is used.

This decoder should always be used for formats that support multiple encodings, because the BOM (by proximity) is more accurate than other metadata.
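A minimal sketch of the BOM-sniffing decoder with a fallback described above, using only the standard library. The names `Detected`, `sniff_bom` and `decode_with_fallback` are made up for this example and are not the rust-encoding API; malformed input is replaced with U+FFFD here, which is an assumption rather than something settled in this thread.

```rust
/// Encodings that a BOM can identify (hypothetical enum for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Detected {
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// Returns the encoding signalled by a leading BOM, plus the BOM length in bytes.
pub fn sniff_bom(input: &[u8]) -> Option<(Detected, usize)> {
    if input.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some((Detected::Utf8, 3))
    } else if input.starts_with(&[0xFF, 0xFE]) {
        Some((Detected::Utf16Le, 2))
    } else if input.starts_with(&[0xFE, 0xFF]) {
        Some((Detected::Utf16Be, 2))
    } else {
        None
    }
}

/// Strips the BOM and uses its encoding if present; otherwise uses `fallback`.
pub fn decode_with_fallback(input: &[u8], fallback: Detected) -> String {
    let (encoding, skip) = sniff_bom(input).unwrap_or((fallback, 0));
    let body = &input[skip..];
    match encoding {
        Detected::Utf8 => String::from_utf8_lossy(body).into_owned(),
        Detected::Utf16Le => decode_utf16(body, u16::from_le_bytes),
        Detected::Utf16Be => decode_utf16(body, u16::from_be_bytes),
    }
}

/// Decodes UTF-16 code units; unpaired surrogates become U+FFFD and a
/// trailing odd byte is ignored (a simplification for this sketch).
fn decode_utf16(body: &[u8], to_u16: fn([u8; 2]) -> u16) -> String {
    let units = body.chunks_exact(2).map(|pair| to_u16([pair[0], pair[1]]));
    std::char::decode_utf16(units)
        .map(|r| r.unwrap_or(std::char::REPLACEMENT_CHARACTER))
        .collect()
}
```

With this shape, a caller passes whatever other metadata suggests as `fallback`, and a BOM in the input always overrides it.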

lifthrasiir commented 11 years ago

@SimonSapin I have updated the description. I agree that this use case should be handled differently; see #19 for a separate discussion. A BOM-aware encoding might still be useful on its own, though.