serde-rs / json

Strongly typed JSON library for Rust
Apache License 2.0

Deserializing error with UTF-8 BOM (Byte Order Mark) Content #1115

Open zenoxs opened 7 months ago

zenoxs commented 7 months ago

Deserializing Error with UTF-8 BOM (Byte Order Mark) Content

I encounter an error when deserializing a UTF-8 string that starts with a Byte Order Mark (BOM). The deserializer returns the following error: Error("expected value", line: 1, column: 1).

How to Reproduce

To reproduce the issue, save a JSON file as UTF-8 with BOM and deserialize its contents with from_reader or from_str.

Workaround

As a temporary workaround, I check whether the file content begins with the three-byte UTF-8 BOM and remove those bytes if present:

use std::fs;

fn main() {
    // Specify the path to your file
    let file_path = "path/to/your/file_with_bom.json";

    // Read the file to a Vec<u8>
    let mut data = fs::read(file_path).unwrap();

    // UTF-8 BOM is three bytes: EF BB BF
    if data.starts_with(&[0xEF, 0xBB, 0xBF]) {
        // Remove the first three bytes (the BOM) in place, without reallocating
        data.drain(..3);
    }

    // Proceed with deserialization...
}
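
The same check can be done without copying the buffer, and it also covers the from_str case, where the BOM shows up as the single character U+FEFF at the start of the string. The following is a minimal sketch using only the standard library; the names `UTF8_BOM`, `strip_bom_bytes` and `strip_bom_str` are hypothetical helpers for illustration and are not part of serde_json:

```rust
/// UTF-8 encoding of U+FEFF, the byte order mark.
const UTF8_BOM: &[u8] = &[0xEF, 0xBB, 0xBF];

/// Returns `bytes` without a leading UTF-8 BOM.
/// The result borrows from the input, so nothing is copied.
fn strip_bom_bytes(bytes: &[u8]) -> &[u8] {
    bytes.strip_prefix(UTF8_BOM).unwrap_or(bytes)
}

/// Returns `text` without a leading U+FEFF character.
/// Input that has no BOM is returned unchanged.
fn strip_bom_str(text: &str) -> &str {
    text.strip_prefix('\u{FEFF}').unwrap_or(text)
}
```

The returned slice or string can then be passed to serde_json::from_slice or serde_json::from_str respectively.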
valaphee commented 4 months ago

One way would be to handle it in Rust itself (https://github.com/rust-lang/rfcs/issues/2428); at least IETF RFC 3629 doesn't forbid it (even though I'm personally against it, as it is a protocol detail).

But your file is theoretically not compliant with IETF RFC 7159 (even though this is also not strictly forbidden by the aforementioned RFC, as it's a protocol detail):

Implementations MUST NOT add a byte order mark to the beginning of a JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.

Either way, it's at least totally valid to ignore the BOM and still be conformant.
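
For the from_reader case, "ignoring" the BOM on the caller's side could look roughly like the sketch below, which peeks at a buffered reader and consumes the three BOM bytes if they are there. This is only an illustration using the standard library: `skip_utf8_bom` is a hypothetical helper, not something serde_json provides, and it assumes the first buffer fill returns at least three bytes whenever the input has that many (typically true for a BufReader over a file, not guaranteed for pipes or sockets).

```rust
use std::io::{self, BufRead};

/// Consumes a leading UTF-8 BOM (EF BB BF) from `reader` if one is present.
/// Returns `Ok(true)` when a BOM was skipped and `Ok(false)` otherwise.
fn skip_utf8_bom<R: BufRead>(reader: &mut R) -> io::Result<bool> {
    // Assumption: the first `fill_buf` call yields at least three bytes
    // whenever the input is that long. A short first read would make this
    // miss the BOM and leave the input untouched.
    let has_bom = reader.fill_buf()?.starts_with(&[0xEF, 0xBB, 0xBF]);
    if has_bom {
        // Advance past the BOM so the next read starts at the JSON text.
        reader.consume(3);
    }
    Ok(has_bom)
}
```

After calling it on a `BufReader` wrapping the file, the same reader can be handed to serde_json::from_reader, which then starts at the first byte of the JSON text.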