http-rs / surf

Fast and friendly HTTP client framework for async Rust
https://docs.rs/surf
Apache License 2.0
1.45k stars 119 forks source link

Bug: Response::body_json could not convert from an UTF-8 BOM source file #228

Open usagi opened 3 years ago

usagi commented 3 years ago

Target

Detail

Repro

  1. Prepare an UTF-8 BOM JSON file.; BOM bytes is 0xef 0xbb 0xbf.
  2. Deploy the JSON file and run a httpd.
  3. surf::get and body_json => SerdeJsonError(Error("expected value", line: 1, column: 1))

Workaround

// It will be the error if response has the BOM.
// let my_struct = my_response.body_json::<MyStruct>();
  1. Raw implementation:
let my_response_body_string = my_response.body_string();
let my_response_body_string = if my_response_body_string.starts_with("\u{feff}") { &my_response_body_string[3..] } else { &my_response_body_string[..] };
let my_struct: MyStruct = serde_json::from_str(my_response_body_string);
  1. Or use strip_bom:
use strip_bom::StripBom;

let my_response_body_string = my_response.body_string();
let my_response_body_string = my_response_body_string.strip_bom();
let my_struct: MyStruct = serde_json::from_str(my_response_body_string);

Notes:

  1. This error is SerdeJsonError, but a character encoding issue might not issue of serde_json or serde crate. Because, serde_json is designed for UTF-8 stream body only, not for a file directly or BOM. Thus, I think the BOM issue is not a serde_json or serde issue, it is a library user side issue. If you authors think it is not a surf issue, it is a serde_json or serde issue then I'll throw the issue to serde_json or serde repos.
    • serde_json::from_reader (for an IO stream, ≈ Input UTF-8 character stream)
    • serde_json::from_str (≈Input UTF-8 character stream).
  2. It might be not a std::string::String issue because https://github.com/rust-lang/rfcs/issues/2428
  3. JSON (ECMA-404, RFC8259) is defined the character encoding to use only UTF-8, but the BOM is not defined. Thus, JSON file with UTF-8 BOM is not an invalid in specification. Yes, of course I was think nobody who use UTF-8 BOM to JSON file before today, but a Japanese government institution used it. 🥺
goto-bus-stop commented 3 years ago

Is this still true in the 2.0 alphas? 2.0.0-alpha.0 and later include support for non-UTF8 encodings, but that code should also handle UTF-with-BOM.

usagi commented 3 years ago

@goto-bus-stop Unfortunately yes, this problem occur in these any versions:

Example repro code

main.rs

use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize)]
struct Sushi {
    sushi: String,
}

fn main() {
    // UTF-8N; `{ "sushi": "🍣" }`
    let mut r = smol::block_on(surf::get("http://localhost:50000/n-sushi.txt")).unwrap();
    let sushi = smol::block_on(r.body_json::<Sushi>()).unwrap();
    println!("n{:?}", &sushi);
    // UTF-8 BOM; `\u{feff}{ "sushi": "🍣" }`
    let mut r = smol::block_on(surf::get("http://localhost:50000/bom-sushi.txt")).unwrap();
    let sushi = smol::block_on(r.body_json::<Sushi>()).unwrap();
    println!("b{:?}", &sushi);
}
[dependencies]
serde = {version = "1.0.115", features = ["derive"]}
serde_json = "1.0.57"
smol = "1.0.0"
surf = {git = "https://github.com/http-rs/surf", tag = "v2.0.0-alpha.5"}
#surf ="1.0.3"

Example UTF-8N and UTF-8 BOM files

  1. hexdump --canonical n-sushi.txt # It's UTF-8N(Reference)
00000000  7b 20 22 73 75 73 68 69  22 3a 20 22 f0 9f 8d a3  |{ "sushi": "....|
00000010  22 20 7d                                          |" }|
00000013
  1. hexdump --canonical bom-sushi.txt # It's UTF-8 BOM (Target of this test)
00000000  ef bb bf 7b 20 22 73 75  73 68 69 22 3a 20 22 f0  |...{ "sushi": ".|
00000010  9f 8d a3 22 20 7d                                 |..." }|
00000016

Repro procedure

  1. simple-http-server -p 50000
  2. cargo run

And then I got the result:

// cargo process

    Finished dev [unoptimized + debuginfo] target(s) in 0.41s
     Running `target\debug\se.exe`

// The result part of the first UTF-8N reference

nSushi { sushi: "🍣" }

// The result part of the second UTF-8 BOM testing

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: expected value at line 1 column 1', src\main.rs:15:56
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
error: process didn't exit successfully: `target\debug\se.exe` (exit code: 101)