toml-rs / toml

Rust TOML Parser
https://docs.rs/toml
Apache License 2.0
684 stars 103 forks source link

Uneeded escapes with multiline string character escapes #752

Open nathaniel-daniel opened 2 months ago

nathaniel-daniel commented 2 months ago

Hello, I'm trying to write and load a string with newlines and tabs with toml, and I'm seeing some strange behavior with CR and HT, though this might be the intended behavior.

Here's a small reproduction:

// [dependencies]
// serde = { version = "1.0.204", features = ["derive"] }
// toml = "0.8.14"

#[derive(Debug, serde::Serialize, serde::Deserialize)]
struct Simple {
    message: String,
}

fn main() {
    let simple = Simple {
        message: "\tHello\r\nWorld!".into(),
    };
    let output = toml::to_string(&simple).unwrap();
    println!("{output}");
    let round_tripped: Simple = toml::from_str(&output).unwrap();

    assert!(simple.message == round_tripped.message);
}

I would expect (or prefer) the output to be:

message = """
    Hello
World!"""

Instead, the output is:

message = """
\tHello\r
World!"""

CR and HT are escaped.

According to the toml spec, for multiline strings:

Any Unicode character may be used except those that must be escaped: backslash and the control characters other than tab, line feed, and carriage return (U+0000 to U+0008, U+000B, U+000C, U+000E to U+001F, U+007F)

It looks like this library is escaping some characters that don't necessarily need to be escaped. However, I request that at least CR be written unescaped, as the library is currently splitting CRLFs. This makes interaction with any line ending normalizers, like git, very messy.

I would also prefer tabs to be written unescaped as well. I am using tabs to format the interior content of a multiline string, and the formatting is lost (at least visually) when passed through this library.

epage commented 2 months ago

I'm fine with adding \t. For reference, the code for this lives at https://github.com/toml-rs/toml/blob/c76675627684c53e6101b9ea35ba90b1b8428cbd/crates/toml_edit/src/encode.rs#L349-L366

I believe \r would require more careful handling as I think \r can only show up as unescaped when preceding a \n. To confirm that, the grammar is at https://github.com/toml-lang/toml/blob/main/toml.abnf

nathaniel-daniel commented 2 months ago

I believe \r would require more careful handling as I think \r can only show up as unescaped when preceding a \n.

On I missed that. I think you're right. "Carriage returns (U+000D) are only allowed as part of a newline sequence." Doesn't seem that much more difficult to peek the next char at least.

That's a bit odd though, I could have sworn Macs used CR for newlines.

I've implemented the fix here, but ran into a few issues.

One issue is with the valid/spec/string-2.toml test:

---- valid/spec/string-2.toml ----
Unexpected decoding.
Expected
{
  "str2": {
    "type": "string",
    "value": "Roses are red\nViolets are blue"
  },
  "str3": {
    "type": "string",
    "value": "Roses are red\r\nViolets are blue"
  }
}
Actual
{
  "str3": {
    "type": "string",
    "value": "Roses are red\nViolets are blue"
  },
  "str2": {
    "type": "string",
    "value": "Roses are red\nViolets are blue"
  }
}

The relevant toml file is:

# On a Unix system, the above multi-line string will most likely be the same as:
str2 = "Roses are red\nViolets are blue"

# On a Windows system, it will most likely be equivalent to:
str3 = "Roses are red\r\nViolets are blue"

Previously, this library would load str3 as something like:

"""
Roses are red\r
Violets are blue"""

Note that \r is a \ and an r character, not CR. I think this is a bug because that isn't what \r means in the context of a toml string.

With my patch:

message = """
Roses are red
Violets are blue"""

I think this second version is more correct, as it properly parses the CRLF. However, this made me realize that this library seems to be converting CRLFs into LFs on-load, which is why the test fails; It previously did not transform "\r\n" into "\n".

Implementing better CRLF handling, I ran into another issue with the pretty::pretty_tricky test. Here's the failure:

---- pretty::pretty_tricky stdout ----
thread 'pretty::pretty_tricky' panicked at crates\toml\tests\testsuite\pretty.rs:86:5:

--- Expected
++++ actual:   In-memory
   1    1 | [example]
   2    2 | f = "\f"
   3    3 | glass = """
   4    4 | Nothing too unusual, except that I can eat glass in:
   5    5 | - Greek: Μπορώ να φάω σπασμένα γυαλιά χωρίς να πάθω τίποτα.
          ⋮
   7    7 | - Hindi: मैं काँच खा सकता हूँ, मुझे उस से कोई पीडा नहीं होती.
   8    8 | - Japanese: 私はガラスを食べられます。それは私を傷つけません。
   9    9 | """
  10   10 | r = "\r"
  11   11 | r_newline = """
  12      - \r
       12 +
  13   13 | """
  14   14 | single = "this is a single line but has '' cuz it's tricky"
  15   15 | single_tricky = "single line with ''' in it"
  16   16 | tabs = """
  17   17 | this is pretty standard
  18      - \texcept for some   \ttabs right here
       18 +     except for some         tabs right here
  19   19 | """
  20   20 | text = """
  21   21 | this is the first line.
  22   22 | This has a ''' in it and \"\"\" cuz it's tricky yo
  23   23 | Also ' and \" because why not
  24   24 | this is the fourth line
  25   25 | """

Updating the \ts to tab characters is easy enough. However, the issue is with the \r character. Currently, this is a CR followed by an LF, separately. During writing, the CR escape is processed into a raw CR since it precedes an LF. Then, the raw LF is added to the string. As a result, the library can't differentiate between \r\n and \\r\n, which I think is fine since the raw string value is the same for both strings; data isn't really lost, only the representation is different. I think the only way to fix this is to track whether each character was created from an escape. Alternatively, I guess just always escaping \r is an option but that seems like a bad solution imo since I think not splitting CRLFs when needed is worse that combining CRLFs when not needed. Also, I'm not entirely sure what that test is supposed to catch, since apparently CR newlines in toml aren't a thing.

epage commented 2 months ago

Also, looks like we carried over this behavior from toml 0.5 (0.6 was a rewrite). https://github.com/toml-rs/toml-rs/blob/464e7153879be75965eb067062cc8cc6743b7eb6/src/ser.rs#L638-L655

Before moving forward on this, we should look into what went into the decisions back then.

nathaniel-daniel commented 2 months ago

Also, looks like we carried over this behavior from toml 0.5 (0.6 was a rewrite). https://github.com/toml-rs/toml-rs/blob/464e7153879be75965eb067062cc8cc6743b7eb6/src/ser.rs#L638-L655

Before moving forward on this, we should look into what went into the decisions back then.

Looks like it was carried over from the rewrite from rustc-serialize, which was present from essentially the start of the crate, with no reasoning.

Guessing as to why, here's the earliest version of toml's abnf that I could find. It appears that \t used to be needed to be escaped inside a string. Also, I don't think any of the past iterations of this library supported emitting a multiline string.

epage commented 1 month ago

I think I'm willing to give it a shot. I do know some users are very touchy around output changes, so I'm going to mark this as a breaking change. I'll also likely reach out to potentially affected people, like Cargo.

nathaniel-daniel commented 1 month ago

Great, thanks.

Also, should I file the bug from earlier as a separate issue?

const TOML: &str = "str2 = \"\"\"\nRoses are red\nViolets are blue\"\"\"\nstr3 = \"\"\"\nRoses are red\r\nViolets are blue\"\"\"";

fn main() {
    let data: toml::Table = toml::from_str(TOML).unwrap();
    let actual_str3 = data.get("str3").unwrap().as_str().unwrap();
    let expected_str3 = "Roses are red\r\nViolets are blue";
    assert!(actual_str3 == expected_str3, "{actual_str3:?} != {expected_str3:?}");
}
thread 'main' panicked at src/main.rs:7:5:
"Roses are red\nViolets are blue" != "Roses are red\r\nViolets are blue"
epage commented 1 month ago

The issue being that we convert \r\n to \n? I guess one could be created but unsure how much we'll care.