Unicode characters incorrectly escaped in encode

samhh commented 4 years ago

At least, that's what I think is happening. In a REPL, the following will fail:

Toml.decode (Toml.text "k") $ Toml.encode (Toml.text "k") "ü"
-- Left [ParseError (TomlParseError {unTomlParseError = "1:8:\n  |\n1 | k = \"\\252\"\n  |        ^\nInvalid escape sequence: \\2\n"})]

Looking at the encoding, here's what we're given:

Toml.encode (Toml.text "k") "ü"
-- "k = \"\\252\"\n"

And passing that into decode will fail per the above:

Toml.decode (Toml.text "k") "k = \"\\252\"\n"
-- Left [ParseError (TomlParseError {unTomlParseError = "1:8:\n  |\n1 | k = \"\\252\"\n  |        ^\nInvalid escape sequence: \\2\n"})]

Looking at the TOML spec, it looks like Unicode characters should be encoded with a \u prefix. Modifying the string to contain an extra u0 allows encoding to succeed, and I think that's what we want given that's roughly the decimal output in an online Unicode converter:

Toml.decode (Toml.text "k") "k = \"\\u0252\"\n"
-- Right "\594"

But I'm pretty ignorant about character encoding and am honestly not sure if that's the right output. :smile:

chshersh commented 4 years ago

Hi @samhh, thanks for submitting the issue! This indeed looks like an unexpected behaviour :disappointed: tomland uses Text internally, and during encoding Text is printed using the show function. The show does the escaping of all characters. You can reproduce this behaviour even easier in GHCi:

λ: show "ü"
"\"\\252\""

It looks like some smarter handling of Unicode characters is required to preserve TOML semantics.

Relevant code to change is here:

https://github.com/kowainik/tomland/blob/38d74d37efa876e57c01bcfecca56f032305d957/src/Toml/Type/Printer.hs#L141

A more interesting question is why such errors weren't caught by our property tests? :thinking: :thinking: :thinking:

dariodsa commented 3 years ago

I implemented the requested feature, now I only have to implement tests and figure it out why it wasn't caught by our test cases. Code needs some cleaning but it is working. :-)

dariodsa commented 3 years ago

Our tests was ok, but they were testing something different. Text was generated with unicode characters but they were written like \u010d, but they weren't generate in its real form, č. Thank you @samhh for catching that error. I will need to rewrite some tests but I will explain it in more details in PR.

kowainik / tomland

Unicode characters incorrectly escaped in encode #334