Open samhh opened 4 years ago
Hi @samhh, thanks for submitting the issue! This indeed looks like an unexpected behaviour :disappointed:
`tomland` uses `Text` internally, and during encoding `Text` is printed using the `show` function. `show` escapes every non-ASCII character (along with quotes and backslashes). You can reproduce this behaviour even more easily in GHCi:
λ: show "ü"
"\"\\252\""
It looks like some smarter handling of Unicode characters is required to preserve TOML semantics.
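A minimal sketch of what such handling could look like, assuming we only escape what a TOML basic string requires (quotes, backslashes, control characters) and pass everything else through unchanged. `escapeTomlString` is a hypothetical name, and it works on `String` with only `base` to stay self-contained; it is not the actual `tomland` code, which works on `Text`.

```haskell
import Data.Char (isControl, ord)
import Numeric (showHex)

-- | Hypothetical helper (not part of tomland): render a string as a TOML
-- basic string, escaping only quotes, backslashes and control characters.
-- Printable non-ASCII characters such as 'ü' are kept as-is.
escapeTomlString :: String -> String
escapeTomlString s = "\"" ++ concatMap escapeChar s ++ "\""
  where
    escapeChar :: Char -> String
    escapeChar '"'  = "\\\""
    escapeChar '\\' = "\\\\"
    escapeChar '\b' = "\\b"
    escapeChar '\t' = "\\t"
    escapeChar '\n' = "\\n"
    escapeChar '\f' = "\\f"
    escapeChar '\r' = "\\r"
    escapeChar c
      -- Remaining control characters become \uXXXX (hexadecimal code point).
      | isControl c = "\\u" ++ pad4 (showHex (ord c) "")
      | otherwise   = [c]

    -- Left-pad a hex string with zeros to four digits.
    pad4 :: String -> String
    pad4 h = replicate (4 - length h) '0' ++ h
```

With this, `escapeTomlString "ü"` keeps the `ü` intact between the quotes instead of turning it into `\252`.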
Relevant code to change is here:
A more interesting question is why such errors weren't caught by our property tests? :thinking: :thinking: :thinking:
I implemented the requested feature; now I only have to implement tests and figure out why it wasn't caught by our test cases. The code needs some cleaning, but it is working. :-)
Our tests were OK, but they were testing something different. Text was generated with Unicode characters, but they were written in escaped form, like `\u010d`, rather than generated in their real form, `č`. Thank you @samhh for catching that error.
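To illustrate the difference, here is a small sketch using only `base`. The names `escapedForm` and `realForm` are made up for this example and do not come from the test suite: the first is the six ASCII characters of the escape sequence, the second is the single character itself.

```haskell
-- | The escape sequence spelled out: backslash, 'u', '0', '1', '0', 'd'.
-- Every character is ASCII, so 'show' only has to escape the backslash.
escapedForm :: String
escapedForm = "\\u010d"

-- | The actual character č (U+010D) as a one-character string.
-- 'show' turns it into a decimal escape, which is not valid TOML.
realForm :: String
realForm = "\x010d"

-- | What 'show' produces for each form.
shownEscaped, shownReal :: String
shownEscaped = show escapedForm  -- "\"\\\\u010d\"" as a Haskell literal
shownReal    = show realForm     -- "\"\\269\"" as a Haskell literal
```

Only `realForm` goes through the non-ASCII branch of `show`, which would explain why generators that produce just the escaped form never triggered the bug.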
I will need to rewrite some tests, but I will explain it in more detail in the PR.
At least, that's what I think is happening. In a REPL, the following will fail:
Looking at the encoding, here's what we're given:
And passing that into decode will fail per the above:
Looking at the TOML spec, it looks like Unicode characters should be encoded with a `\u` prefix. Modifying the string to contain an extra `u0` allows encoding to succeed, and I think that's what we want given that's roughly the decimal output in an online Unicode converter:
But I'm pretty ignorant about character encoding and am honestly not sure if that's the right output. :smile:
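One detail that may help here: Haskell's `show` writes the code point in decimal (`\252`), whereas TOML's `\uXXXX` and `\UXXXXXXXX` escapes take it in hexadecimal, so the two numbers are not interchangeable. A rough sketch of the conversion, where `tomlUnicodeEscape` is a hypothetical helper and not a `tomland` function:

```haskell
import Data.Char (ord)
import Text.Printf (printf)

-- | Hypothetical helper: the TOML escape sequence for a single character.
-- Code points up to U+FFFF use the four-digit \u form, larger ones the
-- eight-digit \U form. Digits are hexadecimal, unlike the output of 'show'.
tomlUnicodeEscape :: Char -> String
tomlUnicodeEscape c
  | n <= 0xFFFF = printf "\\u%04X" n
  | otherwise   = printf "\\U%08X" n
  where
    n = ord c
```

For `ü` (decimal 252, hexadecimal FC) this gives `\u00FC`.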