pelletier / go-toml

Go library for the TOML file format
https://github.com/pelletier/go-toml
Other
1.67k stars 208 forks source link

fix(marchaler): escape runes outside the multilingual plane #937

Closed JanDeDobbeleer closed 3 months ago

JanDeDobbeleer commented 3 months ago

I recently switched to your amazing toml library and noticed it has an issue quoting strings as literal which ruines the rendering in oh-my-posh when exporting a configuration (see https://github.com/JanDeDobbeleer/oh-my-posh/issues/4750?notification_referrer_id=NT_kwDOACYJb7I5Nzc3NzUxMDkzOjI0OTI3ODM#issuecomment-1989538542).

This PR ensure we only quote strings as literal when it only contains characters inside the multilingual plane and/or emoji.

There was a V1 test that, rightfully so, fails which I also adjusted for correctness.


Paste benchstat results here (running ATM, takes a loooong time). Will post when it's done.

pelletier commented 3 months ago

Thank you for your contribution!

I'm a little confused about the need to marshal strings with those characters into TOML basic strings instead of literal strings.

Looking at the spec, literal strings can contain the same characters as basic strings, except those that should be quoted:

Any Unicode character may be used except those that must be escaped: quotation mark, backslash, and the control characters other than tab (U+0000 to U+0008, U+000A to U+001F, U+007F).

So any other Unicode point, as long as it is correctly encoded in UTF-8, should be allowed in literal strings.

Do you mind elaborating on why this is a bug? I'm sorry if I've missed something.

JanDeDobbeleer commented 3 months ago

@pelletier it is allowed, but when you unmarchal these, also in go they will be string literals. And there we get into issues as this is in our case then also printed literally instead of interpreted. You can see the effect in the linked issue. They aren't literal strings in the object, yet always exported as such.

In our case, this struct:

{
    PowerlineSymbol: "\ue0b0"
}

Is exported as:

[[blocks.segments]]
    powerline_symbol = '\ue0b0'

while it should be:

[[blocks.segments]]
    powerline_symbol = "\ue0b0"

I'd argue there's a difference between the above and:

{
    PowerlineSymbol: `\ue0b0`
}

which is an actual string literal.

pelletier commented 3 months ago

The literal "\ue0b0" in Go represents a string with one code point, which should be marshaled in its UTF-8 encoding as '' by go-toml. That seems to be the case https://go.dev/play/p/oY0zaVTb9Tl

For go-toml to emit the TOML string "\ue0b0", the Go string needs to contain those exact characters. For example, any of those should work:

`\ue0b0`
"\\ue0b0"

I'm not familiar with oh-my-posh, but took at stab at looking that the config export code in that repo. This seems to be the relevant snippet:

https://github.com/JanDeDobbeleer/oh-my-posh/blob/3d7be9250c7b84a173fbf503670167570a47e45d/src/engine/config.go#L170-L211

Looks like all three export formats follow a pattern. First export using a format-specific library, then all apply the same post-processing routine escapeGlyphs:

https://github.com/JanDeDobbeleer/oh-my-posh/blob/3d7be9250c7b84a173fbf503670167570a47e45d/src/engine/config.go#L265-L333

From what I gather, this function goes over all runes in the output, and transforms some unicode points to escape them in a \u... notation.

This process probably only works in JSON, where strings are always represented in double quotes, and double quotes accept the \u... notation. In the case of TOML (and I suspect YAML, as the original poster reports a problem with it too), it would only work if the emitted string happens to be double quoted to accept the \u... notation as a way to express code points.

Assuming this analysis is correct, I don't think there is an issue with go-toml here. Maybe move the the glyph escaping to a pre-processing step, before calling the encoders? That way it would avoid needing to be format-specific, and work for all three formats?

JanDeDobbeleer commented 3 months ago

@pelletier I can validate that indeed. Allow me to check and report back. I didn't think about that anymore as that part was somewhere in the past and did miss that during my analysis.

JanDeDobbeleer commented 3 months ago

@pelletier confirmed, and sorry for missing that part completely. If I need to get a string representation of these code points, I'll figure that out later. For now that's not as important.

pelletier commented 3 months ago

No worries, glad you figured out a solution to your issue! 🙌🏻