haskell / aeson

A fast Haskell JSON library

Consider escaping a wider range of newline characters #1092

Closed joe-warren-permutive closed 6 months ago

joe-warren-permutive commented 6 months ago

The Unicode standard, section 5.8 (Newline Guidelines), lists 7 different newline characters.

The string escaping code in Data.Aeson.Text.hs appears to escape 4 of these 7 characters; the ones left unescaped are NEL (U+0085), LS (U+2028) and PS (U+2029).

I've encountered at least one parser that treats these characters as newlines, and therefore fails when it encounters them unescaped in a JSON string.

I'd like to suggest updating the escaping logic, so that these characters would be escaped.

RFC 8259, section 7, is fairly clear that these do not need to be escaped:

All Unicode characters may be placed within the quotation marks, except for the characters that MUST be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

However, it also states:

Any character may be escaped.

I'd suggest that escaping all newline characters would be more robust.
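
To make the suggestion concrete, here is a minimal sketch of the extra escaping, written as a post-processing pass over already-encoded JSON text and using only base. All names here (`newlineChars`, `mustEscape`, `optionalNewlines`, `unicodeEscape`, `escapeExtraNewlines`) are hypothetical and not part of aeson, and this is not how Data.Aeson.Text is implemented. It relies on the assumption that the input is valid JSON, in which case NEL, LS and PS can only occur inside string literals, so replacing them with `\uXXXX` escapes should leave the decoded value unchanged.

```haskell
import Data.Char (ord)
import Numeric (showHex)

-- The seven newline characters discussed above: LF, VT, FF, CR, NEL, LS, PS.
newlineChars :: [Char]
newlineChars =
  ['\x000A', '\x000B', '\x000C', '\x000D', '\x0085', '\x2028', '\x2029']

-- RFC 8259 section 7: quotation mark, reverse solidus and
-- U+0000 through U+001F must be escaped.
mustEscape :: Char -> Bool
mustEscape c = c == '"' || c == '\\' || ord c <= 0x1F

-- Newline characters that a spec-minimal encoder may leave unescaped.
optionalNewlines :: [Char]
optionalNewlines = filter (not . mustEscape) newlineChars

-- Render a character as a \uXXXX escape (sufficient for these BMP characters).
unicodeEscape :: Char -> String
unicodeEscape c = "\\u" ++ replicate (4 - length hex) '0' ++ hex
  where
    hex = showHex (ord c) ""

-- Hypothetical helper: escape NEL, LS and PS in already-encoded JSON text.
-- Assumes the input is valid JSON, so these characters only occur in strings.
escapeExtraNewlines :: String -> String
escapeExtraNewlines = concatMap step
  where
    step c
      | c `elem` optionalNewlines = unicodeEscape c
      | otherwise = [c]

main :: IO ()
main = do
  -- prints [133,8232,8233], i.e. NEL, LS and PS
  print (map ord optionalNewlines)
  -- prints "first\u2028second" (including the quotation marks)
  putStrLn (escapeExtraNewlines ("\"first" ++ ['\x2028'] ++ "second\""))
```

Running the pass over the encoder's output rather than changing the encoder itself would leave the default output untouched for anyone who depends on it.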

joe-warren-permutive commented 6 months ago

I've checked a range of different JSON printers.

Firefox does escape LS and PS (and treats them differently from other characters in the same block):

> JSON.stringify("\u2026")
'"…"'
> JSON.stringify("\u2027")
'"‧"'
> JSON.stringify("\u2028")
'"\u2028"'
> JSON.stringify("\u2029")
'"\u2029"'
> JSON.stringify("\u202a")
'"‪"'

However, jq, Circe, Node.js, and Chrome print those values unescaped.

Since implementations disagree, I don't think this is a compelling argument either for or against.


I'd be happy to make a PR for the fix myself, but this would be my first time contributing to Aeson, so I'm keen to figure out if it would be accepted before starting work.

joe-warren-permutive commented 6 months ago

This issue from 2015 heavily implies that "only escaping values required by the spec" is a deliberate design decision.

I'm leaning away from opening a PR for this, as it would break the current property that strings are encoded canonically according to RFC 8785, and any change to the string escaping would require copying the current logic into Data.Aeson.RFC8785.

phadej commented 6 months ago

Any choice here would be arbitrary; as you note, different implementations do different things. There is some reasoning behind the current choice, and I'm sure many users are also (implicitly) depending on the current behavior.