json5 / json5-spec

The JSON5 Data Interchange Format
https://spec.json5.org
MIT License

Recommend avoiding invalid surrogate pairs #12

Open GULPF opened 5 years ago

GULPF commented 5 years ago

Unicode code points in the range U+D800 to U+DFFF are used for encoding UTF-16 surrogate pairs and cannot occur on their own in a valid Unicode sequence. However, the \uHHHH escape sequence can be used to construct such an invalid Unicode sequence, e.g. "\uDEAD".

The interpretation of invalid Unicode sequences like "\uDEAD" is not specified in the JSON, JSON5 or ES5 (I think) specs. However, it is specified in the ES6 spec:

A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a surrogate pair, is interpreted as a code point with the same value.

This also matches the behavior of the JSON5 reference implementation. I think it's best if this is also clarified in the JSON5 spec.
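For concreteness, here is a minimal sketch of the behavior described above (assuming the `json5` npm reference implementation, run under Node.js): a lone \uHHHH surrogate escape comes through as a single code unit with the same value, matching the ES6 rule.

```js
// Sketch: how the reference implementation handles a lone surrogate escape
// (assumes the `json5` npm package; run under Node.js).
const JSON5 = require('json5')

const s = JSON5.parse('"\\uDEAD"')
console.log(s.length)                      // 1
console.log(s.charCodeAt(0).toString(16))  // 'dead' -- an unpaired surrogate, kept as-is
```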

jordanbtucker commented 5 years ago

Thanks. RFC 8259 addresses this, so I agree that the JSON5 spec should address this too.

ell1e commented 4 years ago

May I ask why U+.... is limited to the UTF-16 range and why surrogates are used in the first place? If that were changed, then any sort of surrogate escaping would automatically become moot, because even the higher code points could be put in directly without surrogate workarounds.

jordanbtucker commented 4 years ago

@ell1e Unicode escapes are limited to UTF-16 code units because that's the syntax available in ES5.

ell1e commented 4 years ago

Oh, how unfortunate. I get it if you want to stick to it since JSON traditionally mapped to valid JavaScript, but I would suggest at least considering changing this if an opportunity presents itself (a new revision of the spec, if that is ever done). After all, technically it shouldn't be too hard to write a JavaScript parser that reads this, even though ES5 itself wouldn't accept the corresponding escape.

ell1e commented 4 years ago

Just to explain this further, I feel like there are a lot of headaches in the world because of UTF-16 surrogates. For example, the ES6 rule leads to confusing surrogating-the-surrogates in languages like Python 3, which already use surrogates internally, much as WTF-8 does, to retain broken values (such as the unpaired surrogates produced by broken UTF-16 implementations like Windows) inside a UTF-32 string: in Python 3, I believe any broken surrogates would themselves need to be treated as binary and remapped to other surrogates for the conversion to be lossless. (At least I'm fairly sure that is what CPython does internally if you decode with the surrogateescape error handler.) That sort of surrogate-hacking around broken surrogates is just mind-bending to me, and it confuses me why we would want more of it in existence.

Edit: obviously, the above isn't a technical problem since this will be handled transparently, so it works just fine. I just find it conceptually so weird that I'm wondering, why would you even want to encourage people to keep using broken UTF-16 surrogates?

I am therefore also suggesting that it might be better to reject broken UTF-16 surrogates as invalid. If one really wants them, they can be encoded with \x.... anyway, and I feel like \U+... should denote something that is actually valid, meaningful Unicode. But ideally, I just think \U+... should be an unbounded value covering all possible code points, with surrogates disallowed entirely, because then this whole mess wouldn't be required for people to simply input the Unicode character they want.

But that's just my opinion, obviously!

jordanbtucker commented 4 years ago

@ell1e You make some good points. I'm wondering why the ECMAScript Specification decided to keep unmatched surrogates rather than reject them as invalid, and whether their reasoning should pertain to JSON5.

This restriction does not exist in the JSON spec either, but that may be because so many JSON documents already exist in the wild, and adding that restriction would invalidate some JSON documents, causing interoperability issues.

This technically would be a breaking change, but it would still allow JSON5 to be backward compatible with ES5 and JSON.

I'll have to think about this some more.

ell1e commented 4 years ago

Ouch, I never noticed that these broken code-points-but-actually-UTF-16 escapes were already in original JSON; I assumed \U+... was a JSON5 addition. Well, that is a bummer. And I checked the spec; it says:

To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".

And that's kind of all on it.

Honestly, it feels like they just mixed up Unicode characters and the UTF-16 encoding (which I think was common before UTF-32 became more widely adopted as the "true" representation), and as if they didn't even think about the special case of invalid surrogates. Neither did Windows, nor anyone else doing UTF-16 without thinking about how to integrate surrogates properly when they were added... oh well.

I guess if it's broken in JSON that's kind of a point for just sticking with it :woman_shrugging:

jordanbtucker commented 4 years ago

Oh yeah. JSON5 is a superset of JSON, so if it's allowed in JSON it needs to be allowed in JSON5. Adding this restriction would actually break compatibility with JSON. So it's a no go.

ell1e commented 4 years ago

Yeah I agree. Ignore all I said :joy:

Edit: I mean to be fair, JSON doesn't really say it's allowed either. But in practice probably most implementations will allow it, so probably too late to try to curb it now.

jordanbtucker commented 4 years ago

Yeah I agree. Ignore all I said 😂

Your points are still valid, so I think it would be good to add a recommendation to avoid invalid surrogate pairs, similar to how avoiding duplicate names and using RFC 3339 formatted dates are recommendations, but not hard requirements.

ell1e commented 4 years ago

Edit: the following is basically all nonsense, nevermind :smile: If I may point out, I think JSON5's addition of \x... complicates this, though:

Basically, I think it is common for high-level languages with a UTF-32 string type to support retaining either binary garbage or invalid surrogates. The common approach in language runtimes (e.g. Python 3's surrogateescape mode, which is used by default when it converts external filenames/paths from raw bytes to its internal UTF-32 strings) is to map either of those into the surrogate code point range, since that range is not used in valid UTF-32 and is therefore free for such internal uses. But I don't think you can map both of these uses into that range without numeric overlap, and it's usually not necessary either.

But it's necessary for JSON5, because it's not defined how \x... relates to a broken UTF-16 surrogate. If it were specified that an invalid UTF-16 surrogate should be treated as the UTF-16 binary encoding of that sequence (which could also be written via \x...), and that the two are interchangeable, then that would work, but it would conflate byte values and code points in a quite confusing way. Without that relation being defined, you have two uses crammed into a range where some UTF-32 based runtimes cannot store both at once.

Am I making sense at all? I know this is really complicated, but this might also be a real problem in corner cases. After all, for JSON decoding & re-encoding usually gives the same content in a way agreed upon by different implementations of it, so that might be expected for JSON5 too.

GULPF commented 4 years ago

I'm not sure I fully understand this discussion, but it's my understanding that \xHH is not a byte escape sequence; it's a Unicode escape sequence for the range U+0000-U+00FF. From the JSON5 spec:

Any character may be escaped. If the character is in the Basic Latin or Latin-1 Supplement Unicode character ranges (U+0000 through U+00FF), then it may be represented as a four-character sequence: a reverse solidus, followed by the lower case letter x, followed by two hexadecimal digits that encode the character’s code point. A reverse solidus followed by the lower case letter x must be followed by two hexadecimal digits.

As such, \xHH cannot be used to represent arbitrary byte sequences; it's just a shorthand for \u00HH and should have no effect on this issue.
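A quick illustrative check of that point (a sketch assuming the `json5` npm package; plain JavaScript behaves the same way): \xHH is a code point escape for U+0000 through U+00FF, not a raw byte.

```js
// Sketch: \xFF parses to the code point U+00FF, i.e. it is equivalent to \u00FF.
const JSON5 = require('json5')

console.log(JSON5.parse('"\\xFF"') === '\u00FF')  // true
console.log('\xFF' === '\u00FF')                  // true in plain JavaScript as well
```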

ell1e commented 4 years ago

Oh, haha, my bad. Yeah, in that case it's moot; it seems like I should just read the spec better and write less :woman_facepalming: Edit: I'm wondering, though, isn't that quite unusual? I've never seen \x... used like that anywhere else.

jordanbtucker commented 4 years ago

As such, \xHH cannot be used to represent arbitrary byte sequences; it's just a shorthand for \u00HH and should have no effect on this issue.

@GULPF, you hit the nail on the head.

@ell1e, \xHH escapes in strings are more common than you might think. Off the top of my head, I know that C, C#, Python, Ruby, PHP, and of course JavaScript have them.

Java is the exception, but escapes in Java have always been weird. For example, "\u000A" is decoded by the compiler before the string is parsed, which gives a compiler error because string literals can't contain newline characters.

jordanbtucker commented 4 years ago

@ell1e I think I misunderstood your question. ECMA languages, like Java, JavaScript, and C# treat strings as a sequence of UTF-16 code units. So \xHH escapes are seen as 16-bit code units rather than 8-bit bytes.
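As a small sketch of that point in plain JavaScript (nothing JSON5-specific is assumed here), a \xHH escape yields one UTF-16 code unit, identical to \u00HH:

```js
// Sketch: in JavaScript, \x80 is one 16-bit code unit, not one byte.
const s = '\x80'
console.log(s.length)         // 1 (one code unit)
console.log(s.charCodeAt(0))  // 128 (0x80)
console.log(s === '\u0080')   // true
```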

ell1e commented 4 years ago

I guess this is more of a C/C++ thing, where \x... maps not to a code point but to a raw binary value. Interesting, I never realized this is commonly different in scripting languages. Unicode is weird :+1: Also, thanks for being so patient :joy:

jordanbtucker commented 4 years ago

ECMA languages are the exception. Most other languages store strings as bytes, so \xHH escapes will probably be bytes most of the time.

Rust is another exception as it stores character strings as UTF-8 and only allows \xHH escapes with values up to 0x7F. Rust also has byte strings, which are ASCII but allow \xHH escapes to store full range bytes.

ell1e commented 4 years ago

Yeah, the Rust behavior is what I expected \x... to do. Maybe that would be an argument for leaving that extension out? This different behavior really did surprise me, although I checked and Python 3 appears to treat it like that too, so it's not just JS.

dalle commented 3 years ago

It would be nice to have support for \Unnnnnnnn though (capital U).

The G clef character "𝄞" (U+1D11E) would be escaped as \U0001D11E.
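For comparison, a sketch of the forms involved (assuming the `json5` npm package for the parse call; the \U0001D11E syntax is the hypothetical proposal above and is not valid JSON5 or JavaScript today):

```js
// Sketch: current JSON5 surrogate-pair escape vs. the ES2015 code point escape.
const JSON5 = require('json5')

console.log(JSON5.parse('"\\uD834\\uDD1E"'))  // '𝄞' -- current JSON5/ES5 syntax
console.log('\u{1D11E}')                      // '𝄞' -- ES2015 code point escape
// Proposed (hypothetical, not valid today): "\U0001D11E"
```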

spitzak commented 2 years ago

There is absolutely a requirement that \xNN turn into the BYTE with the value 0xNN. Otherwise it is impossible to store arbitrary bytes as strings in JSON5, which makes it impossible to use JSON5 to represent any data that is allowed to contain arbitrary bytes, such as Unix filenames. Apparently Windows users are not living in fantasyland, and that is why invalid \uNNNN sequences are always supported by any Windows software: they realize the absolute nightmare of having a subset of allowed filenames or data fields that cannot be stored in or retrieved from strings.

I very much recommend that a byte-based API be added that always returns any \xNN sequence as the exact equivalent byte, and turns any incoming invalid UTF-8 bytes (which will always have the high bit set) into \xNN sequences. It must also turn any invalid \uNNNN sequences (unpaired surrogates) into the obvious 3-byte UTF-8-like encoding, though I am unsure whether there is a need for the reverse conversion (to \uNNNN rather than to three \xNN sequences).

There are all kinds of variations of what to do with any "wide character" API for the library. \xNN could turn into \u00NN for backward compatibility, or into \uDCNN as in WTF-8 for a somewhat more lossless experience. I don't have any real opinion on this, as the 8-bit string API is what is really needed.

jordanbtucker commented 2 years ago

@spitzak Thanks for the suggestions.

Otherwise it is impossible to store arbitrary bytes as strings in JSON5. This makes it impossible to use JSON5 to represent any data that is allowed to contain arbitrary bytes, such as Unix filenames.

How does JSON currently handle this issue?

spitzak commented 2 years ago

I believe it is somewhat random whether \xNN turns into the given byte value or into a pair of bytes (the first one being \xC2 or \xC3), depending on the JSON library. Typically it is a single byte if the library uses 8-bit strings internally, and two bytes if it uses 16-bit strings internally. My recommendation is that the single byte be made the official behavior.
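A sketch of the two-byte case described above, using standard JavaScript APIs (illustrative only; this is not how any particular JSON library is specified to behave): when the parser keeps Unicode strings internally, the code point U+0080 serializes to two UTF-8 bytes.

```js
// Sketch: the code point U+0080 becomes the UTF-8 byte pair 0xC2 0x80.
const bytes = new TextEncoder().encode('\x80')
console.log(Array.from(bytes, b => b.toString(16)))  // [ 'c2', '80' ]
```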

jordanbtucker commented 2 years ago

But any JSON library that parses \x escapes is non-standard because JSON does not allow \x escapes.

In my experience, byte sequences are stored as arrays of numbers in JSON.
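For example, a common pattern (a sketch using plain JSON and standard JavaScript, not anything specified by JSON5) is to round-trip raw bytes through an array of numbers:

```js
// Sketch: storing raw bytes as an array of numbers in JSON.
const path = Uint8Array.of(0x66, 0x6f, 0x6f, 0x80)  // "foo" plus one non-UTF-8 byte
const json = JSON.stringify(Array.from(path))
console.log(json)                                   // [102,111,111,128]
console.log(Uint8Array.from(JSON.parse(json)))      // round-trips the original bytes
```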

spitzak commented 2 years ago

Yes, making an array of numbers is one of the possible solutions; however, that is extremely user-unfriendly when 99.99999% of the strings actually are readable text. The reason to allow invalid bytes is so this does not have to be done and filenames, including the valid portions of ones containing invalid bytes, can remain readable. This also complicates any program attempting to store into json as it must test the string before storage to see if it should be converted to a numerical array; it also means it has to check for both possibilities when reading. It could also be a security hazard, as this allows obfuscation of filenames by writing them as byte values.

jordanbtucker commented 2 years ago

The reason to allow invalid bytes is so this does not have to be done and filenames, including the valid portions of ones containing invalid bytes, can remain readable.

I see your point, but I'm not sure it's wise to redefine what a string is in JSON5 to accommodate this single use case of mixed text and bytes.

This also complicates any program attempting to store into json as it must test the string before storage to see if it should be converted to a numerical array.

If your string is really an array of bytes, then it should be stored as an array of bytes, not a string. Both ECMAScript and JSON (ECMA-404) define strings as sequences of Unicode code points, whereas JSON (RFC 8259) and JSON5 define strings as sequences of Unicode characters. The implication is that strings are meant for text, not binary data.

It could also be a security hazard as this allows obfuscation of filenames by writing them as byte values.

I'm not convinced that an array of bytes is any less of a security risk than "\x6d\x61\x6c\x69\x63\x69\x6f\x75\x73", which is just as valid.

spitzak commented 2 years ago

Yes, it would be a good idea to disallow \xNN for bytes that are easily encoded another way, for security reasons. This is a problem right now even if you ignore the invalid UTF-8 problem. It would be nice if various ASCII strings of interest could always be found with a grep command.

If invalid bytes require an array of numbers, then you are forced to use this unsafe pattern. But it may be plausible to use:

["characters", 0x80, 0x81, "more characters", 0xC0, "xxx"]

I am however very concerned that this unusual syntax, and the need to support it in applications rather than the parsing library, is making things worse, not better. I really don't see any good solution other than fixing the meaning of \xNN.

jordanbtucker commented 2 years ago

Edit: Please disregard the comments below as they don't really make sense on second look.

Since JSON5 is already widely in use, disallowing certain behavior that was previously allowed would be a breaking change. So, any conclusions we draw would need to be phrased as a recommendation in the spec.

After considering all comments in this discussion and doing additional research, I think the best approach is to recommend that JSON5 authors and generators avoid invalid surrogate pairs when processing strings that represent Unicode text, but not when processing strings that represent binary data or text that does not conform to the Unicode standard, as in the case of POSIX filenames. Perhaps we should also recommend that \u escapes only be used in Unicode text strings while \x escapes only be used in binary data strings. Encoding whether a string represents Unicode text vs. binary data will not be a feature of JSON5. Determining whether a string represents Unicode text vs. binary data is outside the scope of JSON5, and it would likely be handled by data contracts (e.g. a schema).

There is precedent for allowing non-Unicode text in a string in JSON, which is built on the design of JavaScript. For example, the btoa function requires its input to be a binary string. However, we may add a recommendation around encoding binary data as Base64 as is being discussed in #35.

spitzak commented 2 years ago

That sounds like what I was complaining about. I do not want to have to write all the filenames as Base64 just because 0.00001% of them have some invalid UTF-8 in them! There must be a human-readable way to make text that contains invalid UTF-8 and invalid UTF-16.

jordanbtucker commented 2 years ago

@spitzak I can't tell if you're agreeing with my comments or not. If not, can you please propose an alternative recommendation for the spec?

spitzak commented 2 years ago

I'm now not sure I understand what you are saying. What I want is for the sequence "\xNN" to turn into the byte with the value 0xNN when a quoted string is read from the JSON5 library using a UTF-8 API. It also should return \uNNNN as 3 bytes when U+NNNN is an unpaired surrogate, and \uNNNN\uMMMM as 4 bytes when they are a proper pair of surrogates. In addition, the reverse conversions should be done when a UTF-8 string is given to the JSON5 library.

I am confused by the discussion of "text vs binary". Unless there are two different APIs for the JSON library, there is no way for it to tell, so it has to make these escape sequences work in all strings. I guess "recommendation" means it does work all the time and you are recommending that it not be used, but I don't consider this a really useful part of the spec. You can also "recommend" that all the words in a string be spelled correctly, but this has nothing to do with JSON checking or enforcing that.

jordanbtucker commented 2 years ago

@spitzak So, I went back and read your original comments, and I think the issues you raised, while valid concerns, are moot in the context of this issue.

Unpaired surrogates can only occur when Unicode code units within the range of U+D800 through U+DFFF are used. Note that these code units are greater than 0xFF, so they are never represented by single bytes in Unicode.

The \xHH syntax represents a Unicode code point with the given hexadecimal value. Since it has a maximum value of 0xFF it can never represent a surrogate.

I think some confusion comes in when we start talking about UTF-8 and UTF-16 in the context of binary strings. These are two different things. UTF-8 and UTF-16 define how Unicode code points should be converted to and from bytes, and code points and bytes do not have a one-to-one relationship. However, binary strings are sequences of Unicode code points, limited to the range of U+0000 through U+00FF, that each represent a single byte, giving code points and bytes a one-to-one relationship. Note that it doesn't matter whether the binary string is encoded as UTF-8 or UTF-16 because we are not analyzing the UTF-8 or UTF-16 bytes. We are analyzing the code point values as if each one was a byte.

So, I don't understand what you mean when you say you want \xHH to turn into a byte when using a UTF-8 API. The string 'a\x80b' is not UTF-8. It's just a string, a sequence of Unicode code points with the values 0x61, 0x80, and 0x62. If you were to convert this string to UTF-8, then it would be a sequence of bytes with the values 0x61, 0xC2, 0x80, and 0x62. But if you were interpreting the string as binary data, then you wouldn't care what the UTF-8 representation is. You only need to care about the code point values.

Recommending that JSON5 authors and generators avoid invalid surrogate pairs has no bearing on the \xHH syntax because they can never represent surrogates.
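A short sketch of the distinction drawn above, in plain JavaScript (illustrative only): analyzing the code point values of a binary string versus converting that string to UTF-8 bytes.

```js
// Sketch: code point values of 'a\x80b' vs. its UTF-8 encoding.
const s = 'a\x80b'
console.log([...s].map(c => c.codePointAt(0)))        // [ 97, 128, 98 ]      (0x61, 0x80, 0x62)
console.log(Array.from(new TextEncoder().encode(s)))  // [ 97, 194, 128, 98 ] (0x61, 0xC2, 0x80, 0x62)
```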

spitzak commented 2 years ago

Sorry about the confusion about bytes and unpaired surrogates. They are not related.

By "UTF-8" api I mean a function in the JSON5 library that returns an array of bytes that have been set based on quoted text in the file, which converts the character 'A' into 0x41, and converts the character '€' into 0xE2,0x82,0xAC (the UTF-8 encoding of the euro sign).

There is also another API that takes a UTF-8 string and writes a quoted representation of it to a JSON5 file; this must do the exact opposite conversion (i.e. it accepts any array of bytes and figures out a way to quote it, using \xNN sequences when necessary).

In addition, there is a "UTF-16" API that takes an 'A' in a quoted string and turns it into the 16-bit word 0x0041, takes '€' and turns it into 0x20AC, and takes Unicode code points greater than U+FFFF and turns them into two 16-bit words (a surrogate pair).
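A sketch of the conversions described above using standard JavaScript facilities (TextEncoder and charCodeAt); a hypothetical byte-level JSON5 API would presumably wrap something along these lines.

```js
// Sketch: UTF-8 bytes vs. UTF-16 code units for the examples given above.
const enc = new TextEncoder()
console.log(Array.from(enc.encode('A'), b => b.toString(16)))  // [ '41' ]
console.log(Array.from(enc.encode('€'), b => b.toString(16)))  // [ 'e2', '82', 'ac' ]
console.log('€'.charCodeAt(0).toString(16))                    // '20ac' (one UTF-16 code unit)
console.log('𝄞'.length)                                        // 2 (a surrogate pair)
```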

My other recommendations are all far less important than the "\xAB"-to-single-byte support:

jordanbtucker commented 2 years ago

@spitzak Thank you for clarifying your points. I still think there is some confusion around the term "UTF-8 string", but I believe I have a better idea of what you're proposing, even if I still don't agree with it.

This seems to be an API feature request rather than a specification request, and if so, I would recommend that you open this request as an issue in the json5 reference implementation repo. If you're asking to add a section to the spec recommending that JSON5 libraries implement this API, then I recommend opening this request as a separate issue in this repo.

In either case, let's continue this discussion elsewhere since we're straying a bit too far from whether the spec should recommend avoiding invalid surrogate pairs.