json5 / json5-spec

The JSON5 Data Interchange Format
https://spec.json5.org
MIT License

Nitpickery about object keys and Unicode #49

Closed mqnc closed 6 months ago

mqnc commented 6 months ago

I want to translate the JSON5 reference implementation from JavaScript to C++ because I am not happy with the C++ versions out there. I have a few questions about the definition of object keys in JSON5.

In the JSON5 spec, it only says:

Object keys may be an ECMAScript 5.1 IdentifierName.

The ECMAScript 5.1 spec says:

An Identifier is an IdentifierName that is not a ReservedWord.

According to the reference implementation, function is a valid object key but not a valid ECMAScript 5.1 identifier name.

The spec further states:

Two IdentifierName that are canonically equivalent according to the Unicode standard are not equal unless they are represented by the exact same sequence of code units (in other words, conforming ECMAScript implementations are only required to do bitwise comparison on IdentifierName values).

This is also not true for JSON5 keys according to the reference implementation.

Last point:

ECMAScript implementations may recognise identifier characters defined in later editions of the Unicode Standard. If portability is a concern, programmers should only employ identifier characters defined in Unicode 3.0.

If I understood correctly, the reference implementation uses Unicode 10.0. Will it stay that way, and should I also use Unicode 10.0 for compatibility? Or should I use 3.0 or 15.0 instead and keep updating?

jordanbtucker commented 6 months ago

Thanks for checking in. It's great that you're making a C++ implementation!

Regarding the identifier topic, you're correct that function is an IdentifierName but not an Identifier. However, both ES5 and JSON5 use the IdentifierName production in the grammar for PropertyName and JSON5MemberName respectively. This means that function is a valid key in both ES5 and JSON5 objects.
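To make the IdentifierName versus Identifier distinction concrete, here is a minimal Python sketch of a key check built from the ES5.1 character classes (UnicodeLetter, UnicodeCombiningMark, UnicodeDigit, UnicodeConnectorPunctuation). The function names and the abbreviated reserved-word list are made up for illustration, escape sequences are not handled, and the result depends on the Unicode tables bundled with the Python runtime rather than on Unicode 10.

```python
import unicodedata

# General categories that ES5.1 groups under UnicodeLetter.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
# Extra categories allowed after the first character:
# combining marks (Mn, Mc), digits (Nd), connector punctuation (Pc).
PART_CATEGORIES = LETTER_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}

# Abbreviated, illustrative subset of the ES5.1 ReservedWord list.
SOME_RESERVED_WORDS = {"function", "if", "return", "null", "true", "false"}


def is_identifier_start(ch):
    return ch in "$_" or unicodedata.category(ch) in LETTER_CATEGORIES


def is_identifier_part(ch):
    # U+200C (ZWNJ) and U+200D (ZWJ) are also allowed in IdentifierPart.
    return (
        ch in "$_\u200c\u200d"
        or unicodedata.category(ch) in PART_CATEGORIES
    )


def is_identifier_name(key):
    """True if key matches IdentifierName; reserved words are NOT excluded."""
    return (
        len(key) > 0
        and is_identifier_start(key[0])
        and all(is_identifier_part(ch) for ch in key[1:])
    )


def is_identifier(key):
    """True if key is an IdentifierName outside the reserved subset above."""
    return is_identifier_name(key) and key not in SOME_RESERVED_WORDS
```

Here is_identifier_name("function") is true while is_identifier("function") is false, which is why function can appear as an unquoted key.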

Can you please elaborate on the following? In what cases does the reference implementation not follow this?

The spec further states:

Two IdentifierName that are canonically equivalent according to the Unicode standard are not equal unless they are represented by the exact same sequence of code units (in other words, conforming ECMAScript implementations are only required to do bitwise comparison on IdentifierName values).

This is also not true for JSON5 keys according to the reference implementation.

As for compatibility, my recommendation would be to use Unicode 10, since that is what the reference implementation uses and there are no plans to upgrade the version of Unicode at this time.
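For a port, one way to see why the Unicode version matters is to check which version a given runtime's character database implements. A small Python sketch follows; the helper names are hypothetical, while unicodedata.unidata_version is the standard-library attribute.

```python
import unicodedata


def unicode_major_version():
    """Major version of the Unicode database bundled with this Python."""
    return int(unicodedata.unidata_version.split(".")[0])


def is_assigned(ch):
    """True if the bundled Unicode database assigns a category to ch."""
    # "Cn" means the code point is unassigned in the bundled version.
    return unicodedata.category(ch) != "Cn"
```

A letter added after Unicode 10 would be unassigned in a Unicode 10 table, so an implementation pinned to Unicode 10 would presumably reject it in an unquoted key while one tracking a newer version would accept it.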

mqnc commented 6 months ago

Regarding the identifier topic, you're correct that function is an IdentifierName but not an Identifier.

Touché!

If I understand the ECMA spec correctly, \u0078 is considered to be different from x, so {\u0078:1, x:2} would be OK. In the reference implementation, though, those result in the same key (I think). However, the spec goes on:

The intent is that the incoming source text has been converted to normalised form C before it reaches the compiler.

Maybe I just don't really get that point...

Could you maybe specify Unicode 10 in the reference?

jordanbtucker commented 6 months ago

As far as I understand the ES5.1 specification, that paragraph has to do with Unicode normalization and not character escapes. Since there is more than one way to represent the same canonical Unicode code points at the 16-bit code unit level according to the Unicode standard, the ES5.1 spec does not burden implementations with the task of determining whether two unique 16-bit code unit sequences represent the same canonical Unicode code points. Instead, the source code is expected to be normalized via Normalization Form Canonical Composition (NFC) before it is processed by the ES implementation. Later specifications removed this requirement since they process source code as Unicode code points instead of 16-bit code units. JSON5 also excludes this requirement for the same reason.
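As an illustration of the normalization point, the Python sketch below compares a precomposed é with its decomposed form. The function names are illustrative only; unicodedata.normalize is the standard-library call.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT


def same_code_points(a, b):
    """Plain comparison: equal only if the code point sequences match."""
    return a == b


def canonically_equal(a, b):
    """Comparison after converting both strings to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

The two strings are canonically equivalent but not identical: same_code_points returns false and canonically_equal returns true. As I read the explanation above, a parser that does no normalization behaves like same_code_points and would treat them as two distinct keys.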

That being said, the paragraph before that one does cover character escapes, and it explains that IdentifierNames must be compared for equivalency after the character escapes have been converted to characters. For example, the IdentifierName x is equivalent to \u0078, so they should be treated as the same IdentifierName.

All interpretations of identifiers within this specification are based upon their actual characters regardless of whether or not an escape sequence was used to contribute any particular characters.
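The escape rule can be sketched in a few lines of Python: decode the escapes first, then compare. The helper names are hypothetical, and this simplified version neither combines surrogate pairs nor checks that the decoded character is allowed in an IdentifierName.

```python
import re

# Matches a backslash, "u", and exactly four hex digits.
_UNICODE_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_key_escapes(raw_key):
    """Replace each \\uXXXX escape in an unquoted key with its character."""
    return _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), raw_key)


def same_key(raw_a, raw_b):
    """Compare two keys as written in the source, after decoding escapes."""
    return decode_key_escapes(raw_a) == decode_key_escapes(raw_b)
```

With this, same_key(r"\u0078", "x") is true, so {\u0078:1, x:2} contains the same key twice.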

Note that JSON5 does not put any restrictions on duplicate keys at the specification level, but does mention that duplicate keys should be avoided for interoperability. The reference implementation follows the same behavior as ES and silently overwrites duplicate keys, the last key in document order taking precedence.
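The last-key-wins behavior, and a way to flag duplicates for interoperability, can be sketched as follows. This uses Python's built-in json module on plain JSON as a stand-in, since the standard library has no JSON5 parser; find_duplicate_keys is a hypothetical helper, not part of any JSON5 library.

```python
import json


def find_duplicate_keys(text):
    """Return keys that occur more than once within any object in text."""
    duplicates = []

    def record_pairs(pairs):
        seen = set()
        for key, _ in pairs:
            if key in seen:
                duplicates.append(key)
            seen.add(key)
        # Build the object the usual way: later pairs overwrite earlier ones.
        return dict(pairs)

    json.loads(text, object_pairs_hook=record_pairs)
    return duplicates
```

Python's json.loads('{"x": 1, "x": 2}') yields {"x": 2}, matching the overwrite behavior described above, while find_duplicate_keys reports ["x"] for the same input.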

As for including a mention of Unicode 10 in the reference implementation, I'd be okay with adding that to the README.

mqnc commented 6 months ago

Ooooh thanks, now I got it. Who would have thought that in the end it turns out that you were right all along and I was wrong 😜. I shouldn't have made it sound like I knew what I was talking about. I thought

actual characters regardless of whether or not an escape sequence was used

refers to the byte stream of chars, especially as it says bitwise comparison later.

I will use this as a reference then, let's see if we accept the same strings.