gibson042 / canonicaljson-spec

Specification of canonical-form JSON for equivalence comparison.
http://gibson042.github.io/canonicaljson-spec

Canonicalization of JSON does not specify Unicode Normalization Form #8

Closed · jhellingman closed this issue 4 years ago

jhellingman commented 4 years ago

This specification of canonical JSON does specify the use of UTF-8; however, it does not specify any Unicode Normalization Form (see https://unicode.org/reports/tr15/). This means that two equivalent messages can still have different representations. Since implementations of Unicode normalization are widely available, it would not be hard to add a requirement to use Unicode Normalization Form C (NFC). NFC is the form that has the least impact on pre-existing messages.
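A minimal sketch of the divergence being described, using Python's standard unicodedata and json modules (illustrative only, not part of this specification or its reference implementations):

```python
import json
import unicodedata

# The same text in two normalization forms: NFC uses the precomposed
# U+00E9, NFD uses U+0065 followed by combining U+0301.
nfc = unicodedata.normalize("NFC", "café")
nfd = unicodedata.normalize("NFD", "café")

print(nfc == nfd)  # False: the code point sequences differ

# Serializing each as UTF-8 JSON yields different byte sequences, so a
# canonical form that ignores normalization produces two different
# "canonical" documents for logically equivalent text.
print(json.dumps({"name": nfc}, ensure_ascii=False).encode("utf-8"))
print(json.dumps({"name": nfd}, ensure_ascii=False).encode("utf-8"))
```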

simon-greatrix commented 4 years ago

If a JSON String is intended to represent a genuine piece of text, then I believe it should be in a specified normalization form, but I also believe that is beyond the scope of this specification.

The purpose of this specification is to define a stable canonical form for any legal JSON document, one that neither changes nor limits the values that can be expressed.

JSON allows any sequence of Unicode code points in Strings, including code points that are not yet assigned. If a JSON document contains a code point that is later assigned a character value affected by normalization, then the canonical form of that document would change.
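As a rough illustration of that stability point (Python's unicodedata module, which reflects whatever Unicode version the interpreter ships): an unassigned code point passes through NFC unchanged today, but nothing prevents a future Unicode version from assigning it a character that normalizes differently.

```python
import unicodedata

# U+E0200 is unassigned in current Unicode data (general category "Cn").
ch = "\U000E0200"
print(unicodedata.unidata_version, unicodedata.category(ch))

# Today NFC leaves it untouched; a future assignment with a canonical
# decomposition or composition could change this result, and with it the
# "canonical" form of any JSON document containing the code point.
print(unicodedata.normalize("NFC", ch) == ch)
```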

Furthermore, normalization of Strings could change the meaning of a document, which would be bad. The specification does require that "numbers" are represented in a specific way, but this does not change their value when they are interpreted as decimal numeric values.

Requiring the use of a Unicode Normalization Form may change the interpretation of a String value. The JSON specification, RFC 8259, states:

8.3. String Comparison

Software implementations are typically required to test names of object members for equality. Implementations that transform the textual representation into sequences of Unicode code units and then perform the comparison numerically, code unit by code unit, are interoperable in the sense that implementations will agree in all cases on equality or inequality of two strings.

This means that an implementation that treats differently normalized forms of the same text as equal will NOT be interoperable with those that compare as described above. It also means implementations are free to consider normalization if they want to.

A JSON document that used different normalized forms of the same piece of text as object member names would be considered to have different member names by implementations that follow the hint above. An implementation that normalized text would see the same field name used multiple times, and might therefore lose the ability to distinguish between them.
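A hypothetical illustration of that member-name hazard (again Python, with an NFC key and an NFD key spelling the same word):

```python
import json
import unicodedata

# Two member names that render identically: one precomposed (NFC),
# one decomposed (NFD).
doc = '{"caf\u00e9": 1, "cafe\u0301": 2}'

# Comparing names code point by code point (as RFC 8259 describes)
# keeps both members distinct.
parsed = json.loads(doc)
print(len(parsed))  # 2

# Normalizing the names first collapses them, silently dropping one value.
collapsed = {unicodedata.normalize("NFC", key): value for key, value in parsed.items()}
print(len(collapsed))  # 1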

So: normalization limits the values that can be expressed, is not guaranteed to be stable, and may change the meaning of the document.

As I said at the start, if you are sending text, it should be normalized, but that should happen before you put it into your JSON, not as part of your canonicalization of the JSON.

jhellingman commented 4 years ago

Thanks, that is a clear separation of concerns.