Closed cburgmer closed 2 years ago
I believe there's already a convention that bracketed syntax (`[1]` and `['foo']`) is the normalized syntax.
Secondly, where should these normalized paths be used/enforced? Surely input should handle any form. Output (i.e. paths to matches), however, should be normalized.
@cburgmer, it's an interesting issue.
I was interested in how XPath 3.1 handles normalization, discussed here, and it appears to have very limited support, remarking that "Unless explicitly stated, the `xs:string` values returned by the functions in this document are not normalized in the sense of Character Model for the World Wide Web 1.0: Fundamentals."
XPath 3.1 does have an fn:normalize-unicode function, which applies unicode normalization to a string. It has a two argument overload that allows the caller to set the Normalization Form, with five different kinds of normalization. I suppose we could use such a function to compare string values in filter expressions.
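As a sketch of what such a comparison could look like, here is a minimal Python illustration (the helper name `normalized_equals` is made up; Python's `unicodedata` supports four forms, one fewer than the XPath function):

```python
import unicodedata

def normalized_equals(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after applying a Unicode Normalization Form,
    roughly analogous to comparing fn:normalize-unicode() results in XPath.
    Python supports four forms: NFC, NFD, NFKC, NFKD."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# "Mot\u00f6rhead" (precomposed o-umlaut) vs "Moto\u0308rhead" (o followed by
# a combining diaeresis): unequal as raw code points, equal after NFC.
print("Mot\u00f6rhead" == "Moto\u0308rhead")                   # False
print(normalized_equals("Mot\u00f6rhead", "Moto\u0308rhead"))  # True
```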
As for looking up property values in a JSON object, I think many JSONPath implementations use functions provided by another library, such as `JsonElement.TryGetProperty` in .NET. And it seems from your test that they don't generally support normalization.
Wait, are we discussing Unicode normalization or Path normalization? I think my comment may have been off-topic.
When discussing normalisation of text in this context, it is useful to have read https://www.w3.org/TR/charmod-norm/
(I'm assuming that we are not using the once popular model where processing entails text-normalising everything in sight first and then working with that, but that text normalisation occurs only as part of comparison.)
It seems that text comparison is relevant to JSONPath in two places:
- Depending on your view of what JSON objects are good for, indexing may have a normalisation requirement, or you are only ever indexing with a known vocabulary (which is then easy to pre-normalise).
- Expression-language equivalence cannot ignore the problem as easily, or you won't find Jörg if you have a Jörg (sorry, this distinction doesn't seem to survive here).
I've added another test for filter expressions with string equality. It shows two implementations employ some form of normalization when checking for equality: https://cburgmer.github.io/json-path-comparison/results/filter_expression_with_equals_string_in_NFC.html (see `Mot\u00f6rhead` and `Moto\u0308rhead`), while (again) the consensus seems to be to not normalize.
I'm relieved that the consensus appears to be to use byte string comparison. (I'm too dense to find out which ones are the implementations that do employ some normalization.)
That does not mean, though, that we couldn't add a separate operator for normalizing comparison.
> I've added another test for filter expressions with string equality ... consensus seems to be to not normalize.
As a practical matter, I think that the issue would most likely be felt in this context, rather than with key lookup. Keys are usually chosen judiciously so as not to cause problems.
For JSONPath implementations in interpreted languages such as JavaScript, Python or PHP, and whose filter expressions are these languages, they already have functions that return the Unicode Normalization Form of strings, see e.g. JavaScript String.prototype.normalize, Python unicodedata.normalize, and PHP normalizer.normalize.
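For instance, in Python (a minimal illustration; the other languages' functions behave analogously):

```python
import unicodedata

s = "Jo\u0308rg"  # "Jörg" written with a combining diaeresis (NFD-style)
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    t = unicodedata.normalize(form, s)
    print(form, len(t), [f"U+{ord(c):04X}" for c in t])
# NFC/NFKC collapse the pair to 4 code points (precomposed U+00F6),
# while NFD/NFKD keep 5 (U+006F followed by U+0308).
```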
Perhaps other implementations that implement their own expression evaluators could support a function like XPath's fn:normalize-unicode.
For another comparison, I looked at JMESPath, and it specifically states that "All string related functions are defined on the basis of Unicode code points; they do not take normalization into account." There is a community developed JMESPath extension library that supports other functions, but I don't think any of these provide normalization functionality.
For one more comparison, I looked at JSONiq, an ISO/IEC approved, OASIS standard. JSONiq supports the XPath and XQuery function fn:normalize-unicode.
I'd like to cite the JSON RFC (RFC 8259), Section 8.3:

> **8.3. String Comparison**
>
> Software implementations are typically required to test names of object members for equality. Implementations that transform the textual representation into sequences of Unicode code units and then perform the comparison numerically, code unit by code unit, are interoperable in the sense that implementations will agree in all cases on equality or inequality of two strings. For example, implementations that compare strings with escaped characters unconverted may incorrectly find that "a\b" and "a\u005Cb" are not equal.
So, JSON considers strings equal if they consist of the same sequence of Unicode code units. In fact, that means a valid JSON object can have two property keys that become equal once normalized. If we enforce normalization in JSONPath, we cannot address each of those properties separately.
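A small Python sketch of that point (illustrative only):

```python
import json
import unicodedata

# A valid JSON document whose two member names differ only in normalization:
# one uses a combining diaeresis, the other the precomposed character.
doc = '{"Jo\\u0308rg": 1, "J\\u00f6rg": 2}'
obj = json.loads(doc)
print(len(obj))  # 2 -- code-unit comparison keeps the members distinct
# After NFC normalization the names collide, so a normalizing JSONPath
# lookup could no longer address each member separately:
print(len({unicodedata.normalize("NFC", k) for k in obj}))  # 1
```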
Task: Need to write up 112 output:
Consensus: Normalization not part of spec, codepoint comparison should be used
(Note that codepoint comparison is the same as byte string comparison when the byte string carries UTF-8.)
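A quick sanity check of that equivalence in Python (illustrative; UTF-8 preserves code point ordering, so byte-wise and code-point-wise comparison agree on both equality and order):

```python
pairs = [
    ("Mot\u00f6rhead", "Moto\u0308rhead"),  # NFC vs NFD spelling
    ("J\u00f6rg", "J\u00f6rg"),
    ("a", "b"),
]
for a, b in pairs:
    # Equality and ordering come out the same whether we compare
    # code points (str) or the UTF-8 encoded byte strings.
    assert (a == b) == (a.encode("utf-8") == b.encode("utf-8"))
    assert (a < b) == (a.encode("utf-8") < b.encode("utf-8"))
print("code point comparison and UTF-8 byte comparison agree")
```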
Please, no normalization. On that path lies madness.
I think, however, that there should be a note in the text (if there isn't already, haven't checked) making it explicit that implementations MUST NOT normalize before comparing member names and string values.
Fixed in 822fd6d3d5d55d1ae1637498b28df086ee9f9c72
When looking up object member names (dot selector or index selector) normalization of strings becomes an issue. See https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization for an introduction on normalization in Unicode.
As indicated by one test in the comparison project (https://cburgmer.github.io/json-path-comparison/#bracket_notation_with_NFC_path_on_NFD_key), it seems, though, that the consensus is not to match equivalent code points. The Raku implementation is the only one that matches the equivalent key, which seems to stem from the Raku runtime already normalizing strings upfront (if my manual tests are to be believed).
Is this something we want to prescribe in a standard, or would we end up with advice that possibly contradicts the behaviour of JSON parsers? (A quick search turned up http://seriot.ch/parsing_json.php, which complains about differing behaviour among parsers with respect to string normalization.)