leadpony / jsonp-test-suite

Test Suite for implementations of Jakarta JSON Processing API (JSON-P)
Apache License 2.0
3 stars 2 forks source link

String comparisons should normalize to ASCII #4

Closed ssilverman closed 3 years ago

ssilverman commented 3 years ago

Some implementations output "\uXXXX" variants for non-ASCII characters (eg. the "G-clef" example, "\ud834\udd1e"). It might be more useful, in the tests that compare strings, to normalize to ASCII-only before the comparison. Another example: the em-dash (U+2014) in atom-api.json.

ssilverman commented 3 years ago

I just realized that the G_CLEF string may be wrong. See: https://github.com/leadpony/jsonp-test-suite/issues/6 This is likely the real cause of this issue.

ssilverman commented 3 years ago

Also related:

These are all cases (there may be more, I haven't scoured anything) where the JSON-encoded string result is compared with a decoded string result, i.e. "\u2014" is being compared with "—".

leadpony commented 3 years ago

@ssilverman

This test suite is targeting JSON-P API, which is now called as Jakarta JSON Processing API. This test suite is not a set of general tests targeting libraries such as Jackson or Gson.

Because the Reference Implementation outputs Unicode characters beyond ASCII code as-is, not Unicode-escaped, any implementation conforming to JSON-P API, including yours, should do so.

What you are doing is loosening the strictness offered by the tests, just for your implementation.

ssilverman commented 3 years ago

I think we can agree that {"A": "a"} and {"A": "\u0061"} are identical JSON objects, after parsing. The only difference is their JSON-encoded representation.

These tests that don't normalize are assuming that the JSON objects above are not identical. The tests, then, shouldn't say that they're testing eg. "JsonWriter" (or whatever object type). They need to state that they're actually testing "org.glassfish.json.JsonWriterImpl".

By saying that it is "loosening the strictness" to say that the above two JSON objects are equal is not something I can agree with. But I do hear your point about testing the implementation and not the JSON-P API spec. Perhaps differentiating the tests into the following would be useful:

  1. Tests for JSON-P API, and
  2. Tests for the Glassfish implementation of the JSON-P API because there are some differences.
leadpony commented 3 years ago

@ssilverman Yes, we can agree that {"A": "a"} and {"A": "\u0061"} are identical JSON object, in general.

The primary purpose of this test suite is minimizing differences in behavior between all implementations of JSON-P API, especially between the Reference Implementation and my own implementation. If users of the JSON-P API switch from the Reference Implementation to Joy and they found any small changes in behavior, except bug fix and/or better performance, they feel disappointed.

If you really want to output all Unicode characters as escaped, it should be optionally enabled via an implementation-specific configuration property. I feel sad if my name, written in Kanji Characters, is always escaped in the outputted JSON. So not-escaping is appropriate default configuration, I believe.

ssilverman commented 3 years ago

We're not talking about the display of characters, we're talking about internal representation. The two objects, {"A": "a"} and {"A": "\u0061"} are equivalent (the same goes for all Kanji characters); these tests are saying they're not, even if you say in your comment you agree they're the same.

If I were displaying a Unicode string, I would make sure that the correct characters are being used before displaying them, possibly including Unicode normalization and canonicalization. When a JSON string is displayed, if a developer doesn't use the correct characters for display, then that would make me sad too. But that's not what we're talking about. We're talking about a JSON encoding of the characters.

In other words, I would do: display(processUnicode(jsonString.getString())). I would not do: display(jsonString.getString()). One of the things processUnicode would do is un-escape "\u" sequences into their correct characters, especially with JSON from unknown origins. Now, the reference implementation happens to be doing that step when it reads the string. Other implementations may not. JSON-P does not require this, and it's stated nowhere that "an implementation must always store un-escaped Unicode in JSON strings." If it did, it's no longer a generalized spec, it's dictating how to implement. Otherwise, JSON-P isn't a specification, it's a "how Glassfish does it."

The JsonString.getString() docs state:

Returns the JSON string value.

That's it. It doesn't state: Returns the JSON string value with all characters un-escaped.

JSON output is meant for encoding, not display. These tests are effectively saying that {"A": "a"} and {"A": "\u0061"} are not the same, which is not correct.

I'm curious, why are these tests labeled as "JSON-P tests", and not "Glassfish tests"? Many of the assumptions made don't agree with the JSON-P text.

leadpony commented 3 years ago
JsonString s = Json.createValue("©");
System.out.println(s.toString());

Implementation X outputs:

"©"

Implementation Y outputs:

"\u00a9"

You think this discrepancy should be allowed. I think this discrepancy should not be allowed.

ssilverman commented 3 years ago

I do understand the choice made, despite our philosophical differences. but only for the string access and not for the JSON output.

Your example above isn’t about the issue I’m talking about.

I’m not disagreeing that toString() could contain unescaped characters. I’m strongly disagreeing that the JSON output comparison treats {"A": "a"} and {"A": "\u0061"} differently. That is absolutely incorrect.

ssilverman commented 3 years ago

I will leave this, though: I believe this is still wrong because the bad comparisons are done on JSON output, not on the raw strings.

I.e. The results of a string comparison for JSON output should be normalized, but the direct results of getString() don’t have to be.

ssilverman commented 3 years ago

I'll add: I've been thinking about this more, and I think the point I've not been getting across properly is that there's a difference between JSON output and string values. Maybe I didn't get that across in a clear way earlier. I still think this needs to be addressed, but now I'm more clear on what exactly I think is the bug. Thanks for your patience with all these notes and discussions.