Strict mode for JSON parsing

eamonnmcmanus commented 1 year ago

As noted in the documentation for JsonReader.setLenient, there are a number of areas where Gson accepts JSON inputs that are not strictly conformant, even when lenient mode is off:

JsonReader allows the literals true, false and null to have any capitalization, for example fAlSe

JsonReader supports the escape sequence \', representing a '

JsonReader supports the escape sequence \LF (with LF being the Unicode character U+000A), resulting in a LF within the read JSON string

JsonReader allows unescaped control characters (U+0000 through U+001F)

To which we could add that it allows a trailing comma in JSON arrays, for example ["foo", "bar",] as noted in #494.

We could imagine that GsonBuilder would acquire a method setStrict(), mutually exclusive with setLenient() and that JsonReader would likewise acquire setStrict(boolean) and isStrict(). We can't make strict mode the default, for fear of breaking existing users, but we could recommend it.

Marcono1234 commented 1 year ago

that it allows a trailing comma in JSON ~~objects~~ arrays[^1]

As mentioned in https://github.com/google/gson/issues/494#issuecomment-1374945520 that only seems to apply to lenient mode.

[^1]: Trailing commas in JSON objects always cause a MalformedJsonException for me, regardless of whether the reader is lenient or not.

eamonnmcmanus commented 1 year ago

Thanks @Marcono1234, I've updated the earlier comment to talk about JSON arrays rather than objects.

eamonnmcmanus commented 1 year ago

Now I'm very confused. Apparently trailing commas are already disallowed in both JSON objects and JSON arrays? So the list quoted in my earlier comment is correct without addition.

Anyway I still do think we should have a strict mode, that you can easily get using GsonBuilder.

marten-voorberg commented 1 year ago

My project group and I will try to resolve this issue for a software engineering course at KTH, Sweden.

What should be the interaction between setLenient and setStrict? It seems very strange to us to be able to set both flags on the same Gson instance. Should we throw an exception (maybe IllegalStateException) if you try to set both of these flags?

eamonnmcmanus commented 1 year ago

@marten-voorberg, thanks for the proposal!

I think there are several things that need to happen here, not necessarily in order:

Determine the things that are accepted today even in non-lenient mode but that are not strictly conformant to the JSON spec. It may well be that the list in JsonReader.setLenient is complete, but it would be worth studying the ECMA specification closely. (It's fairly short.)
Write test cases for these things. I think some of them are already tested, with comments to the effect of "this should not be accepted but is". To start with, new test cases will presumably assert that a construct can be parsed, even though eventually we want to check that it can't in strict mode.
Decide what the API should be that allows us to turn on strict parsing. It will probably include changes to both GsonBuilder and JsonReader.
Implement the API and revise the tests so they cover strict mode correctly.

I've created a branch strictness to which you can send pull requests. That way the mainline won't have partial changes, but also you don't have to do everything in one giant PR.

Concerning the API, I am actually thinking it might be cleaner to use an enum. It might be enum Strictness {LENIENT, DEFAULT, STRICT}, for example. Then GsonBuilder and JsonReader could have something like setStrictness(Strictness). The existing setLenient() would be equivalent to setStrictness(Strictness.LENIENT), and setLenient(boolean lenient) would be equivalent to setStrictness(lenient ? Strictness.LENIENT : Strictness.DEFAULT).

Another more general approach would be to have an enum for the various areas of leniency, for example comments; arbitrary capitalization (fAlse); nonstandard escape sequences. It might look something like this:

public enum Leniency {
  COMMENTS, CAPITALIZATION, ESCAPES, ...;

  public static final Set<Leniency> LENIENT = EnumSet.of(...);
  public static final Set<Leniency> DEFAULT = EnumSet.of(...);
  public static final Set<Leniency> NONE = EnumSet.ofNone(Leniency.class);
}

Then setLenient() would be equivalent to setLeniency(Leniency.LENIENT), etc. We would have to look at everywhere that the code currently checks the lenient boolean and classify it into one or more of the new enum constants.

I'm not sure that this extra generality would really be worthwhile. It would allow users to specify exactly which non-standard constructs they need to accept, while rejecting everything else. But I'm inclined to think you either accept the exact standard or you accept any random crap (which is what lenient mode means today). The main reason for DEFAULT is that the current non-lenient mode is not strict enough.

marten-voorberg commented 1 year ago

@eamonnmcmanus Thanks for the response!

I've submitted a WIP PR which throws exceptions when trying to parse any of the non-spec compliant values discussed in the issue description.

The API for setStrict in this PR is definitely going to change. I think your suggestions for an enum Strictness {LENIENT, STRICT, DEFAULT} is a very good way to do this. I also agree with you that the the more general approach a leniency enum for (dis)allowing each non-spec compliant feature separately seems a bit overkill.

Marcono1234 commented 1 year ago

@marten-voorberg, maybe it would also be interesting to add unit tests for the following:

U+2028 and U+2029 handling ECMAScript considers them line terminators but JSON does not. So "\"\u2028\u2029\"" should be considered valid JSON, and "[\u2028]" respectively "[\u2029]" should be considered invalid (all written as Java string literals here).
Other control characters handling There are actually more control characters than defined in the JSON specification. However, since the JSON specification does not list them, they should be allowed unescaped in string values. For example the following should be considered valid JSON: "\"\u007F\u009F\""

marten-voorberg commented 1 year ago

The JsonWriter also has a lenient field. I think it makes sense to also give the JsonWriter a strictness which exhibits the following behavior:

Strictness.DEFAULT: The behavior is the same as the current lenient = false behavior.
Stictness.LENIENT: The same behavior as lenient = true currently has, namely allowing infinity and NaN to be written.
Strictness.STRICT: Currenlty, the reader will escape '\u2028' and '\u2029' but according to the spec, these should not be escaped. I propose that these characters will not be escaped in strict mode.

Marcono1234 commented 1 year ago

Currenlty, the reader will escape '\u2028' and '\u2029' but according to the spec, these should not be escaped.

That behavior is actually allowed by the specification, see RFC 8259, section 7[^1^]:

Any character may be escaped.

Maybe there are use cases where users don't want \u2028 and \u2029 to be escaped, but I would assume they are rather rare, and it would also not be related to strict mode (related: #1984).

[^1^]: While RFC 8259 and ECMA-404 are supposed to be equivalent (see RFC section 1.2), it would probably be better to stick to the RFC document since that is what documentation of Gson refers to already (though it refers to older RFCs). In case someone is interested in this blog post Tim Bray, the editor of the latest JSON RFCs, explains (subjectively?) the historical reason for ECMA-404.

eamonnmcmanus commented 1 year ago

I agree with @Marcono1234: we can escape \u2028 and \u2029 always. The point about Strictness applying to writes as well as reads is a good one, but for now it doesn't look as if there's any difference between DEFAULT and STRICT there. Nevertheless for consistency I think we should make it so jsonWriter.setLenient(b) is equivalent to jsonWriter.setStrictness(b ? LENIENT : DEFAULT) and so the lenient field is replaced by strictness.

I also agree that RFC 8259 is a better reference than the ECMA spec I linked to earlier. Only one small caveat: there is an erratum with a grammar that seems to suggest that / must be escaped in string literals, and that's clearly wrong. The remaining errata don't affect Gson as far as I can see.

I also found this excellent resource about tricky areas in JSON parsing, including an excellent set of tests.

Marcono1234 commented 1 year ago

Only one small caveat: there is an erratum with a grammar that seems to suggest that / must be escaped in string literals, and that's clearly wrong.

You mean the erratum itself is wrong, right? Because the current JSON grammar seems consistent with the text above of it, saying "the characters that MUST be escaped: quotation mark, reverse solidus, and the control characters". So to my understanding, / can be escaped, but does not have to be.

including an excellent set of tests

Out of curiosity I have locally adjusted that project to run with the latest Gson version from master and to run also in strict mode; below is the code and the results, in case someone is interested. Malformed UTF-8 tests don't really apply since Gson works based on a Reader and not an InputStream. Looks like Gson passes most of the tests, except for the cases we already identified earlier.

(Hacky) JSONTestSuite integration and results

```java import java.io.*; import java.nio.charset.*; import java.nio.file.*; import com.google.gson.*; import com.google.gson.stream.*; public class Test { public static boolean parseJsonElement(String s, boolean lenient) throws IOException { Gson gson = new Gson(); TypeAdapter adapter = gson.getAdapter(JsonElement.class); JsonReader reader = new JsonReader(new StringReader(s)); reader.setLenient(lenient); try { adapter.read(reader); return reader.peek() == JsonToken.END_DOCUMENT; } catch (JsonParseException | MalformedJsonException | EOFException e) { e.printStackTrace(); return false; } } private static boolean jsonReaderSkip(String s) throws IOException { JsonReader reader = new JsonReader(new StringReader(s)); try { reader.skipValue(); return reader.peek() == JsonToken.END_DOCUMENT; } catch (JsonParseException | MalformedJsonException | EOFException e) { e.printStackTrace(); return false; } } public static void main(String[] args) { Path path = Paths.get(args[1]); String s = null; try { s = Files.readString(path); } catch (CharacterCodingException e) { System.out.println("Skipping testcase with malformed UTF-8 for: " + path); e.printStackTrace(); System.exit(0); } catch (Throwable t) { t.printStackTrace(); System.exit(2); } try { boolean isValid = false; if (args[0].equals("parse-lenient")) { isValid = parseJsonElement(s, true); } else if (args[0].equals("parse-strict")) { isValid = parseJsonElement(s, false); } else if (args[0].equals("skip")) { isValid = jsonReaderSkip(s); } else { System.out.println("Invalid arg: " + args[0]); System.exit(2); } if (isValid) { System.out.println("valid"); System.exit(0); } else { System.out.println("invalid"); System.exit(1); } } catch (Throwable t) { t.printStackTrace(); System.exit(2); } } } ``` Then adjust `run_tests.py` from JSONTestSuite to contain: ```python programs = { "gson-lenient": { "url":"...", "commands":["java", "-cp", "gson-2.10.2-SNAPSHOT.jar", "Test.java", "parse-lenient"] }, "gson-strict": { "url":"...", "commands":["java", "-cp", "gson-2.10.2-SNAPSHOT.jar", "Test.java", "parse-strict"] }, "gson-skip": { "url":"...", "commands":["java", "-cp", "gson-2.10.2-SNAPSHOT.jar", "Test.java", "skip"] } } ``` (assumes you are using JDK >= 11) Results: (cases with malformed UTF-8 are mostly irrelevant due to Gson working based on `Reader`; malformed Unicode escape being considered a parser crash is due to #2334) ![JSONTestSuite results](https://user-images.githubusercontent.com/11685886/222899360-94be4fbd-80f8-4200-a6b1-8ab8cb44cba5.png)

vitorpamplona commented 1 year ago

It would be nice to have a common behavior for escaping or not escaping \u2028 and \u2029 among all languages. I need to verify if the hash that somebody else created matches my hash of a JSON file. If their language decides to escape certain characters and mine doesn't, hashes won't match and we will have problems.

Marcono1234 commented 1 year ago

@vitorpamplona, do you really have to recreate the JSON document with Gson? Is there no way you can directly access the original JSON file and calculate the hash over that? Because there might also be other incompatibility issues, such as different floating point number parsing and conversion of floating point numbers to string between languages, and preserving the order of object properties.

Personally I think if such a setting is really needed it should not be tied to the strictness mode since this is not really related to strictness. Maybe it could be part of the newly added FormattingStyle; arguably configuring what to escape is more than just insignificant whitespace placement, but the content of the JSON data remains the same and for any compliant JSON parser it should not matter whether a certain character is escaped or not (unless of course the JSON specification requires the character to be escaped).

vitorpamplona commented 1 year ago

The issue is that the JSON is being passed around in multiple servers before it gets to me. It all depends on how those servers recreate that JSON in every step along the way. The JSON is just the serialization of the data model I have to work with. It shouldn't matter what they do. On my side, what I have to do is to receive the JSON, parse it into the data model, and rebuild the JSON as per the protocol's hashing spec (no escaping).

I bumped into this thread because today our app was rejecting a message with \u2028 escaped into a string and it took me 4 hours to figure out that Gson had hardcoded that escaping rule, ignoring the disable HTML escaping setting, which I was aware of.

marten-voorberg commented 1 year ago

@vitorpamplona I Agree with @Marcono1234 that I don't think that dealing with \u2028 and \u2029 really has anything to do with strictness. In the JSON spec does not define any special behaviour for these characters. They may be escaped, but they do not have to be escaped.

If we want to start treating \u2028 and \u2029 in some special way, I think we should introduce a new setting for this.

vitorpamplona commented 1 year ago

I will go back to my initial point. Whatever you choose to do, it would be nice to have a common behavior/setting name for escaping or not escaping \u2028 and \u2029 among all languages. At the least, there should be a setting to emulate what other libs/languages do so that we are not left with a big replaceAll clause after the json string has been created.

eamonnmcmanus commented 1 year ago

Some serialization formats do define a canonical form, so that the same input will always produce the exact same output. Plain JSON is not one of those formats. There are all sorts of things that can vary: you can insert whitespace between tokens; you can write a quote in a string as \" or as \u0022; you can change the order of name/value pairs within a JSON object.

There is a standard JSON Canonicalization Scheme (JCS), which defines exactly what bytes should be produced for any given object. However, I think that's different from what we're trying to do here. We could imagine later having a Strictness value that produces JCS on output and perhaps only accepts JCS on input. Or maybe this would be more appropriate as a FormattingStyle. Feel free to log a separate feature request if you think that would be useful. Gson might not be the best home for such a feature, though.

vitorpamplona commented 1 year ago

I agree, but like it or not, JSON is an extremely common canonicalization form. Most of us don't have any choice in the matter.

However, the issue here is not about an overarching canonical form for JSON. It's much, much simpler. It's simply a practically-helpful way of escaping Strings from the native representation of every language into the JSON String. That's it. No big dreams. All it takes is lib devs to come together and agree on 2-3 schemes, with commonly supported options, and ways to activate and deactivate them in the lib. It's not a standard. Just a lib implementation consensus.

eamonnmcmanus commented 1 year ago

It seems to me that the case you described requires a strict canonical format, which I think should be JCS. If we just canonicalize what string literals look like we're only solving half the problem.

vitorpamplona commented 1 year ago

Then solve half of the problem. I can guarantee you the other half is fully solved already :)

Marcono1234 commented 1 year ago

@eamonnmcmanus, this can be closed now since #2437 was merged, right? (I assume GitHub did not automatically close it because the first pull request targeted the strictness branch.)

Since this issue was only about parsing, I guess it would be good to create a separate issue if you want any change to how Gson emits JSON data, as mentioned above by Éamonn:

Feel free to log a separate feature request if you think that would be useful. Gson might not be the best home for such a feature, though.

google / gson

Strict mode for JSON parsing #2295