Closed eamonnmcmanus closed 1 year ago
that it allows a trailing comma in JSON
objectsarrays[^1]
As mentioned in https://github.com/google/gson/issues/494#issuecomment-1374945520 that only seems to apply to lenient mode.
[^1]: Trailing commas in JSON objects always cause a MalformedJsonException
for me, regardless of whether the reader is lenient or not.
Thanks @Marcono1234, I've updated the earlier comment to talk about JSON arrays rather than objects.
Now I'm very confused. Apparently trailing commas are already disallowed in both JSON objects and JSON arrays? So the list quoted in my earlier comment is correct without addition.
Anyway I still do think we should have a strict mode, that you can easily get using GsonBuilder
.
My project group and I will try to resolve this issue for a software engineering course at KTH, Sweden.
What should be the interaction between setLenient
and setStrict
? It seems very strange to us to be able to set both flags on the same Gson instance. Should we throw an exception (maybe IllegalStateException
) if you try to set both of these flags?
@marten-voorberg, thanks for the proposal!
I think there are several things that need to happen here, not necessarily in order:
JsonReader.setLenient
is complete, but it would be worth studying the ECMA specification closely. (It's fairly short.)GsonBuilder
and JsonReader
.I've created a branch strictness
to which you can send pull requests. That way the mainline won't have partial changes, but also you don't have to do everything in one giant PR.
Concerning the API, I am actually thinking it might be cleaner to use an enum
. It might be enum Strictness {LENIENT, DEFAULT, STRICT}
, for example. Then GsonBuilder
and JsonReader
could have something like setStrictness(Strictness)
. The existing setLenient()
would be equivalent to setStrictness(Strictness.LENIENT)
, and setLenient(boolean lenient)
would be equivalent to setStrictness(lenient ? Strictness.LENIENT : Strictness.DEFAULT)
.
Another more general approach would be to have an enum
for the various areas of leniency, for example comments; arbitrary capitalization (fAlse
); nonstandard escape sequences. It might look something like this:
public enum Leniency {
COMMENTS, CAPITALIZATION, ESCAPES, ...;
public static final Set<Leniency> LENIENT = EnumSet.of(...);
public static final Set<Leniency> DEFAULT = EnumSet.of(...);
public static final Set<Leniency> NONE = EnumSet.ofNone(Leniency.class);
}
Then setLenient()
would be equivalent to setLeniency(Leniency.LENIENT)
, etc. We would have to look at everywhere that the code currently checks the lenient
boolean and classify it into one or more of the new enum
constants.
I'm not sure that this extra generality would really be worthwhile. It would allow users to specify exactly which non-standard constructs they need to accept, while rejecting everything else. But I'm inclined to think you either accept the exact standard or you accept any random crap (which is what lenient mode means today). The main reason for DEFAULT
is that the current non-lenient mode is not strict enough.
@eamonnmcmanus Thanks for the response!
I've submitted a WIP PR which throws exceptions when trying to parse any of the non-spec compliant values discussed in the issue description.
The API for setStrict
in this PR is definitely going to change. I think your suggestions for an enum Strictness {LENIENT, STRICT, DEFAULT}
is a very good way to do this. I also agree with you that the the more general approach a leniency enum for (dis)allowing each non-spec compliant feature separately seems a bit overkill.
@marten-voorberg, maybe it would also be interesting to add unit tests for the following:
U+2028
and U+2029
handling
ECMAScript considers them line terminators but JSON does not. So "\"\u2028\u2029\""
should be considered valid JSON, and "[\u2028]"
respectively "[\u2029]"
should be considered invalid (all written as Java string literals here)."\"\u007F\u009F\""
The JsonWriter also has a lenient
field. I think it makes sense to also give the JsonWriter a strictness
which exhibits the following behavior:
Strictness.DEFAULT
: The behavior is the same as the current lenient = false
behavior.Stictness.LENIENT
: The same behavior as lenient = true
currently has, namely allowing infinity and NaN to be written. Strictness.STRICT
: Currenlty, the reader will escape '\u2028' and '\u2029' but according to the spec, these should not be escaped. I propose that these characters will not be escaped in strict mode. Currenlty, the reader will escape '\u2028' and '\u2029' but according to the spec, these should not be escaped.
That behavior is actually allowed by the specification, see RFC 8259, section 7[^1^]:
Any character may be escaped.
Maybe there are use cases where users don't want \u2028
and \u2029
to be escaped, but I would assume they are rather rare, and it would also not be related to strict mode (related: #1984).
[^1^]: While RFC 8259 and ECMA-404 are supposed to be equivalent (see RFC section 1.2), it would probably be better to stick to the RFC document since that is what documentation of Gson refers to already (though it refers to older RFCs). In case someone is interested in this blog post Tim Bray, the editor of the latest JSON RFCs, explains (subjectively?) the historical reason for ECMA-404.
I agree with @Marcono1234: we can escape \u2028
and \u2029
always. The point about Strictness
applying to writes as well as reads is a good one, but for now it doesn't look as if there's any difference between DEFAULT
and STRICT
there. Nevertheless for consistency I think we should make it so jsonWriter.setLenient(b)
is equivalent to jsonWriter.setStrictness(b ? LENIENT : DEFAULT)
and so the lenient
field is replaced by strictness
.
I also agree that RFC 8259 is a better reference than the ECMA spec I linked to earlier. Only one small caveat: there is an erratum with a grammar that seems to suggest that /
must be escaped in string literals, and that's clearly wrong. The remaining errata don't affect Gson as far as I can see.
I also found this excellent resource about tricky areas in JSON parsing, including an excellent set of tests.
Only one small caveat: there is an erratum with a grammar that seems to suggest that
/
must be escaped in string literals, and that's clearly wrong.
You mean the erratum itself is wrong, right? Because the current JSON grammar seems consistent with the text above of it, saying "the characters that MUST be escaped: quotation mark, reverse solidus, and the control characters". So to my understanding, /
can be escaped, but does not have to be.
including an excellent set of tests
Out of curiosity I have locally adjusted that project to run with the latest Gson version from master
and to run also in strict mode; below is the code and the results, in case someone is interested. Malformed UTF-8 tests don't really apply since Gson works based on a Reader
and not an InputStream
. Looks like Gson passes most of the tests, except for the cases we already identified earlier.
It would be nice to have a common behavior for escaping or not escaping \u2028
and \u2029
among all languages. I need to verify if the hash that somebody else created matches my hash of a JSON file. If their language decides to escape certain characters and mine doesn't, hashes won't match and we will have problems.
@vitorpamplona, do you really have to recreate the JSON document with Gson? Is there no way you can directly access the original JSON file and calculate the hash over that? Because there might also be other incompatibility issues, such as different floating point number parsing and conversion of floating point numbers to string between languages, and preserving the order of object properties.
Personally I think if such a setting is really needed it should not be tied to the strictness mode since this is not really related to strictness. Maybe it could be part of the newly added FormattingStyle
; arguably configuring what to escape is more than just insignificant whitespace placement, but the content of the JSON data remains the same and for any compliant JSON parser it should not matter whether a certain character is escaped or not (unless of course the JSON specification requires the character to be escaped).
The issue is that the JSON is being passed around in multiple servers before it gets to me. It all depends on how those servers recreate that JSON in every step along the way. The JSON is just the serialization of the data model I have to work with. It shouldn't matter what they do. On my side, what I have to do is to receive the JSON, parse it into the data model, and rebuild the JSON as per the protocol's hashing spec (no escaping).
I bumped into this thread because today our app was rejecting a message with \u2028
escaped into a string and it took me 4 hours to figure out that Gson had hardcoded that escaping rule, ignoring the disable HTML escaping setting, which I was aware of.
@vitorpamplona I Agree with @Marcono1234 that I don't think that dealing with \u2028
and \u2029
really has anything to do with strictness. In the JSON spec does not define any special behaviour for these characters. They may be escaped, but they do not have to be escaped.
If we want to start treating \u2028
and \u2029
in some special way, I think we should introduce a new setting for this.
I will go back to my initial point. Whatever you choose to do, it would be nice to have a common behavior/setting name for escaping or not escaping \u2028
and \u2029
among all languages. At the least, there should be a setting to emulate what other libs/languages do so that we are not left with a big replaceAll clause after the json string has been created.
Some serialization formats do define a canonical form, so that the same input will always produce the exact same output. Plain JSON is not one of those formats. There are all sorts of things that can vary: you can insert whitespace between tokens; you can write a quote in a string as \"
or as \u0022
; you can change the order of name/value pairs within a JSON object.
There is a standard JSON Canonicalization Scheme (JCS), which defines exactly what bytes should be produced for any given object. However, I think that's different from what we're trying to do here. We could imagine later having a Strictness
value that produces JCS on output and perhaps only accepts JCS on input. Or maybe this would be more appropriate as a FormattingStyle
. Feel free to log a separate feature request if you think that would be useful. Gson might not be the best home for such a feature, though.
I agree, but like it or not, JSON is an extremely common canonicalization form. Most of us don't have any choice in the matter.
However, the issue here is not about an overarching canonical form for JSON. It's much, much simpler. It's simply a practically-helpful way of escaping Strings from the native representation of every language into the JSON String. That's it. No big dreams. All it takes is lib devs to come together and agree on 2-3 schemes, with commonly supported options, and ways to activate and deactivate them in the lib. It's not a standard. Just a lib implementation consensus.
It seems to me that the case you described requires a strict canonical format, which I think should be JCS. If we just canonicalize what string literals look like we're only solving half the problem.
Then solve half of the problem. I can guarantee you the other half is fully solved already :)
@eamonnmcmanus, this can be closed now since #2437 was merged, right?
(I assume GitHub did not automatically close it because the first pull request targeted the strictness
branch.)
Since this issue was only about parsing, I guess it would be good to create a separate issue if you want any change to how Gson emits JSON data, as mentioned above by Éamonn:
Feel free to log a separate feature request if you think that would be useful. Gson might not be the best home for such a feature, though.
As noted in the documentation for
JsonReader.setLenient
, there are a number of areas where Gson accepts JSON inputs that are not strictly conformant, even when lenient mode is off:To which we could add that it allows a trailing comma in JSON arrays, for example
["foo", "bar",]
as noted in #494.We could imagine that
GsonBuilder
would acquire a methodsetStrict()
, mutually exclusive withsetLenient()
and thatJsonReader
would likewise acquiresetStrict(boolean)
andisStrict()
. We can't make strict mode the default, for fear of breaking existing users, but we could recommend it.