ietf-wg-httpapi / mediatypes


Consider removing restated JSON subset requirements. #87

Closed: sayrer closed this issue 1 year ago

sayrer commented 1 year ago

I think it might be better to start Section 3.4 with a one sentence paragraph:

The YAML data model is a superset of that provided by JSON.

then conclude with the first two existing paragraphs of Section 3.4 YAML and JSON.

The reasoning here is that this document aims not to "imply a specific version", but the JSON mapping could change in the future. This has already happened between the various versions of JSON standards floating around out there.

There is also already a difference between YAML section 5.2 and this document:

On input, a YAML processor must support the UTF-8 and UTF-16 character encodings. For JSON compatibility, the UTF-32 encodings must also be supported.

But this document says:

* non UTF-8 encoding, since YAML supports UTF-16 and UTF-32 in addition to UTF-8;

You're saying these will cause problems (I agree they probably will...), but the YAML spec says they must be supported for JSON compatibility.
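For illustration only, here is a minimal sketch of what "supporting the three encodings" looks like from the JSON side, using Python's standard `json` module, which is documented to accept UTF-8, UTF-16, or UTF-32 bytes. The sample document and the `decode_all` helper are hypothetical and not taken from either spec.

```python
import json

# Hypothetical sample document, chosen to include a non-ASCII character.
DOC = '{"greeting": "héllo"}'


def decode_all(text):
    """Encode `text` in each encoding and parse the raw bytes as JSON."""
    results = {}
    for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
        # json.loads accepts bytes and detects the encoding itself.
        results[encoding] = json.loads(text.encode(encoding))
    return results


decoded = decode_all(DOC)
for encoding, value in decoded.items():
    print(encoding, value)
```

All three byte streams parse to the same value here, but only because this particular parser sniffs the encoding. A consumer that assumes UTF-8 would fail on the other two, which is the interoperability concern behind the bullet.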

By the way, the most permissive standard, ECMA-404, says only "JSON syntax describes a sequence of Unicode code points." It says that to cover the stuff browsers might put in it, which might almost look like UTF-*, but can contain ill-formed byte sequences.

ioggstream commented 1 year ago

Hi @sayrer, and thanks for your review!

wrt encoding

A timeline is required here.

So, YAML introduced UTF-32 for compatibility with a 2006 JSON specification. This is the reason we suggest UTF-8 for interoperability. Probably this timeline is worth an FAQ, but I don't think it can make it into the spec. I summon the Chairs here cc: @darrelmiller

wrt ECMA and JSON

I found this mail from Roy Fielding on the URI RFC vs. WHATWG very interesting, particularly the part where it draws a line between rules to "parse" something and rules to "interpret" something: https://lists.w3.org/Archives/Public/ietf-http-wg/2022AprJun/0173.html

wrt YAML/JSON superset

We have been explicitly asked by the YAML community not to write that "YAML is a superset of JSON". I think @eemeli is the expert here, but I think the current wording correctly reflects the dialogue that we had with the YAML folks.

sayrer commented 1 year ago

I think you should revisit this thinking. See Section 1.2 of the YAML spec. It says: "The YAML 1.2 specification was published in 2009. Its primary focus was making YAML a strict superset of JSON."

You are right about what RFC 8259 says, but this is not how JSON actually works. That RFC is a good one to follow if you are being "conservative in what you send", but of course there is the other part of Postel's law here.

Even if you disagree with all of the above, the encoding bullet point is wrong and needs to be edited.

The reason is right there in RFC 8259: "However, the ABNF in this specification allows member names and string values to contain bit sequences that cannot encode Unicode characters". I think it should have plainly said "because JavaScript does this...", but no one asked me.

The YAML spec insists on three kinds of UTF for JSON compatibility. Even so, YAML is not actually a superset of JSON, for the reason I just gave. But the data model is, which is why I used that term.
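To make the ABNF point concrete, here is a small sketch using Python's standard `json` module. It parses the `"\uDEAD"` example that RFC 8259 itself gives for an unpaired surrogate; the `is_utf8_encodable` helper is a hypothetical name used only for illustration, and other parsers may behave differently.

```python
import json


def is_utf8_encodable(text):
    """Return True if `text` can be serialized as well-formed UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# Grammatically valid JSON string holding a single unpaired surrogate.
lone = json.loads('"\\uDEAD"')

# A proper surrogate pair, which decodes to one code point (U+1F600).
pair = json.loads('"\\ud83d\\ude00"')

print(is_utf8_encodable(lone))  # False: a lone surrogate has no UTF-8 form
print(is_utf8_encodable(pair))  # True
```

The first value is accepted by the parser yet cannot be written back out as UTF-8, UTF-16, or UTF-32, so no choice of encoding on the YAML side closes that gap.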

eemeli commented 1 year ago

As far as I can tell, the only place where YAML currently fails to be a superset of JSON is key uniqueness. For JSON, RFC 8259 states:

The names within an object SHOULD be unique.

and ECMA-404 even more loosely:

The JSON syntax [...] does not require that name strings be unique

Meanwhile, according to the YAML spec:

[A mapping] represents an associative container, where each key is unique in the association and mapped to exactly one value.

To validate compatibility for my own library, I include a custom harness for the JSON Parsing Test Suite that only skips the test cases for map key uniqueness.
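As a rough illustration of the key-uniqueness gap (this is not the harness mentioned above), the sketch below uses Python's standard `json` module. By default it keeps the last value for a repeated name, and `object_pairs_hook` can be used to enforce YAML-style uniqueness instead. `DuplicateKeyError`, `reject_duplicates`, and `loads_unique` are hypothetical names.

```python
import json


class DuplicateKeyError(ValueError):
    """Raised when a JSON object repeats a member name."""


def reject_duplicates(pairs):
    """Build a dict from (name, value) pairs, refusing repeated names."""
    result = {}
    for name, value in pairs:
        if name in result:
            raise DuplicateKeyError(f"duplicate member name: {name!r}")
        result[name] = value
    return result


def loads_unique(text):
    """Parse JSON text, enforcing unique names as a YAML mapping would."""
    return json.loads(text, object_pairs_hook=reject_duplicates)


DOC = '{"a": 1, "a": 2}'

# The default parser accepts the document and the last value wins.
print(json.loads(DOC))

try:
    loads_unique(DOC)
except DuplicateKeyError as exc:
    print("rejected:", exc)
```

A document like `DOC` is acceptable under ECMA-404 and only discouraged by RFC 8259, but it has no valid representation as a YAML mapping.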

ioggstream commented 1 year ago

Hi @sayrer, please check if #88 clarifies your concerns.

sayrer commented 1 year ago

Yes, that's a lot better imho.

If a library passes the JSON Parsing Test Suite to the extent that it matches the "JavaScript" column, it's accepting invalid UTF-8 in some situations. This is why RFC 8259 mentions the ABNF carve-out, and ECMA-404 says only that JSON is "a sequence of Unicode code points" (not characters).