json-schema-org / json-schema-spec

The JSON Schema specification
http://json-schema.org/

7bit, 8bit, binary don't make sense for contentEncoding #803

Closed. janssk1 closed this issue 4 years ago.

janssk1 commented 5 years ago

https://json-schema.org/understanding-json-schema/reference/non_json_data.html

defines 7bit, 8bit, binary, quoted-printable and base64 as values for contentEncoding and refers to RFC 2054, section 6.1, to justify that list.

However, that particular RFC has a slightly different goal: it defines how binary octets can optionally be represented as characters for mail transport. For the JSON Schema case, the binary data must be encoded before it can fit in a JSON structure at all, so 7bit, 8bit and binary don't make sense.

PS: It would be good to include hex as an encoding option as well.
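(For context, a minimal sketch of the kind of schema under discussion; the property name "attachment" is made up for illustration:)

{
  "type": "object",
  "properties": {
    "attachment": {
      "type": "string",
      "contentEncoding": "base64"
    }
  }
}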

handrews commented 5 years ago

contentEncoding now also references RFC 4648, so that takes care of hex (among other things).
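For example, with RFC 4648 in play, a hex-encoded string could presumably be declared along these lines (sketch only):

{
  "type": "string",
  "contentEncoding": "base16"
}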

As for RFC 2045 (not 2054), that being the reference for this keyword long predates the involvement of any of the current specification editors. As we are likely to publish the next draft this week (probably tomorrow, in fact), I'm not going to make a backwards-incompatible change of removing possible values for the field right now.

In any event, the point of that field is to tell you that the string in question was encoded that way. The spec no longer treats this field as a sort-of ambiguous validation constraint. It is now strictly an annotation: the application will need to decode the string and "validate" the decoded content itself (the security considerations note the potential problems with this).

So I think this works fine as-is, but I'll leave this open and you're welcome to make a further case for changing it in the next draft (there will be at least one more).

janssk1 commented 5 years ago

Thanks for the info. RFC 4648 does seem a better reference. IMHO, dropping the reference to RFC 2045 (and the associated 7bit, 8bit, binary values) would remove the confusion. But I understand this may break backward compatibility (although I doubt anybody is using 8bit etc., since it's not clear what it would actually mean for a JSON string).

handrews commented 4 years ago

@janssk1 wow this is a mess, thanks for calling this out.

So, to summarize, as best as I can figure out:

I'm going to pause here so I don't lose all of this, more comments shortly.

handrews commented 4 years ago

I think what we want to do here is emphasize RFC 4648, as it is the most relevant to modern web apps shoving content through JSON. Which people sometimes do. For Reasons.

The place where quoted-printable, or the specific behavior of base64 as found in RFC 2045, matters is when the JSON is being used with email or otherwise has to maintain old-school MIME compatibility. I don't know how often that is an issue, but it is something that comes up in Hyper-Schema (note the use of "contentMediaType" to work with one section of a multipart email body).
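A rough sketch of that MIME-flavored case, assuming draft-07-style keywords (the media type here is just an example):

{
  "type": "string",
  "contentMediaType": "text/html",
  "contentEncoding": "quoted-printable"
}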

So I think what I want to do is:

Or something like that. Thoughts?

janssk1 commented 4 years ago

I would suggest mentioning only RFC 4648 (the base* encodings). You can also mention that it's allowed to use other values for contentEncoding, but then it's up to the user to know what they mean. I'm sure there exist other weird character encodings apart from 'quoted-printable', and referencing just this one is unnecessary and distracting.

handrews commented 4 years ago

@janssk1 while I'm sure there are other "weird" character encodings, quoted-printable is a very well-established one that's been referenced by JSON Schema from the beginning, and I'm reluctant to drop it. Particularly because the implementation burden is essentially non-existent. A JSON Schema implementation need only inform the calling application about the encoding. It does not need to provide encoding/decoding services.

I don't really see much downside to leaving it in.

michael-o commented 4 years ago

To add some fuel to the fire, we define the following in OpenAPI:

{
  "di": 2000,
  "da": 2100,
  "mat": "steel",
  "dwg_template": "cid:sfsf@sdgfs"
}

Where the CID points to the binary data.

handrews commented 4 years ago

@michael-o that looks like a URI (URN?) of some sort, which is very different from what these keywords are doing. These keywords are only for content embedded in JSON strings, not references to content.
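To sketch the distinction (the dwg_template name comes from the example above; dwg_inline is invented purely for contrast):

{
  "properties": {
    "dwg_template": { "type": "string", "format": "uri" },
    "dwg_inline": {
      "type": "string",
      "contentEncoding": "base64",
      "contentMediaType": "application/octet-stream"
    }
  }
}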

michael-o commented 4 years ago

@handrews , that is correct. RFC 2392. We are sending multipart/related where we use CID URLs to reference out-of-message information, mostly binary data which cannot be transported in JSON. This is basically the same approach as with SOAP MTOM.

janssk1 commented 4 years ago

@handrews , I noticed that the link https://json-schema.org/understanding-json-schema/reference/non_json_data.html has not been updated. It still contains the same text as reported in this ticket.

janssk1 commented 4 years ago

Also, how should someone express in JSON Schema that a certain string contains base64'ed gzipped binary data? In HTTP, the Content-Encoding header is used to express what kind of compression is applied. In JSON Schema, contentEncoding expresses what encoding is used to turn binary data into (UTF-8) text. Should there be another keyword in JSON Schema to model compression as well (e.g. contentCompression)?
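To make the question concrete, a schema for such a field might look like this, where contentCompression is a hypothetical keyword that exists in no draft:

{
  "type": "string",
  "contentEncoding": "base64",
  "contentMediaType": "application/octet-stream",
  "contentCompression": "gzip"
}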

michael-o commented 4 years ago

@janssk1 I think this is a transport issue, not related to the schema. You can turn that into an object with value and compression properties, same as with the Content-Encoding header.

michael-o commented 4 years ago

In the end, Base64 is the worst approach. We are externalizing binary resources with multipart/related messages.

janssk1 commented 4 years ago

@michael-o If the complete JSON document is gzipped, then it is indeed a transport issue. I'm talking about the case where just one field of the JSON contains gzipped, base64'ed content.

If I understand you correctly, you are suggesting to manage the compression explicitly in the JSON data, not in JSON Schema. Example below. That's possible, but it means you have to pollute data messages with meta-information, something a schema should handle AFAIK.

{"some": {"normal": "json"}, "my-data": "", "my-data-compression": "gzip" }

michael-o commented 4 years ago

Yes, you will have to pollute it unfortunately, because you don't have a separation like HTTP's headers and body. What you could also do is detect compression on the fly by peeking at magic bytes.

handrews commented 4 years ago

@janssk1 Understanding JSON Schema, while loosely under the JSON Schema Organization umbrella, has its own repository and maintainers. This repository is only for the specification itself. Understanding JSON Schema also only covers up to draft-07, currently, I think, so any change to a forthcoming draft will of course not be reflected there yet.

As for your gzip issue, JSON Schema does not claim to have a keyword for every possible use case of putting things in JSON. This is why, as of draft 2019-09, we have added features to support modular, re-usable extension vocabularies. As 2019-09 (or later) implementations become available, we are encouraging folks to design extension keywords to handle use cases that are not covered by the core and validation specification documents. If you want to request a keyword in a vocabulary, there is a JSON Schema Vocabularies repository where you can file that.
Note that we (the JSON Schema Organization) are not building extension vocabularies. We are trying to finalize the core and validation specification into standards. We are just providing a repository where people can park ideas for keywords and other people who would like to collaborate on extensions can find those ideas.
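A very rough sketch of how such an extension could be wired up in 2019-09, where the example.com identifiers and the compression vocabulary are entirely hypothetical (meta-schemas declare vocabularies with $vocabulary):

{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "$id": "https://example.com/meta/content-compression",
  "$vocabulary": {
    "https://json-schema.org/draft/2019-09/vocab/core": true,
    "https://json-schema.org/draft/2019-09/vocab/applicator": true,
    "https://json-schema.org/draft/2019-09/vocab/validation": true,
    "https://json-schema.org/draft/2019-09/vocab/content": true,
    "https://example.com/vocab/content-compression": false
  },
  "allOf": [ { "$ref": "https://json-schema.org/draft/2019-09/schema" } ]
}

A schema that wants the extra keyword would then point its $schema at that meta-schema.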

Please, if relevant, file the appropriate issues with the appropriate repository. This issue is closed and should not be used for additional concerns. You are also welcome to join our slack workspace to discuss things in more detail (there's a "Join Our Slack" link on json-schema.org in the upper right corner). GitHub issues are not a great place for a lot of back-and-forth on things.