Closed. janssk1 closed this issue 4 years ago.
Content encoding now also references RFC 4648 already, so that takes care of hex (among other things).
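For illustration, a minimal sketch of the RFC 4648 `base16` (hex) encoding that the reference now covers, using only Python's standard library:

```python
import base64

# RFC 4648 section 8 defines "base16", i.e. hex; Python exposes it directly.
encoded = base64.b16encode(b"\xde\xad\xbe\xef")   # b'DEADBEEF'
decoded = base64.b16decode(encoded)               # round-trips to b'\xde\xad\xbe\xef'
```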
As for RFC 2045 (not 2054), that being the reference for this keyword long predates the involvement of any of the current specification editors. As we are likely to publish the next draft this week (probably tomorrow, in fact), I'm not going to make a backwards-incompatible change of removing possible values for the field right now.
In any event, the point of that field is to tell you that the string in question was encoded that way. The spec no longer treats this field as a sort-of ambiguous validation constraint. It is now strictly an annotation: the application will need to decode the string so it can "validate" it that way (the security considerations note the potential problems with this).
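A hedged sketch of that division of labor (the schema and instance here are illustrative, not taken from the spec):

```python
import base64

# The schema merely annotates the string; it does not validate the encoding.
schema = {"type": "string", "contentEncoding": "base64"}
instance = "aGVsbG8gd29ybGQ="

# A conforming implementation just reports the "base64" annotation to the
# calling application, which must do the decoding (and any "validation") itself.
raw = base64.b64decode(instance)   # b'hello world'
```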
So I think this works fine as-is, but I'll leave this open and you're welcome to make a further case for changing it in the next draft (there will be at least one more).
Thanks for the info. RFC 4648 does seem a better reference. IMHO, dropping the reference to RFC 2045 (and the associated 7bit, 8bit, binary) would remove the confusion. But I understand this may break backward compatibility (although I doubt anybody is using 8bit etc., since it's not clear what it actually means for a JSON string).
@janssk1 wow this is a mess, thanks for calling this out.
So, to summarize, as best as I can figure out:

- The history of `contentEncoding` is a bit convoluted:
  - it began as the `binaryEncoding` property of the `media` object (the current editors took over after draft-04)
  - it became `contentEncoding` in draft-07
- Its values come from the MIME `Content-Transfer-Encoding` header's "mechanisms" (RFC 2045, section 6.1), which are a mixture of two different things:
  - `7bit`, `8bit`, and `binary` are domains (only `binary` is relevant to JSON Schema)
  - `base64` and `quoted-printable` are encodings (along with the implicit identity encoding)
  - `7bit`, `8bit`, and `binary` as values all imply the identity encoding
  - `base64` and `quoted-printable` are relevant to actually encoding binary data into a 7-bit range
  - `quoted-printable` is actually for encoding 8-bit ASCII (octets) into printable 7-bit ASCII, and while it could be used on arbitrary octets, this is extremely inefficient
- RFC 4648's `base64` avoids line length limitations and ignores rather than drops unencodable characters; RFC 4648 has no `quoted-printable` equivalent

I'm going to pause here so I don't lose all of this, more comments shortly.
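To make the efficiency point concrete, a small sketch using Python's standard library (`quopri` implements RFC 2045 quoted-printable):

```python
import base64, quopri

octets = bytes([0, 1, 2, 3, 4, 5, 6, 7])   # arbitrary binary data
b64 = base64.b64encode(octets)             # b'AAECAwQFBgc=' -- 12 chars
qp = quopri.encodestring(octets)           # b'=00=01=02...' -- 3 chars per octet
print(len(b64), len(qp))                   # 12 24: base64 is far more compact
```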
I think what we want to do here is emphasize RFC 4648, as it is the most relevant to modern web apps shoving content through JSON. Which people do sometimes do. For Reasons.
The place where `quoted-printable` or the specific behavior of `base64` as found in RFC 2045 matters is if the JSON is being used with email or otherwise demands old-school MIME compatibility. I don't know how often that is an issue, but that is something that comes up in Hyper-Schema (note the use of `contentMediaType` to work with one section of a multipart email body).
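For that MIME-compat case, a hedged example (illustrative values, not taken from Hyper-Schema) of a schema describing one quoted-printable section of a multipart email body:

```python
# Illustrative schema only; the keyword semantics are as discussed above.
schema = {
    "type": "string",
    "contentMediaType": "text/html",
    "contentEncoding": "quoted-printable",
}
```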
So I think what I want to do is:

- reference `quoted-printable` specifically from RFC 2045 rather than just pointing at a vague section
- note the RFC 2045 `base64`, which could be appropriate in MIME contexts, but otherwise assume the RFC 4648 `base64`
Or something like that. Thoughts?
I would suggest only mentioning RFC 4648 (the base* encodings). You can also mention that it's allowed to use other values for contentEncoding, but then it's up to the user to know what they mean. I'm sure other weird character encodings exist apart from 'quoted-printable', and referencing just this one is unnecessary and distracting.
@janssk1 while I'm sure there are other "weird" character encodings, `quoted-printable` is a very well-established one that's been referenced by JSON Schema from the beginning, and I'm reluctant to drop it. Particularly because the implementation burden is essentially non-existent: a JSON Schema implementation need only inform the calling application about the encoding. It does not need to provide encoding/decoding services.
I don't really see much downside to leaving it in.
To add some fuel to the fire, we define the following in OpenAPI:
```json
{
  "di": 2000,
  "da": 2100,
  "mat": "steel",
  "dwg_template": "cid:sfsf@sdgfs"
}
```
Where the CID points to the binary data.
@michael-o that looks like a URI (URN?) of some sort, which is very different from what these keywords are doing. These keywords are only for content embedded in JSON strings, not references to content.
@handrews, that is correct. RFC 2392. We are sending `multipart/related` where we use CID URLs to reference out-of-message information, mostly binary data which cannot be transported in JSON. This is basically the same approach as with SOAP MTOM.
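A hedged sketch of that `multipart/related` approach, reusing the `cid:` value from the example above and assuming Python's standard `email` package:

```python
from email.mime.multipart import MIMEMultipart
from email.mime.application import MIMEApplication

# One multipart/related message: a JSON part plus the binary part it
# references by Content-ID (RFC 2392).
msg = MIMEMultipart("related")
msg.attach(MIMEApplication(b'{"dwg_template": "cid:sfsf@sdgfs"}', "json"))

drawing = MIMEApplication(b"<binary drawing bytes>", "octet-stream")
drawing["Content-ID"] = "<sfsf@sdgfs>"
msg.attach(drawing)
```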
@handrews, I noticed that the link https://json-schema.org/understanding-json-schema/reference/non_json_data.html is not updated. It still contains the same text as reported in this ticket.
Also, how should someone express in JSON Schema that a certain string contains base64'ed, gzipped binary data? In HTTP, the Content-Encoding header is used to express what kind of (binary) compression is used. In JSON Schema, contentEncoding expresses what encoding is used to transfer binary data as (UTF-8) text. Should there be another keyword in JSON Schema to model compression as well (e.g. contentCompression)?
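A sketch of the handling being described, assuming Python's standard library (`contentEncoding: "base64"` covers only the text-safe layer; the gzip layer is out of band):

```python
import base64, gzip

# What a producer would do for a gzipped, base64'ed string field:
payload = base64.b64encode(gzip.compress(b"some binary data")).decode("ascii")

# The schema can annotate the base64 layer, but the consumer has to know
# separately that the decoded bytes are gzip-compressed:
raw = gzip.decompress(base64.b64decode(payload))   # b'some binary data'
```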
@janssk1 I think this is a transport issue, not related to the schema. You can turn that into an object having `value` and `compression` properties, same as in the `Content-Encoding` header.
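A hedged sketch of a schema for that wrapper object (the property names come from the comment above; the `compression` values are illustrative Content-Encoding tokens):

```python
schema = {
    "type": "object",
    "properties": {
        "compression": {"enum": ["identity", "gzip", "br"]},
        "value": {"type": "string", "contentEncoding": "base64"},
    },
    "required": ["value"],
}
```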
In the end, Base64 is the worst approach. We are externalizing binary resources with `multipart/related` messages.
@michael-o If the complete JSON document is gzipped, then it is indeed a transport issue. I'm talking about the case where just one field of the JSON contains gzipped, base64'ed content.
If I understand you correctly, you are suggesting to explicitly manage the compression in JSON, not JSON Schema. Example below. That's possible, but it means you have to pollute data messages with meta information, something a schema should do AFAIK.
{"some": {"normal": "json"},
"my-data": "
Yes, you will have to pollute it unfortunately, because you don't have a separation like in HTTP between headers and body. What you could also do is detect compression on the fly by peeking at magic bytes.
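A minimal sketch of that magic-byte peek, relying on gzip's two-byte signature `0x1f 0x8b`:

```python
import gzip

def maybe_gunzip(data: bytes) -> bytes:
    # gzip streams begin with the magic bytes 0x1f 0x8b; anything else is
    # passed through untouched.
    return gzip.decompress(data) if data[:2] == b"\x1f\x8b" else data
```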
@janssk1 Understanding JSON Schema, while loosely under the JSON Schema Organization umbrella, has its own repository and maintainers. This repository is only for the specification itself. Understanding JSON Schema also only covers up to draft-07, currently, I think, so any change to a forthcoming draft will of course not be reflected there yet.
As for your gzip issue, JSON Schema does not claim to have a keyword for every possible use case of putting things in JSON. This is why, as of draft 2019-09, we have added features to support modular, re-usable extension vocabularies. As 2019-09 (or later) implementations become available, we are encouraging folks to design extension keywords to handle use cases that are not covered by the core and validation specification documents. If you want to request a keyword in a vocabulary, there is a JSON Schema Vocabularies repository where you can file that.
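As a hedged illustration only (the `contentCompression` keyword is the idea floated above, not part of any published vocabulary):

```python
schema = {
    "$schema": "https://json-schema.org/draft/2019-09/schema",
    "type": "string",
    "contentEncoding": "base64",
    "contentCompression": "gzip",   # hypothetical extension keyword
}
```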
Note that we (the JSON Schema Organization) are not building extension vocabularies. We are trying to finalize the core and validation specification into standards. We are just providing a repository where people can park ideas for keywords and other people who would like to collaborate on extensions can find those ideas.
Please, if relevant, file the appropriate issues with the appropriate repository. This issue is closed and should not be used for additional concerns. You are also welcome to join our slack workspace to discuss things in more detail (there's a "Join Our Slack" link on json-schema.org in the upper right corner). GitHub issues are not a great place for a lot of back-and-forth on things.
https://json-schema.org/understanding-json-schema/reference/non_json_data.html
defines 7bit, 8bit, binary, quoted-printable and base64 as values for contentEncoding and refers to RFC 2054, section 6.1 to justify that list.
However, that particular RFC has a slightly different goal: it defines how to optionally encode binary octets as characters for transfer. But for the JSON Schema case, the binary data must be encoded for it to fit in a JSON structure, so 7bit, 8bit and binary don't make sense.
PS: It would be good to include hex as an encoding option as well.