json-schema-org / json-schema-spec

The JSON Schema specification
http://json-schema.org/

7bit, 8bit, binary don't make sense for contentEncoding #803

Closed. janssk1 closed this issue 4 years ago.

janssk1 commented 5 years ago

https://json-schema.org/understanding-json-schema/reference/non_json_data.html

defines 7bit, 8bit, binary, quoted-printable and base64 as values for contentEncoding and refers to RFC 2054, section 6.1, to justify that list.

However, that particular RFC has a slightly different goal: it defines how binary octets can optionally be represented as characters for mail transport. For the JSON Schema case, the binary data must be encoded before it can fit in a JSON structure at all, so 7bit, 8bit and binary don't make sense.

PS: It would be good to include hex as an encoding option as well.
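(For context, a minimal sketch of the kind of schema under discussion; the property name "attachment" is made up for illustration:)

{
  "type": "object",
  "properties": {
    "attachment": {
      "type": "string",
      "contentEncoding": "base64"
    }
  }
}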

handrews commented 5 years ago

contentEncoding now also references RFC 4648, so that takes care of hex (among other things).
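For example, with RFC 4648 in play, a hex-encoded string could presumably be declared along these lines (sketch only):

{
  "type": "string",
  "contentEncoding": "base16"
}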

As for RFC 2045 (not 2054), that being the reference for this keyword long predates the involvement of any of the current specification editors. As we are likely to publish the next draft this week (probably tomorrow, in fact), I'm not going to make a backwards-incompatible change of removing possible values for the field right now.

In any event, the point of that field is to tell you that the string in question was encoded that way. The spec no longer treats this field as a sort-of ambiguous validation constraint. It is now strictly an annotation: the application will need to decode the string and "validate" the decoded content itself (the security considerations note the potential problems with this).

So I think this works fine as-is, but I'll leave this open and you're welcome to make a further case for changing it in the next draft (there will be at least one more).

janssk1 commented 5 years ago

Thanks for the info. RFC 4648 does seem a better reference. IMHO, dropping the reference to RFC 2045 (and the associated 7bit, 8bit, binary values) would remove the confusion. But I understand this may break backward compatibility (although I doubt anybody is using 8bit etc., since it's not clear what it would actually mean for a JSON string).

handrews commented 4 years ago

@janssk1 wow this is a mess, thanks for calling this out.

So, to summarize, as best as I can figure out:

I'm going to pause here so I don't lose all of this, more comments shortly.

handrews commented 4 years ago

I think what we want to do here is emphasize RFC 4648, as it is the most relevant to modern web apps shoving content through JSON. Which people sometimes do. For Reasons.

The place where quoted-printable, or the specific behavior of base64 as found in RFC 2045, matters is when the JSON is being used with email or otherwise has to maintain old-school MIME compatibility. I don't know how often that is an issue, but it is something that comes up in Hyper-Schema (note the use of "contentMediaType" to work with one section of a multipart email body).
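A rough sketch of that MIME-flavored case, assuming draft-07-style keywords (the media type here is just an example):

{
  "type": "string",
  "contentMediaType": "text/html",
  "contentEncoding": "quoted-printable"
}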

So I think what I want to do is:

Or something like that. Thoughts?

janssk1 commented 4 years ago

I would suggest mentioning only RFC 4648 (the base* encodings). You can also mention that it's allowed to use other values for contentEncoding, but then it's up to the user to know what they mean. I'm sure there exist other weird character encodings apart from 'quoted-printable', and referencing just this one is unnecessary and distracting.

handrews commented 4 years ago

@janssk1 while I'm sure there are other "weird" character encodings, quoted-printable is a very well-established one that's been referenced by JSON Schema from the beginning, and I'm reluctant to drop it. Particularly because the implementation burden is essentially non-existent. A JSON Schema implementation need only inform the calling application about the encoding. It does not need to provide encoding/decoding services.

I don't really see much downside to leaving it in.

michael-o commented 4 years ago

To add some fuel to the fire, we define the following in OpenAPI:

{
  "di": 2000,
  "da": 2100,
  "mat": "steel",
  "dwg_template": "cid:sfsf@sdgfs"
}

Where the CID points to the binary data.

handrews commented 4 years ago

@michael-o that looks like a URI (URN?) of some sort, which is very different from what these keywords are doing. These keywords are only for content embedded in JSON strings, not references to content.
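To sketch the distinction (the dwg_template name comes from the example above; dwg_inline is invented purely for contrast):

{
  "properties": {
    "dwg_template": { "type": "string", "format": "uri" },
    "dwg_inline": {
      "type": "string",
      "contentEncoding": "base64",
      "contentMediaType": "application/octet-stream"
    }
  }
}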

michael-o commented 4 years ago

@handrews , that is correct. RFC 2392. We are sending multipart/related where we use CID URLs to reference out-of-message information, mostly binary data which cannot be transported in JSON. This is basically the same approach as with SOAP MTOM.

janssk1 commented 4 years ago

@handrews , I noticed that the link https://json-schema.org/understanding-json-schema/reference/non_json_data.html has not been updated. It still contains the same text as reported in this ticket.

janssk1 commented 4 years ago

Also, how should someone express in JSON Schema that a certain string contains base64'ed gzipped binary data? In HTTP, the Content-Encoding header is used to express what kind of compression is applied. In JSON Schema, contentEncoding expresses what encoding is used to turn binary data into (UTF-8) text. Should there be another keyword in JSON Schema to model compression as well (e.g. contentCompression)?
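To make the question concrete, a schema for such a field might look like this, where contentCompression is a hypothetical keyword that exists in no draft:

{
  "type": "string",
  "contentEncoding": "base64",
  "contentMediaType": "application/octet-stream",
  "contentCompression": "gzip"
}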

michael-o commented 4 years ago

@janssk1 I think this is a transport issue, not related to the schema. You can turn that into an object with value and compression properties, same as with the Content-Encoding header.

michael-o commented 4 years ago

In the end, Base64 is the worst approach. We are externalizing binary resources with multipart/related messages.

janssk1 commented 4 years ago

@michael-o If the complete JSON document is gzipped, then it is indeed a transport issue. I'm talking about the case where just one field of the JSON contains gzipped, base64'ed content.

If I understand you correctly, you are suggesting to manage the compression explicitly in the JSON data, not in JSON Schema. Example below. That's possible, but it means you have to pollute data messages with meta-information, something a schema should handle AFAIK.

{"some": {"normal": "json"}, "my-data": "", "my-data-compression": "gzip" }

michael-o commented 4 years ago

Yes, you will have to pollute it unfortunately, because you don't have a separation like HTTP's headers and body. What you could also do is detect compression on the fly by peeking at magic bytes.

handrews commented 4 years ago

@janssk1 Understanding JSON Schema, while loosely under the JSON Schema Organization umbrella, has its own repository and maintainers. This repository is only for the specification itself. Understanding JSON Schema also only covers up to draft-07, currently, I think, so any change to a forthcoming draft will of course not be reflected there yet.

As for your gzip issue, JSON Schema does not claim to have a keyword for every possible use case of putting things in JSON. This is why, as of draft 2019-09, we have added features to support modular, re-usable extension vocabularies. As 2019-09 (or later) implementations become available, we are encouraging folks to design extension keywords to handle use cases that are not covered by the core and validation specification documents. If you want to request a keyword in a vocabulary, there is a JSON Schema Vocabularies repository where you can file that.
Note that we (the JSON Schema Organization) are not building extension vocabularies. We are trying to finalize the core and validation specification into standards. We are just providing a repository where people can park ideas for keywords and other people who would like to collaborate on extensions can find those ideas.
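A very rough sketch of how such an extension could be wired up in 2019-09, where the example.com identifiers and the compression vocabulary are entirely hypothetical (meta-schemas declare vocabularies with $vocabulary):

{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "$id": "https://example.com/meta/content-compression",
  "$vocabulary": {
    "https://json-schema.org/draft/2019-09/vocab/core": true,
    "https://json-schema.org/draft/2019-09/vocab/applicator": true,
    "https://json-schema.org/draft/2019-09/vocab/validation": true,
    "https://json-schema.org/draft/2019-09/vocab/content": true,
    "https://example.com/vocab/content-compression": false
  },
  "allOf": [ { "$ref": "https://json-schema.org/draft/2019-09/schema" } ]
}

A schema that wants the extra keyword would then point its $schema at that meta-schema.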

Please, if relevant, file the appropriate issues with the appropriate repository. This issue is closed and should not be used for additional concerns. You are also welcome to join our slack workspace to discuss things in more detail (there's a "Join Our Slack" link on json-schema.org in the upper right corner). GitHub issues are not a great place for a lot of back-and-forth on things.