handrews commented 6 years ago

[this is a bit stream-of-consciousness, but I wanted to get it filed because I keep forgetting; we'll clean up the ideas here on the way to PRs]

format confuses pretty much everyone. I have noticed people filing issues against various implementations complaining of imperfect enforcement (I believe @Julian has received complaints about "email", and @johandorland about "hostname", and I suspect they are not alone).

format, contentMediaType, and contentEncoding are essentially best effort validation keywords in practice. Many if not most implementations make at least some effort to validate format. I'm not sure if anything attempts that for content* as they are new (at least as part of the validation spec), and they would essentially require parsing the string encoding and media type which is potentially very expensive.

Complicating the matter for format is the fact that many of the relatively fundamental internet-related formats such as "email" and "hostname" are very old, and conformance to specifications is rather complicated.

For "hostname", RFC 1034 forbids leading digits, but this is sometimes ignored in practice, leading to ambiguous overlap with "ipv4" as a format. In practice, most programs that accept hostnames will also accept ipv4 addresses and just recognize that no DNS resolution is required, so this is rarely a concern.
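To make the overlap concrete, here is a minimal Python sketch. It assumes a relaxed hostname rule that lets labels start with a digit (an assumption for illustration, not the draft's definition), and the helper names are hypothetical. Under that rule a dotted-quad string satisfies both checks, so a program that accepts either has to test for "ipv4" first.

```python
import ipaddress
import re

# Assumption: relaxed label rule where a label may start with a letter OR a digit.
_LABEL = re.compile(r"[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def looks_like_hostname(value: str) -> bool:
    """Syntactic check only; no DNS lookup is performed."""
    if not value:
        return False
    return all(_LABEL.fullmatch(label) for label in value.split("."))


def looks_like_ipv4(value: str) -> bool:
    """Delegate dotted-quad parsing to the standard library."""
    try:
        ipaddress.IPv4Address(value)
    except ValueError:
        return False
    return True


def classify(value: str) -> str:
    """Test ipv4 first, because ipv4 strings also pass the relaxed hostname rule."""
    if looks_like_ipv4(value):
        return "ipv4"
    if looks_like_hostname(value):
        return "hostname"
    return "neither"
```

With these definitions, "192.0.2.1" passes both looks_like_hostname and looks_like_ipv4, and classify resolves the ambiguity by ordering alone.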

The difficulty of validating email addresses, even on the syntactical level, is well-documented (try finding a regular expression that will do it, for instance, and if you find an actual iron-clad one, let me know).

Leveraging our relatively recent keyword classification work, I think it is best to classify these primarily as annotations rather than treating them as some sort of hybrid annotation+assertion. Annotations can specify any intent, including semantic validation or parsing instructions. The specification should provide guidance on how an implementation might directly offer handlers for such intents, and how to indicate the available level of support.

Applications can, as with any annotation, then perform additional processing if the implementation either does not offer any validation, or offers only incomplete validation. The spec already says that implementations SHOULD offer an ability to turn semantic validation off, so we can extend that guidance (probably at the MAY level) to cover situations like allowing hooks for application-defined processing in addition to or in place of implementation-supplied validation.

And of course, all of this depends on an implementation supporting annotations. As with the additionalProperties and additionalItems keywords (now in the core spec and defined in terms of annotation collection), the spec should allow existing implementations that do not implement general annotation collection to remain valid and in conformance.

johandorland commented 6 years ago

The spec could make it even clearer that format is best effort only, as almost every RFC it mentions is a bottomless pit of complexity. Having said that, end users rarely read the JSON schema RFC, and format is a pretty useful tool for validating common patterns, which is why it gets implemented fairly often. If you move this responsibility from the implementation to an application, it makes JSON schema less powerful and puts an extra burden on the developer. The current situation where we get bug reports about not strictly implementing some archaic 1990s RFC is annoying, but it's not that bad.

As for contentMediaType and contentEncoding, I agree. The scope of those two is just too broad, way broader than format, and basically impossible to properly implement for validation purposes. Sure, it's not that hard to check if a base64 encoded string is valid application/json, but I hope nobody expects their JSON schema implementation to verify whether something is indeed a properly encoded WebM video.

handrews commented 6 years ago

@johandorland for format, I need to figure out the right wording as "the application" could actually just be a set of plugins (or even built-in code) that are part of the implementation. They don't have to be literally part of the application that uses the implementation. Many implementations support format via plugins already. Mostly, I want to encourage:

- Plugins over hardcoding, unless of course someone wants to optimize the heck out of it
- A formal recommendation that allows for multiple levels of validation, so the implementation may, for example, only "validate" email addresses to the extent of making sure they contain an @ character (this is what the most popular Python implementation actually does, btw), but also allow a 3rd party (application or otherwise) to register additional validation logic, to whatever degree they want.

I want it to be easier for implementation authors to say "yes, we implement it up to this point, but if you want more you can add it easily".
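The layered approach described above could look roughly like the following Python sketch. FormatRegistry and all of its methods are hypothetical names invented for illustration; they are not the API of any existing implementation. The built-in "email" check only looks for an @ character, and a third party can add a stricter check or replace the stock one.

```python
from typing import Callable, Dict, List

FormatCheck = Callable[[str], bool]


class FormatRegistry:
    """Hypothetical plugin registry: each format name maps to a list of checks."""

    def __init__(self) -> None:
        self._checks: Dict[str, List[FormatCheck]] = {}

    def register(self, name: str, check: FormatCheck, replace: bool = False) -> None:
        """Add a check for a format, or replace all existing checks for it."""
        if replace:
            self._checks[name] = [check]
        else:
            self._checks.setdefault(name, []).append(check)

    def supports(self, name: str) -> bool:
        """Let callers ask whether any validation exists for a format."""
        return name in self._checks

    def check(self, name: str, value: str) -> bool:
        """Run every registered check; unknown formats pass (annotation only)."""
        return all(check(value) for check in self._checks.get(name, []))


registry = FormatRegistry()

# Minimal built-in level: an email address must contain an @ character.
registry.register("email", lambda value: "@" in value)


def has_dotted_domain(value: str) -> bool:
    """Example third-party rule (an assumption, not an RFC 5322 check)."""
    local, _, domain = value.rpartition("@")
    return bool(local) and "." in domain


# Stricter level added by the application on top of the built-in check.
registry.register("email", has_dotted_domain)
```

Passing replace=True would swap out the stock check entirely, and supports() gives an implementation a way to say how far its validation goes for a given format.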

johandorland commented 6 years ago

@handrews In that case we're on the same line. It's how I interpreted the spec already, but after reading up a bit I can imagine not everyone does.

handrews commented 6 years ago

@johandorland thanks, this will all help make the wording more clear!

Relequestual commented 6 years ago

clearer* Someone hasn't had their coffee yet huh? (I'll actually read this tomorrow!)

gregsdennis commented 6 years ago

I like the idea of supporting validation for these loose keywords via "plugins." It's not something I had considered. It'll be available in my next version!

Relequestual commented 6 years ago

On the topic of email regex, y'all gotta read https://www.regular-expressions.info/email.html

On the topic of this issue: Yeah. I really like the way ajv allows you to add or replace format validation. Explicitly stating that the application MAY do that gives people the notion it could be a good idea. Great!

gregsdennis commented 6 years ago

I'm surprised that no one's mentioned #563 yet. Seems pertinent to me.

I think implementations allowing their clients to append/modify the format validation that is available is the way to go. I'm taking a page from ajv's book and opening up the format validation so that, while I provide some stock validations, my client can define their own or even override the ones I have to suit their specific needs.

handrews commented 5 years ago

@gregsdennis if you have any concerns on the implementation requirement wording around content* or format, please add them here. I'm going to review those sections based on this issue after all of the new feature PRs are done (or at least submitted).

ebolwidt commented 5 years ago

Concern about the contentEncoding; draft-07 references RFC2045 section 6.1 which defines the following values:

"7bit" / "8bit" / "binary" / "quoted-printable" / "base64" / ietf-token / x-token

It would be good to go into more detail. "7bit" is possible in JSON but useless (a validator would have to reject any string containing a character with the high bit set); "8bit" and "binary" are not possible in JSON. "base64" is useful. "quoted-printable" is possible but is there any use for it? Extension tokens are possible.

Can we say that if contentEncoding / contentMediaType are supported by a validator, then contentEncoding MUST support base64, and it MAY support extension encodings or 7bit/quoted-printable (through plugins) and 8bit/binary are NOT ALLOWED?

handrews commented 5 years ago

@ebolwidt this keyword (and contentMediaType) is more of an annotation than a validation assertion. In theory you could try to validate it, but it's more a statement to the application of "this is what is in here, in case you want to decode it and use it". It's probably worth some clarification, but if it's not possible to use an encoding in JSON then it's just not possible - there's no coercion of content or anything like that going on.

ebolwidt commented 5 years ago

@handrews I see what you're saying. I'm not so much interested in contentMediaType validation, which is probably too broad a topic in any case - but in production applications I use base64 quite often, and validating that a string is valid base64 is useful. It seems it could be done with a regex (https://stackoverflow.com/questions/475074/regex-to-parse-or-validate-base64-data), although it's not clear whether this caters to every possible corner case - but it would be very useful to be able to easily make a validation assertion, with either contentEncoding or format, that a string should be valid base64 data, like "contentEncoding": "base64" or "format": "base64" - and there is clearly some overlap between the use cases of these two keywords.
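As an alternative to a regex, a base64 check can be sketched by simply attempting a strict decode. The following Python example uses the standard library; is_base64 is a hypothetical helper name, and the sketch assumes the standard alphabet with padding (it rejects the URL-safe alphabet and unpadded input).

```python
import base64
import binascii


def is_base64(value: str) -> bool:
    """Return True if value decodes as standard, padded base64."""
    try:
        # validate=True rejects characters outside the base64 alphabet
        # instead of silently discarding them.
        base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        # binascii.Error: bad characters or padding; ValueError: non-ASCII text.
        return False
    return True
```

A check like this could be what an implementation runs for "contentEncoding": "base64" when assertion behavior is enabled, or what an application runs itself when the keyword is treated purely as an annotation.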

handrews commented 5 years ago

@ebolwidt this is getting into a very deep topic that will be a focus in draft-09, which is what to do with the fact that the format and content* keywords definitely function as annotation keywords, but optionally function as assertions. That is a muddled mess that has caused all kinds of problems as it's just inherently confusing, and there is no way for a schema author to know whether or how well a given implementation will perform the validation.

There are some possible ideas around extending draft-08's vocabulary concept to help manage this, but it has been punted to draft-09 because it's really not obvious what the best option is. We will come back to this over the next few months.

In the meantime, if you want to guarantee validity, using pattern is probably your best option. See also #54 for an old discussion of fallback validation specifically for format.

bvisness commented 5 years ago

From what I am reading, it appears that you all may be moving away from using format as validation, but I wanted to clarify and correct something.

Currently (draft-07) the hostname format requires the format defined in RFC 1034, where labels are not allowed to begin with a digit. You wrote in the initial comment on this issue that leading digits are "sometimes ignored in practice". But leading digits are actually explicitly allowed in RFC 1123. I don't know if RFC 1123 officially updates or obsoletes RFC 1034, but leading digits are certainly common in practice, and RFC 5322 (the email spec, upon which the email format depends) depends on both RFC 1034 and RFC 1123.

Regardless of where you end up going with format, if any definition for hostname sticks around, it would be wise to include RFC 1123 in its definition.

(As for why I care, my company had code using a JSON schema validator library blow up today because we have customers with leading digits in their email addresses. Our library had used their hostname validator for the domain part of the email validator, but their hostname validator was only compliant with RFC 1034. Technically this should have been allowed by their email address validator, but the inconsistency in the spec caused an unfortunate shortcut.)
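The difference between the two RFCs comes down to the first character of each label, which the Python sketch below tries to show. The regexes are a simplified reading of the label syntax (letters, digits, and hyphens, at most 63 characters, no leading or trailing hyphen) and the names are hypothetical; total-length limits and internationalized names are deliberately left out.

```python
import re

# RFC 1034 preferred syntax: a label must start with a letter.
RFC1034_LABEL = re.compile(r"[A-Za-z]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?")

# RFC 1123 relaxation: a label may start with a letter or a digit.
RFC1123_LABEL = re.compile(r"[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_hostname(value: str, label_re: "re.Pattern[str]") -> bool:
    """Check every dot-separated label against the given label rule."""
    if not value:
        return False
    return all(label_re.fullmatch(label) for label in value.split("."))


def is_email_domain_ok(address: str, label_re: "re.Pattern[str]") -> bool:
    """Reuse the hostname check for the domain part, as in the report above."""
    _, _, domain = address.rpartition("@")
    return is_hostname(domain, label_re)
```

An address such as "user@123mail.example" fails is_email_domain_ok under RFC1034_LABEL but passes under RFC1123_LABEL, which is the kind of mismatch described in the comment above.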

handrews commented 5 years ago

@bvisness I haven't the foggiest idea why RFC 1123 wasn't referenced; that predates the involvement of any of the current spec maintainers. That's easily fixed in the next draft.

As far as "moving away from using format as validation", format has never been reliable for validation. Which is a mess, and we're trying to make it less of a mess. That might involve a way to require format validation (and fail fast if it is not available), or might involve saying that the spec does not include format validation at all to make it clear.

Basically, "optional validation" is horribly confusing, but for some formats (notably email) truly reliably validating them is quite burdensome.

bvisness commented 5 years ago

Makes sense. Honestly I think I would prefer if JSON schema didn't have any format validation at all - I might even prefer if it didn't have format at all. For JSON schema to enforce arbitrary string formats feels like scope creep of the highest degree (although I think pattern is good and useful).

Obviously, as a user, it's convenient to have someone else do all the validation work for you. But since it appears the long-term "fix" is all this meta-schema and $vocabulary stuff (which, frankly, sounds bonkers), it looks to my untrained eye like you'd be better off requiring users to do their own validation. I would really prefer that JSON schema remains concerned with the structure of JSON documents instead of the infinite ways you could encode data into a string.

handrews commented 5 years ago

@bvisness the idea is that the majority of people will just keep referencing meta-schemas with $schema exactly like they do now, and only a few people (who write the meta-schemas) will have to understand $vocabulary. So it is not held to quite the same usability standards as regular schema keywords.

Presumably, if you want to make up your own keywords and have other people implement them the same way, you're willing to dig deeper into how it all works. If you want to make up keywords and don't care if anyone else understands them, you can keep doing that how you do it now (basically, hardcode stuff in a private implementation).

Classifying format as an annotation (data passed back to the caller and associated with the instance location) instead of an annotation and optional validation would be a lot more predictable.
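To illustrate what "data passed back to the caller and associated with the instance location" might mean, here is a deliberately tiny Python sketch. It is not a JSON Schema implementation: collect_format_annotations is a hypothetical function that understands only "properties" and "format", and it builds location strings without JSON Pointer escaping.

```python
from typing import Any, Dict, List, Tuple


def collect_format_annotations(
    schema: Dict[str, Any], instance: Any, location: str = ""
) -> List[Tuple[str, str]]:
    """Return (instance location, format value) pairs without validating anything."""
    annotations: List[Tuple[str, str]] = []

    # format applies to strings; record it as an annotation, never as a failure.
    if isinstance(instance, str) and "format" in schema:
        annotations.append((location, schema["format"]))

    # Recurse into object properties that are present in the instance.
    if isinstance(instance, dict):
        for name, subschema in schema.get("properties", {}).items():
            if name in instance:
                annotations.extend(
                    collect_format_annotations(
                        subschema, instance[name], f"{location}/{name}"
                    )
                )
    return annotations


schema = {"properties": {"contact": {"format": "email"}, "age": {"type": "integer"}}}
instance = {"contact": "not-an-email", "age": 3}
result = collect_format_annotations(schema, instance)
```

Here result pairs "/contact" with "email" even though the value is not a plausible address; deciding whether and how strictly to check it is left to the caller.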

Relequestual commented 5 years ago

Having reviewed this and taken another look at #54, I think THAT issue might be a way to resolve this, but it's going to require a lot more sounding out and chatting to implementers and schema authors than our schedule for draft-8 allows.

As such, I feel this should be shifted to draft-9, but with the assurance that draft-9 will look for a well-considered general consensus solution.

handrews commented 5 years ago

OK I'm going to do something about this leveraging vocabularies (sorry @Relequestual, I know I pushed you to move it out to draft-09, but I think my original intentions here need to be handled now). Other folks added a bunch of stuff here, and if those are still relevant after draft-08 goes out they will need to be filed separately.

handrews commented 5 years ago

#764 and #767 together fixed this.

json-schema-org / json-schema-spec

Explain format (and content*) more clearly #646
