Closed handrews closed 5 years ago
The spec could make it even clearer that `format` is best-effort only, as almost every RFC it mentions is a bottomless pit of complexity. Having said that, end users rarely read the JSON Schema RFC, and `format` is a pretty useful tool for validating common patterns, which is why it gets implemented fairly often. If you move this responsibility from the implementation to the application, it makes JSON Schema less powerful and puts an extra burden on the developer. The current situation, where we get bug reports about not strictly implementing some archaic 1990s RFC, is annoying, but it's not that bad.
As for `contentMediaType` and `contentEncoding`, I agree. The scope of those two is just too broad, way broader than `format`, and basically impossible to implement properly for validation purposes. Sure, it's not that hard to check whether a `base64`-encoded string is valid `application/json`, but I hope nobody expects their JSON Schema implementation to verify whether something is indeed a properly encoded WebM video.
@johandorland for `format`, I need to figure out the right wording, as "the application" could actually just be a set of plugins (or even built-in code) that are part of the implementation. They don't have to be literally part of the application that uses the implementation. Many implementations support `format` via plugins already. Mostly, I want to encourage:
`@` character (this is what the most popular Python implementation actually does, btw), but also allow a 3rd party (application or otherwise) to register additional validation logic, to whatever degree they want.
I want it to be easier for implementation authors to say "yes, we implement it up to this point, but if you want more you can add it easily".
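The plugin idea described above could look roughly like the following. This is a minimal sketch, not any real library's API: `FormatRegistry`, `register`, and `check` are hypothetical names, and the built-in `email` check is just the "look for an `@`" baseline mentioned in the comment.

```python
from typing import Callable, Dict

FormatCheck = Callable[[str], bool]


class FormatRegistry:
    """Hypothetical registry of best-effort format checks."""

    def __init__(self) -> None:
        # Deliberately shallow built-in: only require an "@" for "email".
        self._checks: Dict[str, FormatCheck] = {
            "email": lambda value: "@" in value,
        }

    def register(self, name: str, check: FormatCheck) -> None:
        """Add a new format check or replace an existing one."""
        self._checks[name] = check

    def check(self, name: str, value: str) -> bool:
        """Run the check for `name`; unknown formats pass (best effort)."""
        check = self._checks.get(name)
        return True if check is None else check(value)


registry = FormatRegistry()

# An application that wants more than the built-in check can replace it.
# This stricter rule is only an illustration, not a full email grammar.
def stricter_email(value: str) -> bool:
    local, sep, domain = value.rpartition("@")
    return bool(sep) and bool(local) and "." in domain

registry.register("email", stricter_email)
```

With this shape, an implementation author can document exactly how far the built-in checks go, and anything beyond that is the responsibility of whoever registers a replacement.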
@handrews In that case we're on the same page. It's how I interpreted the spec already, but after reading up a bit I can imagine not everyone does.
@johandorland thanks, this will all help make the wording more clear!
clearer* Someone hasn't had their coffee yet huh? (I'll actually read this tomorrow!)
I like the idea of supporting validation for these loose keywords via "plugins." It's not something I had considered. It'll be available in my next version!
On the topic of email regex, y'all gotta read https://www.regular-expressions.info/email.html
On the topic of this issue: Yeah. I really like the way ajv allows you to add or replace format validation. Explicitly stating that the application MAY do that gives people the notion it could be a good idea. Great!
I'm surprised that no one's mentioned #563 yet. Seems pertinent to me.
I think implementations allowing their clients to append/modify the format validation that is available is the way to go. I'm taking a page from ajv's book and opening up the format validation so that, while I provide some stock validations, my client can define their own or even override the ones I have to suit their specific needs.
@gregsdennis if you have any concerns on the implementation requirement wording around `content*` or `format`, please add them here. I'm going to review those sections based on this issue after all of the new feature PRs are done (or at least submitted).
Concern about `contentEncoding`: draft-07 references RFC 2045 section 6.1, which defines the following values:
"7bit" / "8bit" / "binary" / "quoted-printable" / "base64" / ietf-token / x-token
It would be good to go into more detail. "7bit" is possible in JSON but useless (validation would have to reject any string containing a character with the high bit set); "8bit" and "binary" are not possible in JSON. "base64" is useful. "quoted-printable" is possible, but is there any use for it? Extension tokens are possible.
Can we say that if contentEncoding / contentMediaType are supported by a validator, then contentEncoding MUST support base64, and it MAY support extension encodings or 7bit/quoted-printable (through plugins) and 8bit/binary are NOT ALLOWED?
@ebolwidt this keyword (and `contentMediaType`) is more of an annotation than a validation assertion. In theory you could try to validate it, but it's more a statement to the application of "this is what is in here, in case you want to decode it and use it". It's probably worth some clarification, but if it's not possible to use an encoding in JSON then it's just not possible; there's no coercion of content or anything like that going on.
@handrews I see what you're saying. I'm not so much interested in `contentMediaType` validation, which is probably too broad a topic in any case, but in production applications I use base64 quite often, and validating that a string is valid base64 is useful. It seems it could be done with a regex (https://stackoverflow.com/questions/475074/regex-to-parse-or-validate-base64-data), although it's not clear whether that caters to every possible corner case. It would be very useful to be able to assert easily, with either `contentEncoding` or `format`, that a string should be valid base64 data, like `"contentEncoding": "base64"` or `"format": "base64"`, and there is clearly some overlap between the use cases of these two keywords.
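As an alternative to a regex, an application can check base64 validity by actually attempting a strict decode. The following is a minimal sketch using only Python's standard library; `is_valid_base64` is a hypothetical helper name, and it assumes the standard alphabet with padding (not the URL-safe variant).

```python
import base64
import binascii


def is_valid_base64(value: str) -> bool:
    """Return True if `value` decodes as standard, padded base64."""
    try:
        # validate=True rejects characters outside the base64 alphabet
        # instead of silently discarding them.
        base64.b64decode(value, validate=True)
        return True
    except (binascii.Error, ValueError):
        # binascii.Error covers bad padding or alphabet;
        # ValueError covers non-ASCII characters in a str argument.
        return False


print(is_valid_base64("aGVsbG8="))     # padded encoding of b"hello"
print(is_valid_base64("aGVsbG8"))      # same data with the padding removed
print(is_valid_base64("not base64!"))  # characters outside the alphabet
```

Decoding costs more than a pattern match, which is part of why this kind of check tends to live in the application rather than in a generic validator.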
@ebolwidt this is getting into a very deep topic that will be a focus in draft-09: what to do about the fact that the `format` and `content*` keywords definitely function as annotation keywords, but only optionally function as assertions. That is a muddled mess that has caused all kinds of problems, as it's inherently confusing, and there is no way for a schema author to know whether or how well a given implementation will perform the validation.
There are some possible ideas around extending draft-08's vocabulary concept to help manage this, but it has been punted to draft-09 because it's really not obvious what the best option is. We will come back to this over the next few months.
In the meantime, if you want to guarantee validity, using `pattern` is probably your best option. See also #54 for an old discussion of fallback validation specifically for `format`.
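To illustrate the `pattern` suggestion, here is a hedged sketch of a schema fragment that asserts base64 shape directly. The regular expression is one commonly used form for the standard alphabet with mandatory padding and is an assumption, not something the spec defines; `matches_pattern` is a hypothetical stand-in for a real validator and only mimics how `pattern` is applied (as an unanchored search).

```python
import re

# Assumed regex: groups of four base64 characters, with an optional
# final group padded by "=" or "==". The anchors are explicit because
# "pattern" is not implicitly anchored.
BASE64_PATTERN = (
    r"^(?:[A-Za-z0-9+/]{4})*"
    r"(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$"
)

schema = {
    "type": "string",
    "contentEncoding": "base64",  # annotation for the application
    "pattern": BASE64_PATTERN,    # assertion any validator enforces
}


def matches_pattern(schema: dict, value: str) -> bool:
    """Mimic the "pattern" keyword with Python's re module.

    Note: Python's "$" also matches just before a trailing newline,
    which differs from the ECMA 262 dialect JSON Schema refers to.
    """
    return re.search(schema["pattern"], value) is not None


print(matches_pattern(schema, "aGVsbG8="))
print(matches_pattern(schema, "aGVsbG8"))
```

Keeping both keywords in the schema means the intent is still visible to applications that read `contentEncoding`, while the guarantee comes from `pattern`.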
From what I am reading, it appears that you all may be moving away from using `format` as validation, but I wanted to clarify and correct something.
Currently (draft-07) the `hostname` format requires the format defined in RFC 1034, where labels are not allowed to begin with a digit. You wrote in the initial comment on this issue that leading digits are "sometimes ignored in practice". But leading digits are actually explicitly allowed in RFC 1123. I don't know if RFC 1123 officially updates or obsoletes RFC 1034, but leading digits are certainly common in practice, and RFC 5322 (the email spec, upon which the `email` format depends) depends on both RFC 1034 and RFC 1123.
Regardless of where you end up going with `format`, if any definition for `hostname` sticks around, it would be wise to include RFC 1123 in its definition.
(As for why I care, my company had code using a JSON schema validator library blow up today because we have customers with leading digits in their email addresses. Our library had used their hostname validator for the domain part of the email validator, but their hostname validator was only compliant with RFC 1034. Technically this should have been allowed by their email address validator, but the inconsistency in the spec caused an unfortunate shortcut.)
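The difference between the two RFCs can be made concrete with a small sketch. This is an illustrative check, not any library's actual validator: `is_hostname` and its `allow_leading_digit` flag are hypothetical names, and the only difference between the two label patterns is whether the first character may be a digit.

```python
import re

# RFC 1034 preferred syntax: a label starts with a letter.
LABEL_RFC1034 = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")
# RFC 1123 relaxation: a label may start with a letter or a digit.
LABEL_RFC1123 = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_hostname(value: str, allow_leading_digit: bool = True) -> bool:
    """Check label syntax (max 63 chars each) and an overall length cap.

    A trailing dot is rejected here because it yields an empty label;
    that is a simplification made for this sketch.
    """
    if not value or len(value) > 253:
        return False
    label_re = LABEL_RFC1123 if allow_leading_digit else LABEL_RFC1034
    return all(label_re.fullmatch(label) for label in value.split("."))


print(is_hostname("1example.com"))                             # RFC 1123 rules
print(is_hostname("1example.com", allow_leading_digit=False))  # RFC 1034 rules
```

Under the relaxed rule this sketch also accepts dotted-decimal strings such as `127.0.0.1`, which is the overlap with `ipv4` noted in the original issue text.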
@bvisness I haven't the foggiest idea why RFC 1123 wasn't referenced, that predates the involvement of any of the current spec maintainers. That's easily fixed in the next draft.
As far as "moving away from using `format` as validation", `format` has never been reliable for validation. Which is a mess, and we're trying to make it less of a mess. That might involve a way to require format validation (and fail fast if it is not available), or might involve saying that the spec does not include format validation at all to make it clear.
Basically, "optional validation" is horribly confusing, but for some formats (notably email) truly reliably validating them is quite burdensome.
Makes sense. Honestly I think I would prefer if JSON schema didn't have any format validation at all. I might even prefer if it didn't have `format` at all. For JSON schema to enforce arbitrary string formats feels like scope creep of the highest degree (although I think `pattern` is good and useful).
Obviously, as a user, it's convenient to have someone else do all the validation work for you. But since it appears the long-term "fix" is all this meta-schema and `$vocabulary` stuff (which, frankly, sounds bonkers), it looks to my untrained eye like you'd be better off requiring users to do their own validation. I would really prefer that JSON schema remains concerned with the structure of JSON documents instead of the infinite ways you could encode data into a string.
@bvisness the idea is that the majority of people will just keep referencing meta-schemas with `$schema` exactly like they do now, and only a few people (who write the meta-schemas) will have to understand `$vocabulary`. So it is not held to quite the same usability standards as regular schema keywords.
Presumably, if you want to make up your own keywords and have other people implement them the same way, you're willing to dig deeper into how it all works. If you want to make up keywords and don't care if anyone else understands them, you can keep doing that how you do it now (basically, hardcode stuff in a private implementation).
Classifying `format` as an annotation (data passed back to the caller and associated with the instance location) instead of an annotation and optional validation would be a lot more predictable.
Having reviewed this and taken another look at #54, I think THAT issue might be a way to resolve this, but it's going to require a lot more sounding out and chatting to implementers and schema authors than our schedule for draft-8 allows.
As such, I feel this should be shifted to draft-9, but with the assurance that draft-9 will look for a well-considered general consensus solution.
OK I'm going to do something about this leveraging vocabularies (sorry @Relequestual, I know I pushed you to move it out to draft-09, but I think my original intentions here need to be handled now). Other folks added a bunch of stuff here, and if those are still relevant after draft-08 goes out they will need to be filed separately.
[this is a bit stream-of-consciousness, but I wanted to get it filed because I keep forgetting- we'll clean up the ideas here on the way to PRs]
`format` confuses pretty much everyone. I have noticed people filing issues against various implementations complaining of imperfect enforcement (I believe @Julian has received complaints about "email", and @johandorland about "hostname", and I suspect they are not alone).
`format`, `contentMediaType`, and `contentEncoding` are essentially best effort validation keywords in practice. Many if not most implementations make at least some effort to validate `format`. I'm not sure if anything attempts that for `content*` as they are new (at least as part of the validation spec), and they would essentially require parsing the string encoding and media type, which is potentially very expensive.
Complicating the matter for `format` is the fact that many of the relatively fundamental internet-related formats such as "email" and "hostname" are very old, and conformance to specifications is rather complicated.
For "hostname", RFC 1034 forbids leading digits, but this is sometimes ignored in practice, leading to ambiguous overlap with "ipv4" as a format. In practice, most programs that accept hostnames will also accept ipv4 addresses and just recognize that no DNS resolution is required, so this is rarely a concern.
The difficulty of validating email addresses, even on the syntactical level, is well-documented (try finding a regular expression that will do it, for instance, and if you find an actual iron-clad one, let me know).
Leveraging our relatively recent keyword classification work, I think it is best to classify these primarily as annotations rather than treating them as some sort of hybrid annotation+assertion. Annotations can specify any intent, including semantic validation or parsing instructions. The specification should provide guidance on how an implementation might directly offer handlers for such intents, and how to indicate the available level of support.
Applications can, as with any annotation, then perform additional processing if the implementation either does not offer any validation, or offers only incomplete validation. The spec already says that implementations SHOULD offer an ability to turn semantic validation off, so we can extend that guidance (probably at the MAY level) to cover situations like allowing hooks for application-defined processing in addition to or in place of implementation-supplied validation.
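The annotation-first approach described above might be sketched as follows. This is a toy collector for illustration only: it understands just `properties`, `items`, and `format`, the function names are hypothetical, and real annotation collection in an implementation is considerably more involved.

```python
from typing import Any, Dict


def _escape(token: str) -> str:
    """Escape a key for use in a JSON Pointer (RFC 6901)."""
    return token.replace("~", "~0").replace("/", "~1")


def collect_format_annotations(
    schema: Dict[str, Any], instance: Any, pointer: str = ""
) -> Dict[str, str]:
    """Map instance locations to the `format` value that applies there."""
    found: Dict[str, str] = {}
    if isinstance(instance, str) and "format" in schema:
        found[pointer] = schema["format"]
    if isinstance(instance, dict):
        for key, subschema in schema.get("properties", {}).items():
            if key in instance:
                found.update(collect_format_annotations(
                    subschema, instance[key], f"{pointer}/{_escape(key)}"))
    if isinstance(instance, list) and isinstance(schema.get("items"), dict):
        for index, item in enumerate(instance):
            found.update(collect_format_annotations(
                schema["items"], item, f"{pointer}/{index}"))
    return found


schema = {
    "type": "object",
    "properties": {
        "contact": {"type": "string", "format": "email"},
        "hosts": {
            "type": "array",
            "items": {"type": "string", "format": "hostname"},
        },
    },
}
instance = {"contact": "user@example.com", "hosts": ["a.example", "1b.example"]}

annotations = collect_format_annotations(schema, instance)
# The application decides what, if anything, to do with each location.
for location, format_name in sorted(annotations.items()):
    print(location, format_name)
```

The collector never rejects anything; whether `/hosts/1` is an acceptable hostname is left to whatever processing the application (or an implementation-supplied handler) chooses to run on the collected annotations.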
And of course, all of this is dependent on an implementation supporting annotations. As with the `additionalProperties` and `additionalItems` keywords (now in the core spec and defined in terms of annotation collection), the spec should allow existing implementations that do not implement general annotation collection to remain valid and in conformance.