json-schema-org / json-schema-spec

The JSON Schema specification
http://json-schema.org/
Other
3.82k stars 266 forks source link

Improved handling of unknown properties #1365

Closed awwright closed 5 months ago

awwright commented 1 year ago

There are many applications where a schema is distributed for use by numerous, arbitrary validators. For example, an API describes to clients a range of values that are acceptable as input, or to describe exception cases that require special handling. In these applications, it’s important that all validators agree on the set of instances that a particular schema accepts and rejects. (It goes without saying by “validator” I mean “compliant validator.”)

However, JSON Schema has one quirk in particular that prevent this from being possible: Unknown keywords are treated if they didn’t exist. For a constraint-based system, this is a serious problem when a new keyword is introduced. If a schema author uses a newly standardized validation keyword, it’s likely for an important reason that should not be ignored—which is what older validators will do. While it should be possible to write keywords that are never ignored, this is not actually possible.

There’s two obvious ways to support this application; one is to introduce a scheme so that some properties in an object can be designated “must-understand” and require validators to implement this. The other is to report/handle keywords that are unknown, so the application can react appropriately.

The $vocabulary and $schema keywords offer partial solutions of the first class, but they are not complete. These keywords are a level of abstraction removed from the keywords the schema actually requires for correct processing; as a consequence, in the worst case, you can have two validators that cannot handle any schemas in common: you may have situations where two clients cannot be supported at the same time, even if their behavior is the same.

It seems that two schemas which have the same behavior, should also be supported by the same validators. So a general solution to this problem is likely to be a of the second class: Don’t require unknown keywords be ignored; but define some option to handle them unless they’re known to not be validation keywords (i.e. they are known to never reject an instance).

That is, the specification should describe how to identify those keywords that are unknown; and describe special cases where unknown keywords should not be ignored, but instead can be reported, raise an exception, or (if the details are unimportant) simply reject. Most validators can suffice with a runtime option: "Unknown keyword behavior: accept/error/reject". (The behavior currently required of implementations would be the "accept" option.)

I carefully word it this way so that, at least for now, there is no change in how schemas are processed (by default); rather this accommodates a class of applications that can enable stricter handling when that's what the application requires, and the impact of setting “error” as the default behavior can be better analyzed. (Though this can be decided now too, if some need presents itself.)


This is the first in a series of changes to implement JSON Schema Extensions, including:

It’s possible It may be preferable to implement some of these first, since that would lower the potential impact of this change.

Julian commented 1 year ago

Was this meant for the spec repo? I assume so given it's describing things which aren't yet in the spec?

gregsdennis commented 1 year ago

Also ~potentially~ very closely related to https://github.com/orgs/json-schema-org/discussions/241

awwright commented 1 year ago

@Julian Yes, it's proposing a change to the JSON Schema Core document, the next step is to draft some specific patches/PRs; I write an issue first to try to limit the scope of that patch and gather feedback.

@gregsdennis Right this is a good point, and I really should have commented there sooner. This issue would implement most the ideas off of that discussion thread, including standardizing a way of classifying all of the keywords that appear in a schema, though not (yet) going as far as prohibiting non-vocabulary keywords across the board.

Julian commented 1 year ago

OK, transferring it there then, as I don't see what it has anything to do with the test repo. Sounds like you're agreeing that's where you meant to put it?

awwright commented 1 year ago

(Did I have the wrong tab open or something? I'm quite embarrassed I didn't catch that, thank you.)

Julian commented 1 year ago

:D all good -- figured that was all it was, just wanted to make sure I wasn't dreaming.

gregsdennis commented 1 year ago

In these applications, it’s important that all validators agree on the set of instances that a particular schema accepts and rejects.

You're very keen on using this "set" language, but I'm not sure I've encountered that from others or in the spec. Is this something that you want to introduce? I'm not sure that most readers will find thinking of open-ended data (as JSON is) in terms of sets. (FWIW, I've heard this language being used by the WG for JSON Path.)

jdesrosiers commented 1 year ago

My impression is that we already have consensus that we want to make this change. The significant difference in this proposal is that forbidding unknown keywords is off by default. The way we've discussed it elsewhere, forbidding unknown keywords would be on by default or required. If we allow unknown keywords by default, we can't say our releases are forward-compatible by default. Users have to configure something to ensure that their schemas are forward-compatible. Forbidding unknown keywords by default is the kind of a backward-incompatible change I think we should be willing to make so we can make this forward compatibility guarantee from the next release forward.

awwright commented 1 year ago

You're very keen on using this "set" language, but I'm not sure I've encountered that from others or in the spec.

I use "set" in the mathematical sense and this allows for some brevity... by saying "the sets ought to be the same," I'm saying "an instance that's valid by one implementation should be valid if-and-only-if it's valid in all the other implementations"

Is this something that you want to introduce? I'm not sure that most readers will find thinking of open-ended data (as JSON is) in terms of sets. (FWIW, I've heard this language being used by the WG for JSON Path.)

To talk about "all of the instances that match these given conditions", I think the best term is "set". Most of the sets will be infinite, but this is typical in math and formal languages, which is essentially what JSON Schema is. It's true that all computers are finite, so while "Set" the C++ memory structure might quickly run out of RAM, other techniques like generators can run past the heat death of the universe before ever running out of RAM.

awwright commented 1 year ago

My impression is that we already have consensus that we want to make this change.

The devil is in the details, and I'm continuing to pile on more details, I'll be filing some exact language next. And this is also the first core-spec issue filed about this, which is a minimum requirement for gathering consensus from the general public.

The significant difference in this proposal is that forbidding unknown keywords is off by default. The way we've discussed it elsewhere, forbidding unknown keywords would be on by default or required.

I'm avoiding saying it ought to be one way or the other at this point; because to do that, we have to answer "what does it even mean to have an unknown keyword" and standardize that. If it turns out adding "... and error" is a really obvious and logical thing then sure.

But particularly, 180-ing any requirement from "must ignore" to "must not ignore" should be viewed with suspicion. I think the solution will involve some middle ground where implementations can add "strict mode" as an opt-in feature, and then make that the default behavior after a major version/breaking release (but can be switched off by the user as necessary).

jdesrosiers commented 1 year ago

I didn't mean to imply that the issue was unwanted. Although this has been discussed in a couple different places already (https://github.com/orgs/json-schema-org/discussions/234#discussioncomment-3664363, https://github.com/orgs/json-schema-org/discussions/241), this is the only place that addresses just that issue. In the other places it was a part of a larger discussion. However, it would have been nice if this issue acknowledged that those discussions had taken place and summarized the result of those discussions.

In any case, my intention was to call out that we probably don't need to rehash the whole discussion and to point out the differences in what is proposed and what has previously been discussed so we can focus discussion on resolving those differences.

180-ing any requirement from "must ignore" to "must not ignore" should be viewed with suspicion.

I completely agree and that was brought up in the linked discussions. The people who participated in those discussions agreed that it was worth making the change so we could provided a forward compatibility guarantee. Of course I'm always willing to revisit a discussion when more voices join in.

and then make that the default behavior after a major version/breaking release

I know you think of it differently, but the majority of us consider that our releases are not compatible and are all effectively major version/breaking releases. However, we do all agree that we want compatible releases and are working to have that starting in the next release. So, the next release is a major version/breaking release with the intention of it being the last time we do that. Now is our last chance to make breaking changes and I think it's worth it to be able to provide forward compatibility guarantees.

gregsdennis commented 1 year ago

I use "set" in the mathematical sense and this allows for some brevity...

I understand that's your use. My point is that I don't think most readers understand sets. It's not language that's currently in the spec, and I question its introduction. If we agree that this language paradigm is fine, then sure. But for now, I think it's best to stick with the concepts we already have until they are no longer sufficient to express our meaning.

I'm definitely for the topic of this (i.e. disallowing unknown keywords). I'm just quibbling over the verbiage.

I'm also happy for this issue to be the "unknown keywords" issue associated with that discussion.