Standardize implementation-defined behavior

json-schema-org / community

A space to discuss community and organisational related things

83 stars 34 forks source link

Standardize implementation-defined behavior #189

Closed awwright closed 2 years ago

awwright commented 2 years ago

The specification currently specifies that schemas without any "$schema" keyword are "implementation defined" (emphasis added):

The "$schema" keyword SHOULD be used in the document root schema object, and MAY be used in the root schema objects of embedded schema resources. It MUST NOT appear in non-resource root schema objects. If absent from the document root schema, the resulting behavior is implementation-defined.

This under-constrains the validation behavior and permits behavior that I would never expect to be possible (it wouldn't be wrong to say that {"type":"string"} permits null). And it provides no guarantees about reverse-compatibility; updates to the meta-schema that are reverse-incompatible will have broad effects. (That is to say, dialects don't actually achieve the intended goal of forward compatibility, it just pushes this responsibility onto implementations, probably in platform-specific ways.)

Further, there's at least a few implementations that do not read "$schema" and even if they should, requiring this keyword would be impractical. It's perfectly clear what is meant when someone writes {"type":"string"} without any context.

It's understood that implementations aren't always up-to-date on a spec; that some implementations only follow older versions of a spec normally goes without saying. When we say that documents without "$schema" is implementation-defined behavior, we're either being redundant, or we're saying something more than that.

The "$schema" keyword is, of course, useful. It can declare a subset of the JSON Schema vocabulary (e.g. some document databases might not be able to implement the full vocabulary); or user or implementation-specific keywords (vocabularies). And it can be used as a heuristic to decide if older behavior for a keyword should be used.

But it can't solve all versioning problems. A custom meta-schema might not define compatibility with any draft/release of JSON Schema. We still need to define the behavior for these situations.

By comparison, text/html and application/xhtml+xml have a single document that defines how to interpret all documents, even ones marked with an older version number. Some HTML versions allow you to specify a DTD, but these restrict the elements you're allowed to use, they aren't required, they don't actually change the semantics of the elements, and their omission has a standardized behavior.

A simple test for potential solutions is: null (and other values) cannot be valid against {"type":"string"}. Currently the behavior is under-constrained (under-specified) and there's nothing to say that this would be wrong.

Potential solutions:

Error on a missing $schema keyword: This would break a very large number of existing implementations and documents. null wouldn't be valid, but it wouldn't necessarily be invalid either. (And even if we were authoring from scratch, requiring version identifiers in documents is strongly discouraged in Internet media types, and very unpopular among document authors.)
Implementations assume the latest known, standard $schema: This would give stronger guarantees about cross-platform compatibility (presumably, "type" would always mean the same thing). However, this would defeat the intention of "$schema" as a dialect identifier. If an implementation updates its default "$schema", this would break reverse compatibility if the new dialect is not reverse-compatible. (This is also true wherever the behavior is undefined or implementation-defined.)
Single media type specification: Newer drafts replace older drafts in their entirety, though implementations may choose to implement older behavior for reverse compatibility reasons. Because the URI of the meta-schema changes with every draft, "$schema" can be used as a heuristic to determine if a schema is expecting an older, superseded behavior. All releases would have to be reverse compatible, or at least, changes would have to be carefully weighed, especially if there's orthogonal implementations. This was the behavior through draft-07.

gregsdennis commented 2 years ago

I think the intent in that specific paragraph is to say that an implementation can choose which draft it will use to process the schema. I expect most will choose the latest that they support.

Relequestual commented 2 years ago

This makes sense, it might not make sense to proscribe a fatal error for every situation.

My preference would be to change what's defined here. What I'd really like is for hard error if $schema is not provided, however I understand (with frustration) that it might be impractical to require this.

However, I think we COULD say that, the implementation MUST select a dialect it knows about as default, which SHOULD be a JSON Schema org defined dialect.

Additionally I'd like to add that if the dialect is chosen by the implementation (And not via $schema or some other user provided means), then they SHOULD throw a warning (as appropriate for the language) if deprecated keywords are used.

This would be more impacting tools used to write schemas with immediate feedback. I'm not fully aware of the effects this could have for general purpose applications.

awwright commented 2 years ago

I added an abstract in the OP, but I'm not sure if that actually clarified what I'm looking for. I'll try to revise it as I better understand the problem space we're dealing with. (Edit: I ended up rewriting most of the OP.)

My preference would be to change what's defined here. What I'd really like is for hard error if $schema is not provided, however I understand (with frustration) that it might be impractical to require this.

Yes, the behavior should be better specified, and not left undefined. However I'm not sure an error is reasonable. There's no situation where {"type":"string"} should be an error, it's perfectly obvious to me what's being asked for there.

the implementation MUST select a dialect it knows about as default, which SHOULD be a JSON Schema org defined dialect.

This is more reasonable. Implementations shouldn't be allowed to pick any arbitrary behavior on their whim, the selected behavior should be unsurprising.

However I don't think "dialects" are a good way to reason about this. Media types are a way to version.

We're currently having this debate over the application/problem+json media type, how do we add new standard keywords in a way that's backward-compatible; the official spec says this would require a new media type. But due to a concern over media type proliferation, we seem to have agreed that we can define a prefix that specifies new global keywords going forward (maybe * which seems to be unused in the wild).

Relequestual commented 2 years ago

However I don't think "dialects" are a good way to reason about this. Media types are a way to version.

Be that as it may, JSON Schema has MANY uses outside of HTTP requests. And, regardless, JSON Schema defines "what set of stuffs" to use via the dialect identifier.

awwright commented 2 years ago

Be that as it may, JSON Schema has MANY uses outside of HTTP requests.

Sure, though this doesn't necessarily preclude using HTTP and Internet features. I wasn't seriously suggesting this as a solution, but it's interesting to think about.

And, regardless, JSON Schema defines "what set of stuffs" to use via the dialect identifier.

I think I understand your position now, but I still have questions and problems that need addressing:

This issue—the dialect identifier is not required, and cannot be assumed in a cross-platform, backward-compatible way.
Each dialect/meta-schema seems to be co-equal with all the others, but there's nothing stating this. Older drafts remain somewhat hidden. If true, this would contradict the usual rule that newer drafts replace older ones. (In my understanding, validators can implement older behavior, but for the purpose of preventing breakage, not because older drafts have the same endorsement as newer ones.)
It is not clear that $schema can be used to change the behavior of $ref: they're both core keywords; they're not defined as part of the dialect. If maintaining strict backward-compatibility in a cross-platform way is important, we would have to write the older behavior into the specification, as well as the mechanism for detecting when to use it.

jdesrosiers commented 2 years ago

The example that { "type": "string" } should always mean the same thing no matter which dialect is used is not correct. The vocabulary system allows you to define your own keywords even ones who's names and semantics conflict with official vocabularies. I can create a dialect that replaces the official validation vocabulary (that includes the type keyword) with my own validation vocabulary and that defines a type keyword that means something very different. A very realistic example might be a dialect who wants to define more specific types such as int32 and int64 instead of int.

So, an implementation needs to know or assume a dialect somehow to correctly evaluate even the simplest schema. The vocabulary system makes just about anything possible.

In my implementation, I require that a dialect is declared somehow. It doesn't have to be with $schema. There are multiple ways you can declare it, but if you don't it's an error. There's no way to safely assume what dialect was intended.

awwright commented 2 years ago

@jdesrosiers But I didn't say anything to the effect of "no matter which dialect is used"; the example { "type": "string" } has no dialect (or at least no explicit one). In this situation, it cannot be true that null could be valid. This is a perfectly valid schema, and we can't break support for it.

jdesrosiers commented 2 years ago

@awwright I'm not sure what distinction you're trying to make. Every schema is written in some dialect whether that dialect is declared or not. If the dialect is not declared, an implementation may choose one to use to interpret the schema. That dialect does not need to be an official JSON Schema dialect. It could be a dialect where type is just an annotation. In that case, an implementation could evaluate { "type": "string" } against null and not be wrong.

This is a perfectly valid schema, and we can't break support for it.

This is a false assumption. We've made backwards incompatible changes in almost every release. We've never made any guarantees that any keyword will always work a certain way in every dialect past, current, or future. The vocabulary system allows users to create dialects that do almost anything and they are not required to be backwards compatible with official JSON Schema releases.

awwright commented 2 years ago

Every schema is written in some dialect whether that dialect is declared or not

The idea of "dialect" was comparatively recently introduced. E.g. { "type": "string" } is one of the most common schemas in existence. Regardless of how they are technically handled today, they were not authored with the understanding of a dialect (they could not have been). These are the schemas we cannot break.

That dialect does not need to be an official JSON Schema dialect. It could be a dialect where type is just an annotation.

Ok, but { "type": "string" } is obviously not one of these cases. The author clearly intended the standardized meaning.

We've made backwards incompatible changes in almost every release.

I've addressed this; we make very few changes that actually force implementations to change behavior in a breaking way; and only in very well-researched situations (like $ref).

If you mean how we occasionally remove behavior: HTTP, email, etc, also remove behavior in every new release, that's not the same thing as "backwards incompatible" (removing something from a tech spec usually doesn't force implementations to change their behavior).

The vocabulary system allows users to create dialects that do almost anything and they are not required to be backwards compatible with official JSON Schema releases.

Ok, but I'm not talking about custom dialects/vocabularies/meta-schemas. The question is: What happens in the default case?

jdesrosiers commented 2 years ago

The idea of "dialect" was comparatively recently introduced.

The name "dialect" is relatively new (if you consider three years new). The concept is at least as old as the $schema keyword and predates the vocabulary system. The earliest significant example is probably when Swagger chose to use a heavily modified version of JSON Schema. As it evolved and became OpenAPI, at least two more dialects have been defined. The vocabulary system allows you to define your own dialects, but it just formalizes what people have been doing all along.

{ "type": "string" } is one of the most common schemas in existence. Regardless of how they are technically handled today, they were not authored with the understanding of a dialect (they could not have been).

I don't understand what you're trying to say. I can create a dialect that does something weird with type and write that schema with the intention of it using my custom dialect. You seem to talking about a schema written before the vocabulary system existed and where third-party dialects are ignored. In that case, you'd be right. But the vocabulary system does exist and third-party dialects do exist. You can't ignore that.

These are the schemas we cannot break.

You say this as if we're considering a change to the specification that might break that schema. This problem already exists. Custom dialects that can do almost anything are already a reality. You can't put that genie back in the bottle. I'm not defending it. I'm just saying we have to accept what already exists in the wild.

Ok, but I'm not talking about custom dialects/vocabularies/meta-schemas. The question is: What happens in the default case?

We don't have one source of truth for what the default behavior is. Right now, every dialect declares it's own rules and historically, those rules have been very permissive. Even if in a future release we constrain what can be done in the default case, implementations that were written for previous drafts would not be affected. They could still choose whatever dialect they want (or error, or whatever) in the default case. If the default behavior is different between dialects, there's no way for an implementation to know which default to follow.

awwright commented 2 years ago

The concept is at least as old as the $schema keyword and predates the vocabulary system

The ability to use a keyword to change the dialect so that any keyword can mean anything does not go back that far.

The concept of "$schema" was introduced in draft-03, as a to provide a hint to validators. Its use by both authors and validators was completely optional:

5.29. $schema

This attribute defines a URI of a JSON Schema that is the schema of the current schema. When this attribute is defined, a validator SHOULD use the schema referenced by the value's URI (if known and available) when resolving Hyper Schema (Section 6) links (Section 6.1).

A validator MAY use this attribute's value to determine which version of JSON Schema the current schema is written in, and provide the appropriate validation features and behavior. Therefore, it is RECOMMENDED that all schema authors include this attribute in their schemas to prevent conflicts with future JSON Schema specification changes.

This is consistent with "$schema" being a meta-schema reference, that could optionally be used as a versioning heuristic.

At the very earliest, the idea that you could use "$schema" to switch behaviors was draft-04, when Hyper-Schema was published as a separate specification. Even at this point, I don't believe $schema was required in order to switch behavior. If I used a JSON Schema validator that was hypermedia enabled, I would expect the hypermedia functions to work even in the absence of the "$schema" keyword.

In draft-05 a.k.a. draft-wright-json-schema-00, I removed specific references to older values of "$schema", replacing it with a generic paragraph about how it's OK to implement values found in other publications. This is where the idea that "$schema" can switch behaviors properly comes from; and it was very carefully worded to maximize forward compatibility.

I can create a dialect that does something weird with type and write that schema with the intention of it using my custom dialect.

You seem to be caught up on the idea that I can define a custom dialect and give "type" whatever semantics I want. I am specifically saying that I am not using this functionality. When someone publishes a post on Stack Overflow about why "0" is valid against {"minimum": 1}, nobody is asking them what dialect they are using.

We don't need to ask and it doesn't matter. We know we're talking about the validation keywords defined in the latest draft. But how does a validator know this? Where is this written? It appears that it's not.

(Now if you want to adjust the settings, or write an implementation that does something special ($data), or create a new dialect that does something unexpected, go for it; that's not what I'm objecting to here.)

gregsdennis commented 2 years ago

@awwright @jdesrosiers

I get that you two are trying to come to a common understanding, but it seems to me like you're both splitting hairs.

This conversation has veered away from its original purpose, which was to unify implementations' behaviors when $schema is missing. I think we can agree that letting implementations do what they want goes against interoperability.

This isn't an error state like the other cases where we say "implementation-defined" or "undefined" behavior. In this case, we have a schema that can be processed. Implementations should do it the same way.

I don't really have stake in which way this goes, except that I need to know what to implement.

awwright commented 2 years ago

This conversation has veered away from its original purpose, which was to unify implementations' behaviors when $schema is missing. I think we can agree that letting implementations do what they want goes against interoperability.

This isn't an error state like the other cases where we say "implementation-defined" or "undefined" behavior. In this case, we have a schema that can be processed. Implementations should do it the same way.

Yes, I agree with this.

jdesrosiers commented 2 years ago

This conversation has veered away from its original purpose, which was to unify implementations' behaviors when $schema is missing.

Actually, I'm not sure we can answer this question without coming to an agreement on this question about the fundamental nature of how JSON Schema works. But, this discussion isn't making progress and we need to stop until we can come up with a more effective way to have this discussion.

My perspective on the question of what the default behavior should be is that there is no safe choice other than throwing an error. The problem with defaulting to a specific draft is that an implementation can't change that default without possibly breaking their user's code. People will depend on that default not changing, so you can't just change the default to the current version every time there is a new release. It would be fine if each release was backwards compatible, but that's not the case. The only safe thing to do if no dialect is known is to refuse to process the schema.

The other problem is that defining a default behavior would transcend dialects and would effect all implementations in existence regardless of what dialect they were written for. That's why I think the right place for this to be defined is the media-type specification that is currently in progress rather than in our next release. But, globally defining a default behavior would break many existing dialects. For example, OpenAPI 2.0 and 3.0 and MongoDB depend on assuming their dialect is the default. If we say the default is, for example, draft-07, then suddenly those OpenAPI and MongoDB schemas that don't declare a $schema (and can't because $schema isn't supported in those dialects) should technically be interpreted as draft-07.

awwright commented 2 years ago

Except for some standard assumptions, the fundamental nature of how JSON Schema ought to be defined in the specification. These assumptions would be:

A newer release of an Internet-Draft replaces older ones in their entirety.
Requirements in normative references are transitive: their requirements carry the same weight as the specification itself, as well as their normative references, etc (but not other references, like informative references).
- Exception: RFCs that obsolete older RFCs—you can use the newer RFC, or the latest document in the STD (e.g. JSON is STD 90)
Behavior not defined by any specification may be selected to be anything by an implementation. Behavior may be undefined because it's specified as undefined/ignored, or by omission.

@jdesrosiers Perhaps you can confirm which assumptions you're making, or which of these you disagree with, and add to this list anything you think is relevant.

For example: You've mentioned how $schema can be used to switch behaviors for $ref, a core keyword. But this is not clear. Section 3 defines a dialect as a "set of vocabularies"; Section 8 says that "$schema" sets the dialect; that the core vocabulary is required; and no keyword (core or otherwise) is defined in terms of the dialect or value of "$schema"; therefore, the core keywords must exhibit the same behavior regardless of dialect, or value of "$schema".

Is this an assumption you're making that ought to be self-evident, is this a proposal you're making, is this a contradiction in the spec to be fixed; or is this a faulty reading on my part?

karenetheridge commented 2 years ago

The problem with defaulting to a specific draft is that an implementation can't change that default without possibly breaking their user's code. People will depend on that default not changing, so you can't just change the default to the current version every time there is a new release.

Agreed. I only don't force this for the sake of brevity and convenience, but try to be clear about it: "if you're not specific about what version you want, it might change, and then you get to keep both halves." https://github.com/karenetheridge/JSON-Schema-Modern/commit/751aea146bc6bda972b40f4f5f7863af67e307a9

karenetheridge commented 2 years ago

For example: You've mentioned how $schema can be used to switch behaviors for $ref, a core keyword. But this is not clear. Section 3 defines a dialect as a "set of vocabularies"; Section 8 says that "$schema" sets the dialect; that the core vocabulary is required; and no keyword (core or otherwise) is defined in terms of the dialect or value of "$schema"; therefore, the core keywords must exhibit the same behavior regardless of dialect, or value of "$schema".

I disagree. I went over this in another issue recently (which I can't find in a search just now) that the only keywords we need to standardize across all versions and dialects are $id and $schema. As long as those two keywords' behaviours are understood in advance, everything else can be determined from that.

for example, this is quite possible:

{
  "$id": "http://example.com/my/strange/dialect",
  "$schema": "http://example.com/my/strange/dialect",
  ... some interesting new keywords...,
  "properties": {
    "$ref": {
      "type": "array"
    }
  }
}

{
  "$id": "http://example.com/my/schema",
  "$schema": "http://example.com/my/strange/dialect",
  "$ref": ["... something... ", "something else.." ]
}

I've just defined a metaschema where the $ref keyword MUST be an array. Now, as long as my implementation knows how to interpret schemas with "$schema": "http://example.com/my/strange/dialect", this works just fine - if I start out by considering the $schema keyword, that immediately tells me what dialect I'm using (whether it's a dialect I already know about, in which case I start following its evaluation rules, or it's a document I don't know, so I have to go fetch it (if my implementation supports that), and then examine its $schema keyword to see if it's a dialect I know, and so on..). I don't look at the $ref keyword until I've already determined what dialect I'm using, and by that point the rules for $ref might be totally different.

awwright commented 2 years ago

As long as those two keywords' behaviours are understood in advance, everything else can be determined from that.

But where in the spec does it say this? It appears that, according to what we have written, this is not actually legal to do.

(Not to get this issue too off-topic—behavior for when $schema is present I'll address in another issue—but I really would like an answer to this question.)

karenetheridge commented 2 years ago

But where in the spec does it say this?

It doesn't. I'm surmising it from how the specification defines keywords and dialects. But it would be good to explicitly say so in future versions.

It appears that, according to what we have written, this is not actually legal to do.

Why?

awwright commented 2 years ago

@karenetheridge I laid out my reasoning in the form of a proof... can you please point out where my interpretation goes wrong? Let me rephrase it:

"$schema" sets the dialect
The dialect is defined as a "set of vocabularies"
"The Core vocabulary MUST be considered mandatory at all times" and "The current URI for the Core vocabulary is: <https://json-schema.org/draft/2020-12/vocab/core>."
Keywords declared in Section 8 (which are all of the "$" keywords, and only "$" keywords) make up the JSON Schema Core vocabulary.
No exceptions are provided for validators to use a different definition for core keywords.
None of these core keywords changes its behavior with respect to the dialect or value of "$schema"
Therefore, the core keywords must exhibit the same behavior regardless of dialect, or value of "$schema".

I'm surmising it from how the specification defines keywords and dialects.

I don't have a problem with leaving things to implication, but we ought to be able to point to the various passages that together enable said behavior.

karenetheridge commented 2 years ago

I believe your statements break down at json-schema-org/json-schema-spec#3 (or maybe even json-schema-org/json-schema-spec#2), because by this point we are already assuming we're using draft2020-12 semantics for the schema. But $schema can indicate that a different version is in use -- a previous version, a subsequent version (that we can only reason about hypothetically), or something that is totally out of band that defines other vocabularies and keywords. Arguably we could state that this no longer JSON Schema, and is therefore entirely out of scope for consideration of discussion, but it doesn't have to be.

I've been suggesting that JSON Schema in the greater sense, transcending individual specification versions, only needs to fix [in the sense of immutably hardcode, not repair] the treatment of $id and $schema. This is also useful in the context of defining how application/schema+json and application/schema-instance+json are to be treated by the IETF, as draft submissions for those media types are being prepared.

But yes, all the other points apply so long as we assume we're operating under draft2020-12 semantics (that is, the chain of metaschemas eventually ends with something with $schema: https://json-schema.org/draft/2020-12/schema in it).

edit: grammar

awwright commented 2 years ago

@karenetheridge So why is it that when I follow any of our published drafts, even the obsoleted ones, that it produces a contradiction with what you're saying? Please cite for me the language you're reading.

And I have to assume draft2020-12. If you don't follow the latest draft then how do we achieve cross-platform compatibility?

Implementations have to be able to handle previous versions of schemas in a uniform way. And schema authors have to be able to write schemas knowing that implementations won't give contradictory "valid" results. The way to do this is by defining in the specification the best current practices for compatibility.

Simply saying "implement multiple versions of the specification" does not accomplish this goal:

Of all of the publications, which are implementations required to follow?
Which ones are optional?
What if the referenced version is unknown?
... Or known, but deliberately unimplemented?
... Or missing?
If I'm writing a schema, how do I ensure it won't generate a false positive on older validators? If I'm writing an implementation, how do I know what the author intended by an unknown keyword?
If I'm writing an implementation, how do I ensure I'm handling obsoleted behavior correctly?
Suppose we publish a version that's outright contradictory with an earlier release (e.g. it defines the same meta-schema URI.) Who "wins"?

You might have reasonable suggestions for these; but the fact remains: different implementations could do it differently, in incompatible ways, and there'd be nothing to say that's wrong. The fact that this isn't written down anywhere, and the fact there's no tests for these, implies this isn't how you read the specification.

And of course, when we publish a draft, it replaces the older ones. And language in the draft warns of this:

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.

If we want to support backwards compatibility in a cross-platform way, we have to write it into the specification.

handrews commented 2 years ago

@awwright

Error on a missing $schema keyword: This would break a very large number of existing implementations and documents. null wouldn't be valid, but it wouldn't necessarily be invalid either. (And even if we were authoring from scratch, requiring version identifiers in documents is strongly discouraged in Internet media types, and very unpopular among document authors.)

@gregsdennis

I think the intent in that specific paragraph is to say that an implementation can choose which draft it will use to process the schema. I expect most will choose the latest that they support.

@Relequestual

My preference would be to change what's defined here. What I'd really like is for hard error if $schema is not provided, however I understand (with frustration) that it might be impractical to require this.

I am fairly certain that the reason why I wrote that to allow refusing to process the schema (which is explicitly acknowledged, although not in normative language, in the section on default vocabularies), was to avoid mandating that an implementation attempt to process an unknown schema that might be a security concern.

I did not have a particular attack vector in mind for that, and it could be something relatively indirect like "we guessed this schema wrong, and therefore it said the instance was valid when it was not, and then our code depended on the validity in a way that caused a problem."

For some applications, guessing is not an appropriate behavior. But this was all just a vague thought at most at the time.

awwright commented 2 years ago

In https://github.com/json-schema-org/JSON-Schema-Test-Suite/issues/311#issuecomment-1164742431

Again, it was 100% intended that conforming implementations can refuse to process schemas that do not have $schema.

This is not what I have ever understood. This is different than permitting arbitrary behavior, but not by much. Previously, a validator would see a schema like { "type": "string" } and handle it with the standard behavior that "type" has always had. There was no way to write a (compliant) validator any other way.

With this change, according to your intention, a validator has the option of not producing a result at all. This is introducing non-interoperable behavior, because previously the behavior was well defined.

Sometimes breaking reverse compatibility is necessary, and only after extended discussion and examination of usage in the wild. But I don't think we've ever had a case where we walk back an interoperability requirement.

This also seems to be different from your comment above that the intention was to "to avoid mandating that an implementation attempt to process an unknown schema that might be a security concern". If this is the intention, that's different; I think that is completely reasonable. However, it could be better treated with a section on how validators can implement a subset of the specified behavior, if they must do so.

awwright commented 2 years ago

@handrews What would you say is the most narrow set of conditions where a validator can legitimately (i.e. trying its best to maintain interoperability) decline to process a schema?

Julian commented 2 years ago

With this change, according to your intention, a validator has the option of not producing a result at all. This is introducing non-interoperable behavior, because previously the behavior was well defined.

This doesn't sound accurate to me personally, as schemas without $schema are themselves non-interoperable, as they may include behavior that differs across versions, i.e. the entire notion of processing schemas without $schema is itself non-interoperable if you're looking across implementations. (I'm not sure why you pick an example that just so happens to be the same across versions, the point is as an implementer there are others which will not be. Calling such a thing well-defined seems like a significant stretch.)

awwright commented 2 years ago

as schemas without $schema are themselves non-interoperable

Sometimes, but not entirely. Take the example of {type: "string"}, there's still a certain amount of common functionality that could only possibly be implemented one way, and it would be compliant with every specification published to date.

And even to the extent this is true, it doesn't have to be; many behaviors that are currently non-interoperable can be remedied (for example, backwards compatibility).

Julian commented 2 years ago

Right, a particular subset of all schemas may behave the same across all specs. But the entire class of behavior of "process any arbitrary schema" is not, as some schemas do differ, so to me, no new interoperability that wasn't already present was introduced, this was already an issue previously.

awwright commented 2 years ago

I'd like to distinguish between non-interoperable (different behavior across implementations) and non-compatible (defined, undefined, or contradictory across specification versions). When a validator is not required to implement a keyword because it's been removed (or deprecated), some validators might continue to implement it, some might not—that's non-interoperable: Schema authors cannot rely on backwards compatibility for the removed keyword.

Nonetheless, a newer implementation could choose to maintain compatibility and implement it anyways.

So, there's a large class of non-interoperability that can be remedied by specifying how since-removed behavior must be handled.

The only cases where we cannot do this are when we "break" compatibility. We've only identified two of these cases so far, the 1.0 case and the $ref case.

Every other change that we've made, can be backported into earlier versions, or carried to newer versions (when deprecated/removed). This is https://github.com/json-schema-org/json-schema-spec/issues/1242

handrews commented 2 years ago

@awwright

With this change, according to your intention, a validator has the option of not producing a result at all. This is introducing non-interoperable behavior, because previously the behavior was well defined.

I agree with @Julian on this point, particularly regarding the artificiality of picking keywords with consistent behavior to make your point.

But I don't think we've ever had a case where we walk back an interoperability requirement.

What interoperability requirement would that be? While I may have missed it, I have previously looked for any cross-draft requirements and found none. Each draft replaces the previous one, they don't exist in some sort of quantum superposition. And they are drafts (whether we should still be calling them that is a separate concern, for now we are, and the word has meaning in the IETF context). There is no guarantee of interoperability.

In particular, there is nothing at all about implementations making a best-effort guess at how to process a schema when they do not know how it should be processed. It's simply not addressed. A schema that uses:

extends from draft-03
boolean required from draft-03
allOf from draft-04+
$id from draft-06+
numeric exclusiveMinimum from draft-06+

could technically be processed by an implementation that is aware of the union of all drafts, and processes them on a keyword-by-keyword basis. However, that is radically different processing model than I've seen in any implementation (although admittedly I have not surveyed for this exact functionality, and since I know you've implemented stuff, I assume you implemented something along these lines?).

I know that some implementations won't process anything unless you explicitly configure them for a particular draft. As far as I know, all implementations that will attempt to process a schema without either external configuration or $schema assume some specific draft. So there's no question about interoperability, and it doesn't matter whether the schema is {"type": "string"} or something that has changed over time. You get a particular draft, determined by the implementation.

And that, right there, is by definition implementation-defined behavior. I can't be guaranteed that a validator will understand my numeric exclusiveMinimum unless I tell it which draft I am using. For me, "interoperability" is about whether a schema I wrote will be processed as intended by someone else, without me having to tell them the processing rules out-of-band.

It has never been the case that you could rely on some random implementation somewhere processing your schema correctly without $schema. It might work for certain schemas, but that's not because of any sort of normative requirement.

For that matter, given that the spec has never said that an implementation MUST respect $schema, there are implementations that completely ignore it and only respond to external configuration, which is another implementation-defined behavior that has never been specified. There has never been a test regarding a conflict between external configuration and $schema, presumably because there was never a clear requirement regarding that scenario.

However, it could be better treated with a section on how validators can implement a subset of the specified behavior, if they must do so.

Not all implementations are validators. There is no subset of JSON Schema that must be universally accepted by all implementations.

If you want to talk only about validators, then you could construct such a subset. However, I do not agree that a validator should make such a best-effort try. As noted above, it would be a radically different processing model.

It's much simpler to say "If you, as a schema author, want predictable behavior, declare your expected processing rules in the schema".

handrews commented 2 years ago

@awwright

Nonetheless, a newer implementation could choose to maintain compatibility and implement it anyways.

So, there's a large class of non-interoperability that can be remedied by specifying how since-removed behavior must be handled.

Why should this be a goal?

I will also note that in newer drafts, you can theoretically assemble a cross-draft set of behaviors by including non-core vocabularies from multiple drafts. You just can't mix-and-match core.

I'm not sure how well any implementation would support overlapping vocabularies, which you would need to allow, for example, both ways that exclusiveMinimum and exclusiveMaximum work - we deliberately left those requirements vague to get feedback.

IIRC, we were in fact going to write a dialect that included the draft-04 exclusive* behavior in addition to the current behavior for OAS 3.1 in order to maintain compatibility. We defined a vocabulary that had draft-04 semantics wherever that differed, and the meta-schema used a oneOf or anyOf to allow for either syntax for keywords that differed. I also recall that @jdesrosiers tried to implement it and found it problematic.

Ultimately the OpenAPI folks decided to just move to 2020-12 and drop support for conflicting keywords from past releases.

awwright commented 2 years ago

What interoperability requirement would that be?

Whenever we define a specific behavior where there is exactly one correct behavior; that's an interoperability requirement. e.g. there is exactly one correct behavior for {type: "string"}, everyone can rely on everyone else having this specific behavior.

I should add: I'm omitting deprecation or removal (which is also, technically, walking back interoperability, since schemas cannot rely on the behavior being implemented anymore).

Note that this is true for meta-schemas and keywords alike, it's true at all levels. When we publish a specification and the meta-schema no longer includes, say, "extends" — there is no requirement that a validator support it anymore. If a schema cannot rely on other validators supporting "draft-03" or "extends" then it is non-interoperable.

Why should this be a goal?

Because interoperability is a goal. If we're only going to pursue select kinds of interoperability, we must be clear about which kinds we are talking about.

If a software only outputs "draft-03" schemas, and the validator only accepts "draft-04" schemas—can you really call that "interoperable" in any sense?

handrews commented 2 years ago

@awwright

I empathize with your frustration that not everyone sees this thing that you consider a self-evident goal (your answer to "why is this a goal" was "because it is a goal"), as there are a lot of parallels with my frustrations in other areas.

But I just don't understand where this is coming from.

If a software only outputs "draft-03" schemas, and the validator only accepts "draft-04" schemas—can you really call that "interoperable" in any sense?

Why you would expect that to work? I have never heard of any expectation that a given implementation would function across drafts. Was there ever a written requirement for such a thing? Not just an observation that it could be made to work?

If not, are there any popular implementations that do this?

I don't believe the test suite expects this, as it communicates the draft expectation out-of-band via the directory names.

Interoperability and compatibility aside, there's not even any expectation that implementations support older drafts at all. Plenty have dropped support for older drafts, or never implemented any older than whatever was current when they started.

awwright commented 2 years ago

Why you would expect that to work?

Because I generally expect some level of backwards compatibility when a revision to a protocol or file format is published. (Backwards compatibility is usually interoperable, e.g. in HTTP/1.1, HTTP/1.0 is handled in a standardized way.)

But in recent releases of JSON Schema, there is no requirement that the previous meta-schema be supported. Support is at the whim of each implementation.

I could release a validator that accepts only 2019-09, and the very next release could accept only 2020-12, and there'd be nothing to say this is wrong. But this seems wrong.

Julian commented 2 years ago

But in recent releases of JSON Schema, there is no requirement that the previous meta-schema be supported. Support is at the whim of each implementation. I could release a validator that accepts only 2019-09, and the very next release could accept only 2020-12, and there'd be nothing to say this is wrong. But this seems wrong.

This is true and yet does not mean retroactively one can say "this draft should work this way because we should be backwards compatible".

If we want to agree on principles we have to do so ahead of time, and structure changes around them, not declare some axioms afterwards and try to fit things retroactively.

I have to say I really do sympathize with Henry on some of these things -- none of the development was done in a vacuum, if we don't like the way a draft works we either get to speak up at the time or decide to change what we're doing now while recognizing that we have what we have right now.

handrews commented 2 years ago

I could release a validator that accepts only 2019-09, and the very next release could accept only 2020-12, and there'd be nothing to say this is wrong. But this seems wrong.

There has never been anything to say this is wrong. There has never been anything to say that multiple drafts need to be supported. There has never been anything to say, absent any in-document or out-of-band instruction regarding which draft is in use, that implementations are obliged to make a best-effort guess.

Nor has there ever been anything to forbid it, and that, too, is still the case. The only thing that has changed is to be more explicit about the lack of requirement.

If I am wrong, please point to these pre-2019-09 requirements.

handrews commented 2 years ago

If we want to agree on principles we have to do so ahead of time, and structure changes around them, not declare some axioms afterwards and try to fit things retroactively.

Yes, I'm 100% fine if the project, collectively, decides that cross-draft/version/whatever compatibility/interoperability is a goal that we want to have. But that's a different discussion than claiming that it has always existed and now has been broken. I doubt I would be on board with exactly what you're proposing for such interoperability, Austin, but I'd be entirely willing to discuss it and might end up convinced.

awwright commented 2 years ago

"this draft should work this way because we should be backwards compatible".

Can you be more specific?

There has never been anything to say this is wrong

Through draft-05 at least, a validator would not be able to reject a schema just because it didn't understand the meta-schema. Or at least, this was not required. It wasn't even suggested. Some amount of reverse compatibility was the default. It didn't even exist until draft-03, before which, new drafts introduced keywords like "divisibleBy" and "uniqueItems".

Now, maybe the behavior was under-specified, and it deserved a stronger definition. But this has reduced backwards compatibility, and this should be closely examined.

Julian commented 2 years ago

Can you be more specific?

I'm responding to comments like:

I generally expect some level of backwards compatibility when a revision to a protocol or file format is published.

But this seems wrong.

or implicit claims in

Through draft-05 at least, a validator would not be able to reject a schema just because it didn't understand the meta-schema

that because something worked a certain way until a certain point that it works that way forever.

We have a specification -- what you or I think is right or wrong based on external knowledge isn't relevant -- you need to cite where in the specifications the things you're claiming are written.

Julian commented 2 years ago

Without that of course your (or my, or anyone's) opinions are of course valuable to inform what we do going forward but have no bearing whatsoever on what's already written down clearly.

awwright commented 2 years ago

The context of those comments are things I would expect if interoperability is a goal. Where one party can rely on other (compliant) implementations to have a predictable behavior.

I could release a validator that accepts only 2019-09, and the very next release could accept only 2020-12, and there'd be nothing to say this is wrong. But this seems wrong [because they are not interoperable at all].

I generally expect some level of backwards compatibility when a revision to a protocol or file format is published [so that there is an upgrade path that authors can follow.]

If I send out a schema to two clients, Alice only accepts 2019-09, Bob only accepts 2020-12, then JSON Schema doesn't seem very interoperable! Now I don't expect everyone to be on the latest version of software, but I do expect to be able to fall back onto a version that everyone supports. But since 2019-09, JSON Schema doesn't proscribe backwards compatibility, at all.

Should interoperability—other implementations having predictable behavior—be a goal?

Do you agree with the conclusions I'm drawing from this definition?

Julian commented 2 years ago

I don't know what you mean by "conclusions". Draft 2019 and 2020 work the way they say they work. That's not mutable. No logical argument, or definition, or contortion can change what they say. You can work to change how draft 2022 works by trying to convince others to have the same definition of interoperability that you do, or to value interoperability more than it has been, either of which may explain why the 2 drafts don't work the way you expected them to.

You cannot change how the existing drafts work, and you seem to continue to try and use a logical argument to affect how 2 clear specifications work. I'm trying to point this out simply to try and make it plain how futile that line of thought is, it's just not how specifications work. If or once we get past that we can worry about more useful things like what we want to be, and stop trying to push walls for things that already are the way they are.

You also continue to make plainly incorrect statements, which is a frustrating way to have discussions:

But since 2019-09, JSON Schema doesn't proscribe backwards compatibility, at all.

No draft prior provided backwards compatibility guarantees. You are aware of this. There are at least 2 examples which I thought of very immediately of this when prompted. Please stop painting these two drafts as the first deviation. They deviate in a specific way you disagree with. I disagree with the "type": "integer" change and the "prefixItems" change and a bunch of others but that matters 0, the changes are done. If I want to affect future changes, I can speak up, as can you. What principles we want to hold true are irrelevant for affecting how previous drafts work, only new ones. The draft is normative, the principle is not.

awwright commented 2 years ago

@Julian There's a serious misunderstanding here. The theme of my posts today has been: Is interoperability a goal, do the drafts satisfy that goal, and what improvements can we make?

Draft 2019 and 2020 work the way they say they work. That's not mutable. You cannot change how the existing drafts work

This is worded like I disagree, but I'm not sure what you're arguing against.

No draft prior provided backwards compatibility guarantees

I did not mean to say that e.g. draft-03 proscribed reverse compatibility. Backwards compatibility was nonetheless supported, as a consequence of how it's written; and if we want to support the same kind of backward compatibility in the future, we will have to proscribe additional behavior.

For example: Except for some keywords that were removed, a draft-03 validator could be fed a draft-02 schema and it would be guaranteed to work—$schema was not a keyword then.

In contrast, you cannot take a 2019-09 schema and reliably expect it to work in validators compliant with subsequent drafts. Such a validator could reject all older schemas and this would be legal (that is the topic of this issue). This is what I mean when I say our support for backwards compatibility has been falling, and should be re-examined in the light of $schema and meta-schemas.

handrews commented 2 years ago

@awwright OK, after going for a walk and coming back to see these last couple of comments, I decided to do some serious digging.

You mentioned draft-05 after my last comment, and I had not looked at that or draft-06 when I was looking at $schema for the test suite discussion. I looked briefly at draft-07, but just to make sure it had not said anything new about unrecognized meta-schemas, which it did not. I did not read the text of draft-07 (either version) as closely as I did that of draft-03, draft-04, 2019-09, and 2020-12 (both versions).

Imagine my surprise when I discovered that there was in fact normative wording regarding past drafts in draft-05 through draft-07. From draft-05 (the relevant wording is carried through draft-07 unchanged):

Values for this property ["$schema"] are defined in other documents and by other parties. JSON Schema implementations SHOULD implement support for current and previous published drafts of JSON Schema vocabularies as deemed reasonable.

First of all, why did you not simply quote and link this specification text? I asked repeatedly for any sort of written evidence of compatibility requirements. All you needed to do was show this to me, and we could have saved a tremendous amount of effort and frustration. I had been working from the draft-03 and draft-04 text, which has no such requirements, and since you were referencing those as well, I assumed I was looking at the right thing.

Anyway, I was wrong in my assertion that there was never any language around this. There was, for four drafts (wright-00 and -01, handrews-00 and -01).

Having discovered that, I went and hunted down the PRs where you added that language, and where I removed that language.

@awwright, you added that language in PR https://github.com/json-schema-org/json-schema-spec/pull/50 "Fix a lot of id/ref/dereferencing problems", which you posted on Sept. 15, 2016 at 1:31PM PDT, and merged less than two days later with no reviews and no comments at 8:50 AM PDT. I went back through my email from around that time and could not find any relevant discussion on the old mailing list, although I did not spend too much time digging.

But it's pretty clear that even if that then-new backwards-compatibility SHOULD was discussed at some point, it was definitely not subject to a typical PR review and approval process. That doesn't invalidate it – we had not yet agreed on that sort of process question. But it's relevant in terms of how much you can claim to have gotten buy-in for it.

Furthermore, it's a SHOULD rather than a MUST, and comes with a "deemed reasonable" qualifier. That is hardly an ironclad guarantee of compatibility-interoperability. And the language does not give any guidance on what an implementation ought to do if it encounters a $schema of a draft it has not implemented.

Looking at that text, I don't see any reason that an implementation couldn't refuse to process such a schema. If it did not refuse to process it, then it would obviously be processing it by mismatched rules, and certainly by draft-07 that was pretty problematic. I know you don't think so, but I am not aware of any draft-07 implementation that would correctly handle draft-04's id and exclusive*, for example, while it was processing using draft-07's rules.

The case of not having a $schema at all does not seem to lend itself to interoperability either, as again AFAIK implementations assume a default draft, and processing by the wrong draft will produce an incorrect outcome. Definitely not interoperable.

Having dug that up, I went to find out where I'd removed that language, which was in PR https://github.com/json-schema-org/json-schema-spec/pull/671 '"$vocabulary" and basic vocabulary support.' opened on Nov. 10, 2018.

Initially, this PR was structured as a few commits that were each logical steps in the larger change, with the recommendation that each commit be reviewed separately. The change to $schema was in the first commit, which just rewrote that whole $schema section.

The PR was open for over a month before being merged on Dec. 17, 2018. There were a lot of review comments and updated commits. And, of course, it was nearly another year before 2019-09 actually went out.

The initial PR did not remove the SHOULD regarding past drafts. I went through every single resolved comment, and eventually found the discussion with Jesús González where I decided to remove it (you'll have to click "show resolved" to see it).

On Nov. 29th at 6:16 PM PST I wrote:

I ended up just removing this paragraph. It had more to do with when $schema was viewed primarily as a draft version marker, although it's actual described functionality has always been broader. Implementors will implement whatever versions are available and in demand, without any directive from the spec.

I also am interested in seeing whether specialized implementations choose to only support certain standard vocabularies. I'd like to see what happens without any advice from us for now.

A few hours later, at 3:34 AM PDT on Nov. 30th, you (Austin) commented on the PR. The timing is unfortunate, as I assume you reviewed it before I made that change.

However, you would have just gotten an email notification about the updated commit, which was pushed on Nov. 29th (GitHub's UI says Dec. 16th due to some rebase weirdness, but the git log shows the correct timestamp), and my comment on it.

I replied to your comment later on the 30th, but you never replied or otherwise interacted with the PR after that, so I don't know what you did or didn't see. But you were definitely aware that there was a major change up for review in this area, with active discussions and updates in response to feedback. And it stayed up for another two weeks or so.

Ben approved the PR on Dec. 4th, Greg on Dec. 16th, and I merged it on Dec. 17th.

Of course, I've missed stuff in reviews that I later realized were important. Missing the change in the PR does not automatically invalidate your concerns.

But please, stop treating this change in 2019-09 like a mistake that just slipped in. The PR was open for a long time, and had a lot of eyes on it, not just from the JSON Schema core team, but from other community members including one from the OpenAPI core team.

Regardless, I continue to assert that 2019-09 did not actually break anything. Your compatibility requirement was not strong enough for true interoperability, as it offered no guidance for handling unsupported drafts. Even though it's not at all clear what an implementation ought to do in such a case, and it's quite likely that without proper support for a draft, the result of evaluating it will often be wrong. It doesn't matter at all that {"type": "string"} would work anyway, because most schemas are more complex than that.

I'll also note that 2019-09 was the first draft to provide explicit guidance on cross-draft compatibility, albeit only for definitions/$defs and sort-of for dependencies/dependentRequired/dependentSchemas.

I am happy to discuss what sort of compatibility you think should be present in the spec in the future.

What we need to debate now is whether your concept of compatibility is the right one. I do not think it is, because even when your SHOULD was in the spec, as far as I can tell no one implemented the sort of cross-draft interoperability you think was implied. Please do feel free to demonstrate otherwise by linking to such an implementation.

I also don't think that a directive to implement versions back to draft-04 or whatever would go over well with the larger community, particularly folks who wrote their implementations recently enough that they only support 2019-09 and later.

I have thoughts on what sort of compatibility is needed, but I'm exhausted by this discussion right now and don't expect to get back to it in the necessary depth to explore that until next week sometime.

Relequestual commented 2 years ago

I am happy to discuss what sort of compatibility you think should be present in the spec in the future.

Yes, this. I don't see any value of discussing previous or even the current draft in terms of draft version compatability.

(Honestly, I thought the "compatability" discussion was clamping down on different implementations of the same drafts of JSON Schema, not cross-draft support for when schemas don't specify the draft).

The previous and current versions of JSON Schema are AS IS. There's no changing them. I know others have raised this as it seems like that's what @awwright you were proposing, but then you said you were not. Given multiple people think this is what you were proposing, whatever you're trying to say maybe doens't have enough context. It's not clear.

It feels like to me the majority of this discussion is off-topic (if we even agree on the purpose of the discussion, which I'm not sure we do).

Cross version compatability moving forward is something we have said we want to look at, but including previous drafts in that discussion feels... pointless.