json-schema-org / json-schema-spec

The JSON Schema specification
http://json-schema.org/

JSON equality vs JSON Schema equality #474

Closed gregsdennis closed 6 years ago

gregsdennis commented 6 years ago

JSON defines equality for arrays as sequence equality. That is, two arrays are equal only if they contain the same elements in the same order.

However, in JSON Schema, two schemas can be equivalent even when their JSON representations are not equal. For instance:

{ "type" : [ "object", "array" ] }
{ "type" : [ "array", "object" ] }

These are not JSON-equal, but they are equivalent JSON Schema.

Is there any verbiage in the documentation or specification regarding this? It seems counter-intuitive that a schema could be equally represented by two unequal JSON documents.

This can also cause pains for implementations.
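As a concrete sketch of what an implementation has to do here, an order-insensitive comparison for `type` might look like this (`typeEquals` is an illustrative helper, not part of any real library):

```javascript
// Compare two "type" values as unordered sets: a string is equivalent
// to a one-element array, and the order of entries is not significant.
function typeEquals(a, b) {
  const toSet = (t) => new Set(Array.isArray(t) ? t : [t]);
  const sa = toSet(a);
  const sb = toSet(b);
  return sa.size === sb.size && [...sa].every((t) => sb.has(t));
}

typeEquals(["object", "array"], ["array", "object"]); // true
typeEquals(["object", "array"], ["object"]);          // false
```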

handrews commented 6 years ago

@gregsdennis I'm not sure what you are looking for here. There are many ways to write functionally equivalent schemas. To determine whether two schemas are equal, you would need to normalize them in some way. However, I would not do that, I would simply assign $ids appropriately and rely on that. Even when $id is not explicitly assigned, it is easily calculated from a base URI using JSON Pointer fragments.
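One possible normalization of the kind described above, as a sketch: treating only `type` and `required` as order-insensitive is an assumption here; the full set of such keywords depends on the vocabulary in use.

```javascript
// Keywords whose array values are order-insensitive per their definitions.
const UNORDERED = new Set(["type", "required"]);

// Recursively sort object keys, and sort arrays that sit under an
// order-insensitive keyword, so equivalent schemas serialize identically.
function normalize(schema, keyword) {
  if (Array.isArray(schema)) {
    const items = schema.map((s) => normalize(s));
    return UNORDERED.has(keyword) ? items.slice().sort() : items;
  }
  if (schema !== null && typeof schema === "object") {
    const out = {};
    for (const key of Object.keys(schema).sort()) {
      out[key] = normalize(schema[key], key);
    }
    return out;
  }
  return schema;
}
```

`JSON.stringify(normalize(a)) === JSON.stringify(normalize(b))` then serves as the equality check; note that the ordered array form of `items` is deliberately left untouched.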

gregsdennis commented 6 years ago

My point is that different implementations may write a given schema differently, and I couldn't find anything in the documentation or specification that states rules for equivalent/equal schemas.

To further the point: though functionally equivalent, I wouldn't necessarily consider these to be "equal" schemas:

{ "type" : [ "object", "array" ] }
{
  "oneOf" : [
    { "type" : "object" },
    { "type" : "array" }
  ]
}
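For what it's worth, the functional equivalence of these two forms can be checked directly, since "object" and "array" are disjoint JSON types. This is a sketch with illustrative function names, not any real validator's API:

```javascript
// Minimal type check covering only what these two schemas need.
function jsonType(instance) {
  if (Array.isArray(instance)) return "array";
  if (instance === null) return "null";
  return typeof instance; // "object", "string", "number", "boolean"
}

// { "type": ["object", "array"] }
function validatesTypeList(instance) {
  return ["object", "array"].includes(jsonType(instance));
}

// { "oneOf": [ { "type": "object" }, { "type": "array" } ] }
function validatesOneOf(instance) {
  const matches = ["object", "array"].filter((t) => jsonType(instance) === t);
  return matches.length === 1; // oneOf: exactly one branch must match
}
```

Because no instance is both an object and an array, the two functions agree on every input, even though the schemas are structurally very different.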

I'm not sure what you mean by assigning $ids. That doesn't apply to my examples.

awwright commented 6 years ago

Is there any reason to believe the behavior would be different?

While those are technically two different schemas, the behavior (as a consequence of the definition of the keyword) will still be the same. I don't really think this needs elaboration, unless there's something specifically suggesting otherwise?

gregsdennis commented 6 years ago

In the first example I gave, I would consider the two schemas equal. In the second, they are not, even though they are functionally equivalent.

The problem with the first example is that the JSON is different even though the schemas are the same. I'm not saying that a schema has to be normalized in any way. However, I do think there needs to be some definition of schema equality beyond "functionally equivalent."

mokkabonna commented 6 years ago

Stumbled upon this issue. I have written an npm module that compares two JSON schemas according to the definitions of the keywords. For example, type and enum are compared as arrays where order does not matter. https://www.npmjs.com/package/json-schema-compare

As a result of that work, my opinion is that a definition of schema equality already exists; it just isn't stated explicitly at the schema level. The keywords clearly define when ordering matters and when it does not, and when certain values are equal: for minLength, minItems, and minProperties, undefined and 0 are equal.

The schema equality rules are the sum of the keyword equality rules. That is enough to evaluate whether two schemas are equal at a basic level. Of course, more complicated schemas that are not structurally equal can also be equivalent, but that is much harder to determine.
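The minLength/minItems/minProperties point, sketched per keyword (names are illustrative and this is not json-schema-compare's actual API):

```javascript
// An absent minLength, minItems, or minProperties is treated as 0,
// per those keywords' definitions; type and required compare as
// unordered arrays; everything else falls back to strict equality.
const ZERO_DEFAULT = new Set(["minLength", "minItems", "minProperties"]);
const ORDER_FREE = new Set(["type", "required"]);

function keywordEqual(keyword, a, b) {
  if (ZERO_DEFAULT.has(keyword)) {
    return (a === undefined ? 0 : a) === (b === undefined ? 0 : b);
  }
  if (ORDER_FREE.has(keyword)) {
    const sorted = (v) => [].concat(v === undefined ? [] : v).sort();
    return JSON.stringify(sorted(a)) === JSON.stringify(sorted(b));
  }
  return JSON.stringify(a) === JSON.stringify(b); // default: strict
}
```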

handrews commented 6 years ago

@gregsdennis, @mokkabonna's approach is what I would go with.

Let's take a step back. Why do you need a formal definition of schema equality? There are really two questions that you can ask:

In some cases, such as {"type": ["array", "object"]} and {"type": ["object", "array"]}, they are functionally equivalent, but (as @mokkabonna describes) can easily be transformed into exactly equal schemas. In other cases (the variation with oneOf) that transformation is possible but non-trivial.

But the only need I've ever had for comparing schemas is whether it's a schema that I've seen before. I use this for loop detection when following $refs, among other things. For that, I examine the explicit $id, or if there isn't one of those, I calculate such a value from the base URI.
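A sketch of that bookkeeping (the URI-computation details here are illustrative assumptions, not spec-mandated behavior):

```javascript
// Identify a schema by URI: the explicit $id resolved against the base
// URI when present, otherwise the base URI plus a JSON Pointer fragment.
function schemaUri(schema, baseUri, pointer) {
  if (schema && typeof schema === "object" && typeof schema.$id === "string") {
    return new URL(schema.$id, baseUri).href;
  }
  return baseUri + "#" + pointer;
}

// "Have I seen this schema before?" check for loop detection while
// following $refs; returns true on the second and later encounters.
function seenBefore(schema, baseUri, pointer, seen) {
  const uri = schemaUri(schema, baseUri, pointer);
  if (seen.has(uri)) return true;
  seen.add(uri);
  return false;
}
```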

Yes, it's possible that someone might assign the same $id to different schemas, or give functionally equivalent schemas different $ids (explicitly or implicitly). But it's possible to write garbage in nearly any system, and I don't think that it is JSON Schema's responsibility to prevent that.

JSON Schema provides a way for authors to assign identifiers (specifically URIs) to schemas, and that is how you determine whether you are looking at the "same" schema.

What else do you need to do, and why?

gregsdennis commented 6 years ago

The use case that highlighted this issue for me was a test that I run as part of my implementation. The test performs the following steps:

  1. Download the metaschema as JSON
  2. Deserialize into a schema
  3. Validate the JSON against the schema (self-validation)
  4. Serialize the schema back to JSON
  5. Compare the new JSON against the original

It's this last step that fails because, due to how I've implemented the type keyword, my implementation doesn't always write the values in the same order they were read.

I understand that this reveals a flaw in my implementation, but it also highlights the fact that arrays in schema are unordered, whereas arrays in JSON are ordered. And if JSON Schema is to be represented by JSON, then this difference should be resolved (or otherwise documented).

handrews commented 6 years ago

Why do you even need to do step 5? What benefit does it provide? If it is purely for testing, then you should either normalize as @mokkabonna suggested, or use a reversible deserialization process that records important orderings or whatever else you need.

then this difference should be resolved

I don't know what that would mean

or otherwise documented

It is already documented on a keyword by keyword basis, which is the only possible way to do it. JSON Schema is a media type with an open-ended set of vocabularies, which may be further extended or restricted by users.

handrews commented 6 years ago

To explain a bit more about my resistance here, it has to do with cost/benefit tradeoffs.

Your use case seems very specific and obscure, and how to "fix" it beyond the existing per-keyword documentation is unclear. Imposing an ordering on type and similar keywords would make schemas harder to write while providing no benefit to the vast majority of users.

Saying things like "all arrays are unordered" is untrue: the array form of items is ordered, for instance. And while ordering is not particularly important for type or enum as defined, it's easily possible that someone may construct an extension keyword that MUST be treated as an unordered array.

I just don't see a change here that would provide more benefit than the complication it adds for the vast majority of use cases.

gregsdennis commented 6 years ago

The fifth step is the round-trip check: deserializing and then re-serializing shouldn't change anything.

It still seems odd to me that a single schema can be represented by different JSON documents, and I think there should be a note about equality somewhere. Maybe the root of it is simply that JSON doesn't have unordered, non-keyed collections.

If you think that this is covered in the existing wording, I won't push it further.

handrews commented 6 years ago

Two closing thoughts:

gregsdennis commented 6 years ago

I like the idea expressed in your second point.