Remove the notion of "canonical URIs" in favour of boundaried schema resources

Relequestual commented 2 years ago

I feel like this could be a good step forward. "Canonical URIs" is often confusing. We can simplify this by changing the language to talk about schema resource boundaries and not being able to reference across those boundaries without using the correct base URI.

I propose adding this to the draft-next milestone.

Would you be happy with something that reworked this to make addressing schema resources with non-canonical URIs something where the "behaviour is undefined"?

For the patch release, anything is an improvement, but yes I would like to see it described as "undefined". For the future, we need to stop using "canonical" in the spec. It's not the right abstraction. It's confusing, complicated, and I think it misses the point. An embedded schema is its own independent schema distinct from its parent (think iframe). A pointer can only point to a location within a single schema resource (I can't craft an xpath to point to a location within an iframe). That's the whole concept.

I don't see why implementations have such a hard time with this issue

It's not necessarily hard, it's just unnecessary complexity. It's much simpler to not have to track base URI and dialect changes depending on where you are in the schema. With strict boundaries, every schema has one base URI and one dialect no matter where you are. I'm lazy, I don't want to write code to track those changes when there's a neat and simple alternative conceptual model that doesn't require additional code and supports everything schema authors need. I just break down compound schemas when they are loaded and then I don't have to worry about anything changing. It's not even extra work because I have to break down the compound schema anyway for validation against the meta-schema.

Originally posted by @jdesrosiers in https://github.com/json-schema-org/json-schema-spec/issues/937#issuecomment-1017096308

handrews commented 2 years ago

@jdesrosiers If I understand things correctly, this is the main practical implication here:

I just break down compound schemas when they are loaded and then I don't have to worry about anything changing. It's not even extra work because I have to break down the compound schema anyway for validation against the meta-schema.

We already have this wording in §9.3 Compound Documents:

Each embedded Schema Resource MUST be treated as an individual Schema Resource, following standard schema loading and processing requirements, including determining vocabulary support.

This might need some clarification, including on how a context schema resource (one in which other schema resources are embedded) should be treated when loading, e.g. is the embedded schema resource stripped out (in which case JSON Pointer fragments that cross the resource line will automatically no longer work) or left in (in which case they may or may not work depending on how the implementation handles them)?

If it is stripped out, do we say it "MUST be replaced with a $ref to the resource's $id"? That would ensure that a reference to the embedded resource through the context's JSON Pointer fragment keeps working, and that a resource embedded under an applicator rather than a location keyword keeps working too, with only a slight change in the evaluation path (due to inserting a dynamic scope for the reference). Basically, this would convert the resource's retrieval URI (the context base URI + JSON Pointer fragment) into the URI for a reference to the resource.

We would probably also need language about copying the context's $schema into the root object of any embedded resource that lacks a $schema.
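To make the "strip out, replace with a $ref, copy $schema" idea concrete, here is a minimal sketch of what splitting at load time could look like. It is not taken from any implementation: `split_compound` is a hypothetical name, and the sketch naively assumes that any object with a string `$id` is a schema resource, which is not true in general (for example inside `const` or `enum`).

```python
from urllib.parse import urljoin


def split_compound(document, retrieval_uri):
    """Split a compound schema document into {absolute URI: schema resource}.

    Hypothetical helper. Each embedded resource is replaced in its parent
    by a $ref to its resolved $id, and inherits the parent's $schema when
    it does not declare one. The input document is not modified.
    """
    resources = {}

    def extract(resource, base_uri, inherited_dialect):
        # Resolve this resource's own base URI and dialect once.
        uri = urljoin(base_uri, resource.get("$id", ""))
        dialect = resource.get("$schema", inherited_dialect)
        body = {key: walk(value, uri, dialect) for key, value in resource.items()}
        if dialect is not None:
            body["$schema"] = dialect  # copy the context's $schema if missing
        resources[uri] = body
        return uri

    def walk(node, base_uri, dialect):
        if isinstance(node, list):
            return [walk(item, base_uri, dialect) for item in node]
        if not isinstance(node, dict):
            return node
        if isinstance(node.get("$id"), str):
            # Naive assumption: a string $id marks an embedded resource.
            return {"$ref": extract(node, base_uri, dialect)}
        return {key: walk(value, base_uri, dialect) for key, value in node.items()}

    extract(document, retrieval_uri, None)
    return resources
```

After this, every stored resource has exactly one base URI and one dialect, and a JSON Pointer that used to cross into the embedded resource lands on the inserted `$ref` instead.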

In §9.3.3 Validating (meaning "Validating Compound Documents"), we have:

Given that a Compound Schema Document may have embedded resources which identify as using different dialects, these documents SHOULD NOT be validated by applying a meta-schema to the Compound Schema Document as an instance. It is RECOMMENDED that an alternate validation process be provided in order to validate Schema Documents. Each Schema Resource SHOULD be separately validated against its associated meta-schema.

This would follow pretty naturally if compound schemas had to be split up when loaded.

The other paragraph in this section is:

A Compound Schema Document in which all embedded resources identify as using the same dialect, or in which "$schema" is omitted and therefore defaults to that of the enclosing resource, MAY be validated by applying the appropriate meta-schema.

This would presumably be dropped, unless we only want to mandate splitting of embedded resources with different $schema values than their context resources?

We might also need to update §9.4.2 References to possible non-schemas because now the case of embedding a schema resource in a location not known to be a schema (e.g. because there is a keyword involved that is not recognized as an applicator or location keyword) also has significant implications: it will not be recognized and therefore not split out from the compound document.
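A rough illustration of that last point, assuming a loader that only descends through keywords it knows to hold schemas. The keyword sets below are a partial list of 2020-12 keywords, and `find_embedded_ids` and `x-custom` are hypothetical names used only for this sketch.

```python
# Partial lists of keywords whose values are known to be schemas.
SINGLE_SCHEMA = {"not", "if", "then", "else", "items", "contains",
                 "additionalProperties", "propertyNames"}
SCHEMA_ARRAYS = {"allOf", "anyOf", "oneOf", "prefixItems"}
SCHEMA_MAPS = {"$defs", "properties", "patternProperties", "dependentSchemas"}


def find_embedded_ids(schema):
    """Return the $id values of embedded resources reachable through
    known schema-holding keywords only (hypothetical helper)."""
    found = []

    def visit(node, is_root):
        if not isinstance(node, dict):
            return  # boolean schemas and non-schema values
        if not is_root and isinstance(node.get("$id"), str):
            found.append(node["$id"])
        for key, value in node.items():
            if key in SINGLE_SCHEMA:
                visit(value, False)
            elif key in SCHEMA_ARRAYS and isinstance(value, list):
                for item in value:
                    visit(item, False)
            elif key in SCHEMA_MAPS and isinstance(value, dict):
                for sub in value.values():
                    visit(sub, False)
            # Unknown keywords are skipped: anything embedded under them
            # is never seen, so it would not be split out.

    visit(schema, True)
    return found
```

With this kind of traversal, a resource embedded under `$defs` is found, while one embedded under `x-custom` is silently left inside its parent.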

Does this cover the necessary changes? I would be in favor of this, as it simplifies the evaluation process and eliminates ambiguous JSON Pointer fragment behavior at the expense of a slight increase in schema load complexity. That seems like the right sort of trade-off, as a schema may be evaluated many times for each time it is loaded. It also helps remove the need for treating schema+instance and meta-schema+schema evaluation differently, which would be a good thing (the other part of what's needed for removal would be handled by #1281).

jdesrosiers commented 2 years ago

I'm a bit tired and that part of the spec isn't fresh in my head right now, but it doesn't seem necessary for there to be that many changes. I wasn't proposing that any functionality actually change, just the way we describe it.

I always try to avoid prescribing implementation details and I think much of what you're saying here sounds like implementation details. It seems to me that it should be enough to define embedded schemas as independent entities that are separate from the parent schema (which we do) and that the behavior of JSON Pointers that cross schema boundaries is undefined (which we sort of do in a confusing way). As long as you follow those constraints, you can implement it however you want. Breaking down compound schemas at load time seems the easiest thing to do to me, but I don't think it matters if someone handles it a different way as long as the behavior is the same.

handrews commented 2 years ago

Breaking down compound schemas at load time seems the easiest thing to do to me, but I don't think it matters if someone handles it a different way as long as the behavior is the same.

Agreed, I should have framed this more clearly as "is this an approach that captures what you want" rather than "is this the approach we should mandate."

Whether it happens on schema load or at some other time, I think we do need to decide if the resources are actually split and stored as separate in-memory entities, or if they are left as one entity and the URI storage just points to the embedded location in the larger document.

that the behavior of JSON Pointers that cross schema boundaries is undefined (which we sort of do in a confusing way)

If they are actually split, then:

- schema-crossing JSON Pointer fragments will just fail
- assuming we want the pointer URI from the containing resource that points to the root of the embedded resource to function as the embedded resource's retrieval URI, we need to figure out how that works if they are split. Or if it works, and if it needs to.
- we can drop §9.3.3, because by the time you validate something it is not a compound document anymore (this trades the current special meta-schema validation process for a special loading-schemas-as-instances process)

If they are not split, then things pretty much stay the same, unless we want to outright forbid schema-crossing pointers from working (currently they are allowed to work, just not interoperably).

jdesrosiers commented 2 years ago

I think we do need to decide if the resources are actually split and stored as separate in-memory entities, or ...

I don't think we do need to decide. I think this is an implementation detail. Even if we choose to make the requirement more strict than just, "it's undefined", we don't need to care if the schemas are split up at any point. Someone could alternatively write a custom JSON Pointer implementation that is aware of schema resource boundaries and knows to stop at those boundaries.
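As a sketch of that alternative, a boundary-aware pointer resolver might look something like the following. `resolve_pointer` and `BoundaryCrossed` are hypothetical names, and raising an error is just one choice for behaviour that the spec would leave undefined. The sketch also treats the root of an embedded resource as out of reach and, naively, treats any object with a string `$id` as a resource boundary.

```python
class BoundaryCrossed(Exception):
    """Raised when a pointer would step into an embedded schema resource."""


def resolve_pointer(resource, pointer):
    """Resolve an RFC 6901 JSON Pointer within a single schema resource."""
    if pointer == "":
        return resource
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    node = resource
    for raw in pointer.split("/")[1:]:
        # Unescape per RFC 6901: "~1" first, then "~0".
        token = raw.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
        if isinstance(node, dict) and isinstance(node.get("$id"), str):
            # Naive assumption: a string $id marks an embedded resource.
            raise BoundaryCrossed(f"{pointer!r} crosses into {node['$id']!r}")
    return node
```

Nothing is split in memory here. The compound document stays as one entity and only the pointer resolution knows where one resource ends and another begins.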

we can drop §9.3.3, because by the time you validate something it is not a compound document anymore

I think we could drop that section without any changes. It's a natural consequence of constraints that are already well defined. It was for this reason that I recommended that that section not be included when it was first written. It's a helpful note, but it doesn't add any new constraints that need to be implemented.

handrews commented 2 years ago

OK, so there's really no functionality change at all here. Implementations are just as free to bleed over schema resource boundaries as before.

jdesrosiers commented 2 years ago

OK, so there's really no functionality change at all here. Implementations are just as free to bleed over schema resource boundaries as before.

Yes, my intention wasn't to change anything, just to describe it differently. Of course we could decide to be more strict about the boundary, but that would be a different issue.

Remove the notion of "canonical URIs" in favour of boundaried schema resources #1183