json-schema-org / json-schema-spec

The JSON Schema specification
http://json-schema.org/

Machine-readable dialect (not vocabulary) definition document #1423

Open gregsdennis opened 1 year ago

gregsdennis commented 1 year ago

IMPORTANT: This changes how meta-schemas are organized but not really how they work.

Relevant to this discussion:

I've been thinking about all of these ☝️ things together to get a larger picture of where vocabularies could go. The discussions I've been a part of have all described a vocabulary definition file as serving several purposes:

Impact to the Meta-Schema

The ⭐ in particular is where the meta-schema is changed. Currently the schema for a keyword's value is contained in the meta-schema body, generally under a properties keyword. However, if the vocabulary definition file carries and enforces the schema for a keyword's value, then the meta-schema's entry is redundant. This means that the entire properties keyword for a meta-schema could be removed as it's all in the vocab files.

I don't think this is a breaking change, however. A significant reorganization, sure, but the functionality is all still there. Moreover, we can make this change iteratively.

Suppose the only change we make to how the meta-schema is processed is that $vocabulary acquires some validation behavior, applying the keyword schemas from all of the vocabularies it lists (it becomes an in-place applicator similar to properties). Ideally, those keyword schemas would be the same as what's already in the meta-schema. However, even if they're not, the meta-schema is defining a dialect by virtue of declaring a set of vocabularies. In doing so, it's free to apply additional constraints to keywords.

For example, consider a modified Validation meta-schema where I've required that enum have unique values (which isn't a current requirement):

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "...",
  "$vocabulary": {
    "https://json-schema.org/draft/2020-12/vocab/validation": true
  },
  // ...
  "properties": {
    // ...
    "enum": {
      "type": "array",
      "items": true,
      "uniqueItems": true
    },
    // ...
  },
  // ...
}

enum, as defined in the vocabulary, doesn't have the uniqueness constraint. This is actually possible now: the above meta-schema should be supported without any issues.

Now consider adding in-place-applicator / assertion functionality to $vocabulary which (for enum) enforces the type and items constraints but not uniqueItems. The functionality of this meta-schema is unchanged.
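To make the layering concrete, here is a minimal Python sketch, not a proposed implementation. It assumes a hypothetical `VOCAB_KEYWORD_SCHEMAS` table standing in for whatever the vocabulary file would carry, and a toy `check` function that understands only `type: "array"` and `uniqueItems`. All names are made up for illustration.

```python
# Hypothetical stand-in for the keyword schemas a vocabulary file would carry.
VOCAB_KEYWORD_SCHEMAS = {
    "enum": {"type": "array", "items": True},
}

# Extra constraints the meta-schema author adds under `properties`.
META_SCHEMA_PROPERTIES = {
    "enum": {"type": "array", "items": True, "uniqueItems": True},
}


def check(value, schema):
    """Toy validator: supports only `type: "array"` and `uniqueItems`."""
    if schema.get("type") == "array" and not isinstance(value, list):
        return False
    if schema.get("uniqueItems") and isinstance(value, list):
        # Compare by repr so unhashable values (lists, dicts) can be handled.
        # This is a simplification, not JSON Schema's definition of equality.
        seen = [repr(item) for item in value]
        if len(seen) != len(set(seen)):
            return False
    return True


def validate_keywords(schema):
    """Apply the vocabulary's keyword schema and the meta-schema's own."""
    errors = []
    for keyword, value in schema.items():
        for source, table in (("vocabulary", VOCAB_KEYWORD_SCHEMAS),
                              ("meta-schema", META_SCHEMA_PROPERTIES)):
            keyword_schema = table.get(keyword)
            if keyword_schema is not None and not check(value, keyword_schema):
                errors.append((keyword, source))
    return errors
```

With these tables, `{"enum": ["a", "a"]}` fails only the meta-schema's extra constraint, while `{"enum": "a"}` fails both layers. The combined result is the same as what the meta-schema alone would report, which is the sense in which the functionality is unchanged.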

Going further, we could change the original Validation meta-schema to this:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://json-schema.org/draft/2020-12/meta/validation",
  "$vocabulary": {
    "https://json-schema.org/draft/2020-12/vocab/validation": true
  },
  "$dynamicAnchor": "meta",
  "title": "Validation vocabulary meta-schema",
  "type": [
    "object",
    "boolean"
  ]
}

We don't need properties because that's only defining the keywords, which are now defined in the vocabulary document identified by https://json-schema.org/draft/2020-12/vocab/validation, and we don't need $defs because that was only used to support the subschemas in properties.

In fact we may not even need the vocab meta-schemas anymore. Because the top-level meta-schema lists all of the vocabularies, it would automatically perform all of the validation that the vocab meta-schemas currently provide. We could remove the allOf making it just:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://json-schema.org/draft/2020-12/schema",
  "$vocabulary": {
      "https://json-schema.org/draft/2020-12/vocab/core": true,
      "https://json-schema.org/draft/2020-12/vocab/applicator": true,
      "https://json-schema.org/draft/2020-12/vocab/unevaluated": true,
      "https://json-schema.org/draft/2020-12/vocab/validation": true,
      "https://json-schema.org/draft/2020-12/vocab/meta-data": true,
      "https://json-schema.org/draft/2020-12/vocab/format-annotation": true,
      "https://json-schema.org/draft/2020-12/vocab/content": true
  },
  "$dynamicAnchor": "meta",

  "title": "Core and Validation specifications meta-schema",
  "type": ["object", "boolean"]
}

(I've also removed the deprecated keywords listing.)

Adoption

First of all, we've agreed that vocabularies and the $vocabulary keyword are (at best) unstable, so modifying it (even in a breaking way) isn't out of the question.

Adding in-place-applicator / assertion behavior to $vocabulary in the way described above isn't a breaking change as long as we copy the keyword schemas correctly.

Later, once $vocabulary is promoted to being a stable feature, we can update the meta-schemas to remove the redundancies.

Readability and Accessibility

There is an issue of readability and accessibility when all of the keywords are defined in vocab files. While most people would be used to just looking in the meta-schema to see what keywords are available and how they're defined, now they'd have to follow another file reference to get that same information.

I don't think this is a big issue, though, and people will eventually get used to it.

On the other hand, creating a new meta-schema is immensely easier: you just list the vocabularies you want, and everything else is taken care of.

Automatic Support for Undefined Keyword Checking

With this in place, implementations will be able to look at the vocab files to see if and how a keyword is defined.

Further, the implementation would be able to detect attempts to circumvent the "keywords must be defined in vocabs" requirement by defining a new keyword directly in the meta-schema. Currently, detecting this is troublesome for implementations (annoying but not impossible).
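As a rough Python sketch of that check: it assumes each vocab file can be reduced to a set of keyword names, which is a hypothetical shape since the vocab file format isn't settled. `undefined_keywords` is an illustrative name, and the function only looks at the top level of the schema.

```python
def undefined_keywords(schema, vocab_files):
    """Return keywords used in `schema` that no listed vocabulary defines.

    `vocab_files` maps a vocabulary ID to the set of keyword names it
    defines (an assumed shape, for illustration only).
    """
    defined = set()
    for keywords in vocab_files.values():
        defined |= set(keywords)
    # Only top-level keywords are inspected; subschemas are not walked.
    return sorted(k for k in schema if k not in defined)
```

The same function could be pointed at a meta-schema's `properties` object to flag keywords that were declared there without a backing vocabulary.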

(There may be some intersection here with x- keywords, but I haven't thought about it too hard.)

$vocabulary Requires Special Treatment

Currently $vocabulary is only to be processed when the schema that contains it is being processed as a meta-schema. I don't think this should change as it only defines what keywords the instance (another schema) can use.

In this way, maybe it does break the nice symmetry we have around "a meta-schema validating a schema" is just "a schema validating an instance." But it could be argued that such symmetry was broken when $vocabulary was introduced.

It may have an impact on the Test Suite since we do have a number of tests that validate schemas based on the meta-schema, and they'd need to be updated to pass along the context of "this is a meta-schema evaluation" in order to get the validation result from $vocabulary.

Out of scope

I haven't addressed

I'd like to get the concept defined before we start considering mechanics.

Julian commented 1 year ago

You wanted me to respond specifically to the

It may have an impact on the Test Suite since we do have a number of tests that validate schemas based on the meta-schema, and they'd need to be updated to pass along the context of "this is a meta-schema evaluation" in order to get the validation result from $vocabulary.

paragraph here right?

I don't think any changes need to be made for historical drafts, as they work the way they currently work; any changes strictly apply to future releases.

Isn't it a bit too early to think about precisely how to restructure the suite until we nail down how we want the feature to work (which is the purpose of this issue otherwise, no)?

But if you're asking for agreement strictly on whether I think it's useful for us to have a way to have tests which specifically indicate they're testing validity of schemas, yes, that I certainly already agree with. We already have a use and need for such a thing, given that the definitions of "valid schema in version XYZ" and "schema valid under XYZ's metaschema" are not the same (this was an old discussion about it, though we never made progress). In general, implementations don't do anything besides the latter because the former is complex and has no test cases. So I think we already have this need today, and I'm definitely OK with it if it's even more needed in the future.

Lemme know if I missed the point though on what you were hoping for feedback on.

Julian commented 1 year ago

(Posting this separately from the above as it's not related to testing.) It seems odd to me at first glance to put validation in $vocabulary, and as you say to thereby duplicate stuff that currently lives in the metaschema. To me a less drastic change is simply to have metadata indicating important things like "where are subschemas in this keyword", something that today lives nowhere.

So for properties you'd indicate "subschemas live in my values", for instance.

The rest of the ideas here I have to mull on I suppose -- I don't know that I understand from first read what advantage there is in changing $vocabulary in this way, rather than introducing a new keyword with the semantics we want.

EDIT: ok, I think you're trying to answer the latter with:

On the other hand, creating a new meta-schema is immensely easier: you just list the vocabularies you want, and everything else is taken care of.

jdesrosiers commented 1 year ago

There was a purpose to decoupling the semantics of a vocabulary from its syntax. For example, if you want to create a dialect that uses the type keyword, but not the array form (OpenAPI 3.0), you can use the standard vocabulary with a custom meta-schema. The type keyword in that dialect has the same semantics as the standard version, but with restricted syntax.

Moving the meta-schema into the vocabulary definition affects this property of the vocabulary system. It can still be done, but it would be a little different. You would need to include your custom syntax directly in the meta-schema. The result would be that both the default and the custom schema are applied. That can lead to some duplication, but might also be a good thing because it makes it impossible to define a syntax that contradicts the default (for example, using a type the original syntax doesn't support). (Personally, I always thought things like that should just be defined as distinct keywords. So, I'm not concerned if we end up losing the semantics/syntax decoupling in the end.)

In this way, maybe it does break the nice symmetry we have around "a meta-schema validating a schema" is just "a schema validating an instance." But it could be argued that such symmetry was broken when $vocabulary was introduced.

I don't see $vocabulary as it currently is as creating any special case when validating a schema against a meta-schema. $vocabulary is just an annotation that doesn't affect validation. I see it as two types of evaluation of the schema. One evaluation is determining vocabularies, which uses the $vocabulary annotation. The other evaluation is validation, where $vocabulary has no effect.

I think having the schema in the vocabulary definition would effectively be the same thing except that $vocabulary isn't just an annotation anymore, it's also an applicator. However, the lack of symmetry between defining a schema and defining a meta-schema feels awkward. I'm not sure how I feel about that.

I should mention that it's already the case that we need to handle validating a schema against a meta-schema differently than a normal instance against a schema. If validating a Compound Schema Document that includes an embedded schema with a different dialect than the parent, simple validation against a meta-schema doesn't work. You need to disassemble the bundle and validate each Schema Resource individually.
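The "disassemble the bundle" step could look roughly like this in Python. This is a sketch under a strong assumption: embedded resources are only searched for directly under `$defs` (where bundlers commonly place them), and every schema involved is an object rather than a boolean. `split_bundle` and `default_dialect` are hypothetical names.

```python
import copy


def split_bundle(document, default_dialect):
    """Split a bundled schema into (dialect, resource) pairs.

    A subschema under `$defs` that carries its own `$id` is treated as an
    embedded Schema Resource and detached from its parent.
    """
    root = copy.deepcopy(document)  # leave the caller's document untouched
    resources = []

    def visit(schema, inherited_dialect):
        # A resource without `$schema` inherits its parent's dialect.
        dialect = schema.get("$schema", inherited_dialect)
        defs = schema.get("$defs", {})
        for name in list(defs):
            sub = defs[name]
            if isinstance(sub, dict) and "$id" in sub:
                del defs[name]
                visit(sub, dialect)
        resources.append((dialect, schema))

    visit(root, default_dialect)
    return resources
```

Each returned resource can then be validated against the meta-schema for its own dialect, instead of validating the whole bundle against the parent's meta-schema.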


One very important thing that I think the Vocabulary System is missing is the ability to declare the use of a keyword or vocabulary in the schema without needing to construct a whole new dialect. Constructing a custom dialect is too much to ask of users who just want to use one keyword in one schema. I'm not sure that introducing that functionality is something we can fit into the current vocabulary system. It might need drastic changes. So, my concern is, is this proposal an incremental improvement to a system that's ultimately a dead end? I think we need to take a step back, identify all the things we want out of a vocabulary system, and determine if the current approach is viable or we need to try something different. If we determine that it is viable, then I'd feel a lot better about working on incremental changes like this.

gregsdennis commented 1 year ago

To me a less drastic change is simply to have metadata indicating important things like "where are subschemas in this keyword", something that today lives nowhere. - @Julian

Including that information is definitely part of this proposal; it's just an undefined part right now. But, yes, it definitely needs to be included in the vocab file.

I think performing this as a multi-step process is a good thing (iterative changes and all that). If you think defining keyword meta-data and moving the keyword meta-schemas need to be separate steps, I'm okay with that.

gregsdennis commented 1 year ago

First of all, we've agreed that vocabularies and the $vocabulary keyword are (at best) unstable, so modifying it (even in a breaking way) isn't out of the question.

Would people feel any better about this if, instead of changing $vocabulary, we deprecated it and replaced it with $dialect using the definition I listed above (a list of vocabulary IDs which point to files that contain keyword definitions)?

I think that "vocabulary" is an overloaded term at this point, anyway. Really, because we determined that a dialect is defined by a collection of vocabularies, what the $vocabulary keyword means is "dialect". This brings that meaning into the meta-schema.

gregsdennis commented 1 year ago

I'd like to leave this here, just to record it, but ultimately, I think we need to discuss it elsewhere. I just want to wrap up the larger conversation before opening a new issue for this.


@jdesrosiers and I were chatting over DMs where he proposed the idea of breaking this up further so that each keyword has its own file. If we do this, then a vocabulary is just a collection of keyword file IDs (and probably a description, etc.). Doing this would potentially allow individual keywords to be added directly into new-form-$vocabulary or $dialect or whatever we end up using.

This would mean that vocabularies are convenient groupings of common keywords, and individual keywords can still be added to extend the vocabularies.

There was also some discussion around potentially being able to add keyword file references directly into the schemas that needed to use them via a $use keyword or something. This alleviates the need to create a custom meta-schema in order to use custom keywords. However, this has a similar issue to the proposals we got for $ignore and $sigil when discussing how the meta-schema could validate ad-hoc SVAs: we would need some kind of data keyword to do this.

gregsdennis commented 1 year ago

The more I let this sit, the more I like this idea.

Concept

Core meta-schema

{
  "$schema": "https://json-schema.org/meta/schema",
  "$id": "https://json-schema.org/meta/schema",
  "$dialect": [
    "https://json-schema.org/dialects/core",
    "https://json-schema.org/dialects/applicator",
    "https://json-schema.org/dialects/unevaluated",
    "https://json-schema.org/dialects/validation",
    "https://json-schema.org/dialects/meta-data",
    "https://json-schema.org/dialects/format-annotation",
    "https://json-schema.org/dialects/content"
  ],
  "$dynamicAnchor": "meta",

  "title": "Core and Validation specifications meta-schema",
  "type": ["object", "boolean"]
}

Core dialect

{
  // do we need `$schema` here?  maybe (read on)
  "$id": "https://json-schema.org/dialects/core",
  "$keywords": [
    "https://json-schema.org/keywords/$id",
    "https://json-schema.org/keywords/$schema",
    "https://json-schema.org/keywords/$ref",
    "https://json-schema.org/keywords/$anchor",
    "https://json-schema.org/keywords/$dynamicRef",
    "https://json-schema.org/keywords/$dynamicAnchor",
    "https://json-schema.org/keywords/$dialect",
    "https://json-schema.org/keywords/$comment",
    "https://json-schema.org/keywords/$defs"
  ],

  "title": "Meta-schema core dialect"
}

properties keyword file

{
  // do we need `$schema` here?  maybe (read on)
  "$id": "https://json-schema.org/keywords/properties",
  "roles": [ "applicator", "annotation", "assertion" ], // because it does all three
  "name": "properties", // maybe implicit by the id?

  "type": "object",
  "additionalProperties": {
    "$dynamicRef": "#meta"
  },
  "default": {}
}

What does this look like for an author that wants a custom assertion keyword?

They'd have to create a keyword file:

{
  "$id": "https://my-company.com/keywords/minDate",
  "roles": [ "assertion" ],
  "name": "minDate",

  "type": "string",
  "format": "date-time"
}

(Note that this doesn't tell an implementation what to do with the keyword, just how to validate that it's being used correctly. The keyword still needs logic written to support it in an implementation.)

then a dialect file:

{
  "$id": "https://my-company.com/dialects/dates",
  "$keywords": [
      "https://my-company.com/keywords/minDate",
      "https://my-company.com/keywords/maxDate"
  ],

  "title": "Date/Time support"
}

then a meta-schema:

{
  "$schema": "https://my-company.com/meta/schema",
  "$id": "https://my-company.com/meta/schema",
  "$dialect": [
    "https://json-schema.org/dialects/core",
    "https://json-schema.org/dialects/applicator",
    "https://json-schema.org/dialects/unevaluated",
    "https://json-schema.org/dialects/validation",
    "https://json-schema.org/dialects/meta-data",
    "https://json-schema.org/dialects/format-annotation",
    "https://json-schema.org/dialects/content",
    "https://my-company.com/dialects/dates"
  ],
  "$dynamicAnchor": "meta",

  "title": "Core and Validation specifications meta-schema",
  "type": ["object", "boolean"]
}

If we allow implicit references, the $dialect and $keywords keywords can be considered arrays of schemas, which means:

Custom meta-schema with inlined dialect

{
  "$schema": "https://my-company.com/meta/schema",
  "$id": "https://my-company.com/meta/schema",
  "$dialect": [
    "https://json-schema.org/dialects/core",
    "https://json-schema.org/dialects/applicator",
    "https://json-schema.org/dialects/unevaluated",
    "https://json-schema.org/dialects/validation",
    "https://json-schema.org/dialects/meta-data",
    "https://json-schema.org/dialects/format-annotation",
    "https://json-schema.org/dialects/content",
    {
      "$id": "https://my-company.com/dialects/dates",
      "$keywords": [
        {
          "$id": "https://my-company.com/keywords/minDate",
          "roles": [ "assertion" ],
          "name": "minDate",

          "type": "string",
          "format": "date-time"
        },
        {
          "$id": "https://my-company.com/keywords/maxDate",
          "roles": [ "assertion" ],
          "name": "maxDate",

          "type": "string",
          "format": "date-time"
        }
      ],

      "title": "Date/Time support"
    }
  ],
  "$dynamicAnchor": "meta",

  "title": "Core and Validation specifications meta-schema",
  "type": ["object", "boolean"]
}
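For illustration, here is how an implementation might flatten either form (ID references or inlined documents) into a single keyword table. This is a Python sketch under assumptions: `registry` is a hypothetical in-memory map from `$id` to an already-loaded document, and every keyword file is assumed to carry an explicit `name`.

```python
def resolve(entry, registry):
    """An entry is either an ID string (looked up) or an inline document."""
    return registry[entry] if isinstance(entry, str) else entry


def collect_keywords(meta_schema, registry):
    """Map keyword name -> keyword file for everything `$dialect` pulls in."""
    keywords = {}
    for dialect_entry in meta_schema.get("$dialect", []):
        dialect = resolve(dialect_entry, registry)
        for keyword_entry in dialect.get("$keywords", []):
            keyword = resolve(keyword_entry, registry)
            keywords[keyword["name"]] = keyword
    return keywords
```

Under this reading, the referenced and inlined meta-schemas produce the same keyword table, so the choice between them is purely about how the author wants to organize files.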

What needs to be added to JSON Schema to do this?

We need two keywords, $dialect and $keywords, to support the infrastructure. Also, we may need keywords for whatever meta-data we want to define (e.g. roles and name in the keyword definition file).

I don't think it'd be too hard to define schemas for these keywords. I'd expect they'd be somewhat more restrictive than just the meta-schema. For example, $dialect would require a $keywords keyword.

gregsdennis commented 4 months ago

This will need to be moved into whatever vocabularies end up being. See #1510.