Add keyword id to output unit

jdesrosiers commented 3 years ago

A schema using dialect-a can $ref a schema using dialect-b but, there is nothing in the output format that indicates the dialect that was used to determine each output unit result. This can be important for keywords whose semantics have changed over time, but their name hasn't changed. The ones that come to mind immediately are maximum/exclusiveMaximum, minimum/exclusiveMinimum, format, and links.

We could potentially determine the dialect for an output unit by using the absouteKeywordLocation to lookup the schema and inspect it's $schema property, but that has two problems. First, absoluteKeywordLocation is not required. Second, if the output is passed to a third party for processing, it might not have access to the original schema to look up it's dialect.

The simplest approach would be to add a field such as dialect to the output unit. However, It would be more useful to have a unique identifier for each keyword. That way, when we introduce a new draft, if a keyword doesn't change, it's ID doesn't have to change. That would make it easier for tools to do things with annotations where they don't care what dialect they are as long as they have the same semantics (such as title and description).

gregsdennis commented 3 years ago

We should probably properly define a way to identify a dialect first. From 2020-12:

A dialect is defined as a set of vocabularies and their required support identified in a meta-schema.

It has no specific identifier, but perhaps the meta-schema $id (specified by $schema in the schema) could suffice since this effectively describes this set of vocabularies.

First, absoluteKeywordLocation is not required. - @jdesrosiers

The reason it's not required is because the relative location is sufficient when the navigated path contains no references.

That said, maybe it's sufficient to include the $schema value and/or $vocabulary value in the cases where there is ambiguity because

a dereference takes you to a new dialect (different meta-schema),
a keyword contains multiple semantic definitions due to the specific set of vocabularies include (this also brings up the namespacing concept we had discussed somewhere else).

I do like the idea of including something in the output, though.

jdesrosiers commented 3 years ago

I think identifying a dialect is sufficiently defined.

The "$schema" keyword is both used as a JSON Schema dialect identifier and as the identifier of a resource which is itself a JSON Schema, which describes the set of valid schemas written for this particular dialect.

Let's not get into the choice to make absoluteKeywordLocation required or not. It's not relevant to this issue and it's just going to get us off on a tangent argument. We can save that discussion for another time. For this issue, it's sufficient to say that even if it is present, it's still not a solution to this problem.

maybe it's sufficient to include the $schema value and/or $vocabulary value in the cases where there is ambiguity

A vocabulary identifier is certainly an equivalent alternative to a dialect identifier, but I'd still prefer a keyword level identifier so we don't have a new version of every keyword for every draft even for the keywords that don't change. That would be a big help for annotation related tooling that I was working on a while ago (and hope to get back to soon).

I would prefer to always include this information rather than only when the dialect changes. Consider the case when I pass off my output to a third party library for processing. That third party library shouldn't need to know anything about the original schema and wouldn't know what dialect to process the schema as if it isn't in the output, even if the dialect doesn't change. You could pass the library a default dialect to use, but it would be easier for everyone involved if that information is just always there.

handrews commented 3 years ago

@jdesrosiers just make the keyword name a fragment on the vocabulary URI. We can declare the resources to be of media type application/schema-vocabulary+json and the fragment semantics to be that each keyword is usable as a fragment.

(We should also really register application/schema+json and if we want it application/schema-vocabulary+json)

gregsdennis commented 3 years ago

This should be mentioned over in #973.

jdesrosiers commented 3 years ago

@handrews

just make the keyword name a fragment on the vocabulary URI

That's exactly how I identify keywords now and it leaves me with the problem I described in my previous comment.

{
  "$id": "https://example.com/contrived-example",
  "allOf": [
    {
      "$id": "/foo",
      "$schema": "https://json-schema.org/draft/2020-12/schema",
      "title": "Foo"
    },
    {
      "$id": "/bar",
      "$schema": "https://json-schema.org/draft/2019-09/schema",
      "title": "Bar"
    },
    {
      "$id": "/baz",
      "$schema": "https://example.com/my-dialect",
      "title": "Baz"
    }
  ]
}

In this example, the title keywords would have the identifiers, https://json-schema.org/draft/2020-12/schema#title, https://json-schema.org/draft/2019-09/schema#title, and https://example.com/my-dialect#title respectively. Let's assume that title in https://example.com/my-dialect has different semantics that the standard JSON Schema semantics. If I have tooling that collects all the title annotations, the only way to do that is to maintain configurations that says when I say "title" I mean https://json-schema.org/draft/2020-12/schema#title or https://json-schema.org/draft/2019-09/schema#title and not https://example.com/my-dialect#title.

If each keyword has an identifier that doesn't change across drafts, we can avoid this problem. The title keywords from the example would be identified as, https://json-schema.org/keywords#title, https://json-schema.org/keywords#title, and https://example.com/my-dialect#title respectively. Now I can ask my tooling for all the https://json-schema.org/keywords#title annotations and not have to worry about if draft 2020-12, draft 2019-09, and my-dialect each have different semantics for title.

Of course it could be difficult to determine when an new keyword identifier should be minted. For example, if we changed description to allow an array of strings instead of just a string, does that deserve an new identifier? If I want to collect all the description information, both keywords express that information so maybe they should have the same identifier. However, tooling would break if it expected a string and suddenly started getting an array of strings, so maybe they should have different identifiers.

gregsdennis commented 3 years ago

This sounds like a bigger issue than just in the output. Sure, the output is probably the primary benefactor, but it sounds like this is more about adding a property to vocabularies (maybe in line with @handrews' idea of a machine-readable vocab file) than extending output specifically.

jdesrosiers commented 3 years ago

@gregsdennis I think I see what you mean. I'm open to discussing alternatives that also solve this problem.

gregsdennis commented 2 years ago

This was also mentioned here.

I think restructuring the output based on subschemas instead of keywords solves this (related comment).

gregsdennis commented 2 years ago

@jdesrosiers can you confirm whether your concerns here were covered in #1249?

jdesrosiers commented 2 years ago

can you confirm whether your concerns here were covered in https://github.com/json-schema-org/json-schema-spec/pull/1249?

No, I don't see how those changes address this problem. There's still no indication in the output unit of what dialect or vocab the keyword is defined in.

handrews commented 2 years ago

@jdesrosiers @gregsdennis can we for now settle on the dialect IRI as the thing to be added? Or if we want this issue to be about a more comprehensive identification (dialect+vocab+keyword), can we split out an issue for just dialect identification?

gregsdennis commented 2 years ago

I question the need for dialect identification in the output.

The schema doesn't declare individual keywords by their vocab; it just uses the keyword. For example, there is no https://json-schema.org/draft/next/vocab/core#$id; it's just $id. A meta-schema that says "these keywords are available," then the schema just includes the keywords without any mention of what vocab that keyword came from. I don't see why it should be any different for the output.

If you really need to know what vocab a keyword came from, that information is traceable by going back to the schema that produced the output and looking at its meta-schema. I don't see the value in having that information in every instance of the output, especially not for every listing of a keyword.

there is nothing in the output format that indicates the dialect that was used to determine each output unit result.

There is: schemaLocation gives the absolute location of the keyword. From that you can trace back to the schema and its meta-schema to get the dialect in use.

From the examples in the current doc:

    {
      "valid": true,
      "evaluationPath": "",
      "schemaLocation": "https://json-schema.org/schemas/example#",
      "instanceLocation": "",
      "annotations": {
        "title": "root",
        "properties": [
          "foo",
          "bar"
        ]
      }
    }

Properties are given here. Schema location is given here. You can look up the dialect if you want.

You may have identifiers for keywords in your implementaiton, but JSON Schema defines no such identifiers. I see no reason why the output should include identifiers for keywords when no such definition exists.

handrews commented 2 years ago

Properties are given here. Schema location is given here. You can look up the dialect if you want.

If I call an API that, somewhere on the back end, uses JSON Schema to evaluate some data and produce annotations, and then sends that annotation data back to me without sending me the schema, no, I can't just look up the dialect.

This is a common problem, and is why resources typically have "self" links even though the self link URI is probably the URI that you just got it from. Because sometimes you end up with a resource representation with no way to tell where it came from. It's also why it's a good idea to have an absolute $id in a schema even if you don't have any relative URI-references in it.

We do not need the dialect URI in every output unit. We just need a map of the absolute URIs (without fragments) in schemaLocation to dialect URIs. It can go at the top level, since the schemaLocation is sufficient to determine which dialect applies to a given output unit.

gregsdennis commented 2 years ago

Can someone please post an example of how they want this fit into the output?

jdesrosiers commented 2 years ago

You may have identifiers for keywords in your implementaiton, but JSON Schema defines no such identifiers. I see no reason why the output should include identifiers for keywords when no such definition exists.

Part of the proposal was to define URIs to identify each keyword. That would definitely be a requirement for adding a keyword URI in the output.

I still think this is the best solution in the long run because it would allow output format post-processors to know that one keyword has the same semantics as another. For example, someone could write a dialect that has the same semantics as the standard dialect except all the keywords are translated to Spanish. Tools that can process the standard dialect output should be able to process the Spanish dialect output as well because the semantics of the keywords should be the same. Of course this would require changes to the vocabulary system in addition to just defining keyword URIs, but I think this would be a good path to be on.

schemaLocation gives the absolute location of the keyword. From that you can trace back to the schema and its meta-schema to get the dialect in use.

I mention in the issue description why I didn't think that was a good enough solution and @handrews reiterated that view point.

Can someone please post an example of how they want this fit into the output?

Honestly I haven't had a chance to fully think through how this would fit in to the new output format. We could figure out how to incorporate a dialect ID into what we have now, but I think that would be a partial solution. I'd like to explore incorporating the idea of keyword URIs to the vocabulary system first and then see what that means for the output format. We can probably just put this issue on hold for now.

handrews commented 2 years ago

This is almost exactly the example in your recent PR, except I changed the output units with an evaluation path prefix of /properties/bar/$ref so that they jumped to a different schema resource (https://json-schema.org/schemas/other) on that $ref, to which I assigned a different dialect/version. Please disregard the fact that the output spec doesn't match the dialects and draft-07 didn't support annotations like this.

{
  "valid": true,
  "dialects": {
    "https://json-schema.org/schemas/example": "https://json-schema.org/draft/2020-12/schema",
    "https://json-schema.org/schemas/other": "https://json-schema.org/draft-07/schema"
  },
  "nested": [
    {
      "valid": true,
      "evaluationPath": "",
      "schemaLocation": "https://json-schema.org/schemas/example#",
      "instanceLocation": "",
      "annotations": {
        "title": "root",
        "properties": [
          "foo",
          "bar"
        ]
      }
    },
    {
      "valid": true,
      "evaluationPath": "/properties/foo/allOf/1",
      "schemaLocation": "https://json-schema.org/schemas/example#/properties/foo/allOf/1",
      "instanceLocation": "/foo",
      "annotations": {
        "title": "foo-title",
        "properties": [
          "foo-prop"
        ],
        "additionalProperties": [
          "unspecified-prop"
        ]
      }
    },
    {
      "valid": true,
      "evaluationPath": "/properties/bar/$ref",
      "schemaLocation": "https://json-schema.org/schemas/other#/$defs/bar",
      "instanceLocation": "/bar",
      "annotations": {
        "title": "bar-title",
        "properties": [
          "bar-prop"
        ]
      }
    },
    {
      "valid": true,
      "evaluationPath": "/properties/foo/allOf/1/properties/foo-prop",
      "schemaLocation": "https://json-schema.org/schemas/example#/properties/foo/allOf/1/properties/foo-prop",
      "instanceLocation": "/foo/foo-prop",
      "annotations": {
        "title": "foo-prop-title"
      }
    },
    {
      "valid": true,
      "evaluationPath": "/properties/bar/$ref/properties/bar-prop",
      "schemaLocation": "https://json-schema.org/schemas/other#/$defs/bar/properties/bar-prop",
      "instanceLocation": "/bar/bar-prop",
      "annotations": {
        "title": "bar-prop-title"
      }
    }
  ]
}

handrews commented 2 years ago

@gregsdennis the above example is only for the dialect IRI. I'm not currently concerned with anything more granular (hence me asking if we can either agree to just do this, or split this out to its own issue if folks want to keep working on the more granular thing here).

So now if I want to know the dialect for that last output unit, I look at "schemaLocation": "https://json-schema.org/schemas/other#", chop off the fragment, and look it up as something like output["dialects"]["https://json-schema.org/schemas/other"]

handrews commented 2 years ago

🤦 I initially copy-pasted the diff instead of the changed file, so the example two comments above was a mess. It's fixed now.

gregsdennis commented 2 years ago

I think that's doable, but defining how dialects are declared seem to still be an open topic on Keyword for identifying bootstrapping rules #217. We should nail that down before deciding how to put it in the output.

handrews commented 2 years ago

@gregsdennis I am working on a replacement proposal for #217, but that depends on how some things regarding the SDLC play out.

For now, the important part is to get it into the output format. We can update the source of the dialect IRI as needed as we change that. If we change it.

gregsdennis commented 2 years ago

For now, the important part is to get it into the output format. We can update the source of the dialect IRI as needed as we change that. If we change it.

My point is that we don't have the "dialect IRI" defined as a concept. We seem to be using the meta-schema IRI, but is that really the same thing? It doesn't make sense to include something that isn't defined yet.

handrews commented 2 years ago

Yes, it's definitely the same thing. I thought that was made clear in the spec. If not, that's just an oversight.

gregsdennis commented 2 years ago

Okay. I still want to hold this open until process is defined. Less to think about.

gregsdennis commented 1 year ago

Coming back to this, the idea has warmed on me.

I like the idea of the root output unit not actually being an output unit but a host for other evaluation data with the actual evaluation results in the details property.

From this comment and the one that follows it, I think it makes sense to have keywords identified. This aligns with the ideas to resolve keyword collisions (also from that discussion) in that their output need to be reported somehow.

I imagine output will go through some more refinement soon.

gregsdennis commented 9 months ago

@jdesrosiers would it make sense to only include dialect in nodes where the dialect changes, or would you have it in every node?

jdesrosiers commented 9 months ago

It could work either way, but I would include it in every node. I feel like if I was writing code to process this output, it would be annoying if I had to look in two places to determine the dialect. It would be easier if it was just always in the same place when it's needed.

gregsdennis commented 9 months ago

Yeah, thinking about it again, it would only make any sense at all to omit it in some nodes for the hierarchical format. The list format would still need it in every node.

json-schema-org / json-schema-spec

Add keyword id to output unit #1065