Open jdesrosiers opened 3 years ago
We should probably properly define a way to identify a dialect first. From 2020-12:
A dialect is defined as a set of vocabularies and their required support identified in a meta-schema.
It has no specific identifier, but perhaps the meta-schema $id
(specified by $schema
in the schema) could suffice since this effectively describes this set of vocabularies.
First,
absoluteKeywordLocation
is not required. - @jdesrosiers
The reason it's not required is because the relative location is sufficient when the navigated path contains no references.
That said, maybe it's sufficient to include the $schema
value and/or $vocabulary
value in the cases where there is ambiguity because
I do like the idea of including something in the output, though.
I think identifying a dialect is sufficiently defined.
The "$schema" keyword is both used as a JSON Schema dialect identifier and as the identifier of a resource which is itself a JSON Schema, which describes the set of valid schemas written for this particular dialect.
Let's not get into the choice to make absoluteKeywordLocation
required or not. It's not relevant to this issue and it's just going to get us off on a tangent argument. We can save that discussion for another time. For this issue, it's sufficient to say that even if it is present, it's still not a solution to this problem.
maybe it's sufficient to include the
$schema
value and/or$vocabulary
value in the cases where there is ambiguity
A vocabulary identifier is certainly an equivalent alternative to a dialect identifier, but I'd still prefer a keyword level identifier so we don't have a new version of every keyword for every draft even for the keywords that don't change. That would be a big help for annotation related tooling that I was working on a while ago (and hope to get back to soon).
I would prefer to always include this information rather than only when the dialect changes. Consider the case when I pass off my output to a third party library for processing. That third party library shouldn't need to know anything about the original schema and wouldn't know what dialect to process the schema as if it isn't in the output, even if the dialect doesn't change. You could pass the library a default dialect to use, but it would be easier for everyone involved if that information is just always there.
@jdesrosiers just make the keyword name a fragment on the vocabulary URI. We can declare the resources to be of media type application/schema-vocabulary+json
and the fragment semantics to be that each keyword is usable as a fragment.
(We should also really register application/schema+json
and if we want it application/schema-vocabulary+json
)
This should be mentioned over in #973.
@handrews
just make the keyword name a fragment on the vocabulary URI
That's exactly how I identify keywords now and it leaves me with the problem I described in my previous comment.
{
"$id": "https://example.com/contrived-example",
"allOf": [
{
"$id": "/foo",
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "Foo"
},
{
"$id": "/bar",
"$schema": "https://json-schema.org/draft/2019-09/schema",
"title": "Bar"
},
{
"$id": "/baz",
"$schema": "https://example.com/my-dialect",
"title": "Baz"
}
]
}
In this example, the title
keywords would have the identifiers, https://json-schema.org/draft/2020-12/schema#title
, https://json-schema.org/draft/2019-09/schema#title
, and https://example.com/my-dialect#title
respectively. Let's assume that title
in https://example.com/my-dialect
has different semantics that the standard JSON Schema semantics. If I have tooling that collects all the title
annotations, the only way to do that is to maintain configurations that says when I say "title" I mean https://json-schema.org/draft/2020-12/schema#title
or https://json-schema.org/draft/2019-09/schema#title
and not https://example.com/my-dialect#title
.
If each keyword has an identifier that doesn't change across drafts, we can avoid this problem. The title
keywords from the example would be identified as, https://json-schema.org/keywords#title
, https://json-schema.org/keywords#title
, and https://example.com/my-dialect#title
respectively. Now I can ask my tooling for all the https://json-schema.org/keywords#title
annotations and not have to worry about if draft 2020-12, draft 2019-09, and my-dialect each have different semantics for title
.
Of course it could be difficult to determine when an new keyword identifier should be minted. For example, if we changed description
to allow an array of strings instead of just a string, does that deserve an new identifier? If I want to collect all the description information, both keywords express that information so maybe they should have the same identifier. However, tooling would break if it expected a string and suddenly started getting an array of strings, so maybe they should have different identifiers.
This sounds like a bigger issue than just in the output. Sure, the output is probably the primary benefactor, but it sounds like this is more about adding a property to vocabularies (maybe in line with @handrews' idea of a machine-readable vocab file) than extending output specifically.
@gregsdennis I think I see what you mean. I'm open to discussing alternatives that also solve this problem.
This was also mentioned here.
I think restructuring the output based on subschemas instead of keywords solves this (related comment).
@jdesrosiers can you confirm whether your concerns here were covered in #1249?
can you confirm whether your concerns here were covered in https://github.com/json-schema-org/json-schema-spec/pull/1249?
No, I don't see how those changes address this problem. There's still no indication in the output unit of what dialect or vocab the keyword is defined in.
@jdesrosiers @gregsdennis can we for now settle on the dialect IRI as the thing to be added? Or if we want this issue to be about a more comprehensive identification (dialect+vocab+keyword), can we split out an issue for just dialect identification?
I question the need for dialect identification in the output.
The schema doesn't declare individual keywords by their vocab; it just uses the keyword. For example, there is no https://json-schema.org/draft/next/vocab/core#$id
; it's just $id
. A meta-schema that says "these keywords are available," then the schema just includes the keywords without any mention of what vocab that keyword came from. I don't see why it should be any different for the output.
If you really need to know what vocab a keyword came from, that information is traceable by going back to the schema that produced the output and looking at its meta-schema. I don't see the value in having that information in every instance of the output, especially not for every listing of a keyword.
there is nothing in the output format that indicates the dialect that was used to determine each output unit result.
There is: schemaLocation
gives the absolute location of the keyword. From that you can trace back to the schema and its meta-schema to get the dialect in use.
From the examples in the current doc:
{
"valid": true,
"evaluationPath": "",
"schemaLocation": "https://json-schema.org/schemas/example#",
"instanceLocation": "",
"annotations": {
"title": "root",
"properties": [
"foo",
"bar"
]
}
}
Properties are given here. Schema location is given here. You can look up the dialect if you want.
You may have identifiers for keywords in your implementaiton, but JSON Schema defines no such identifiers. I see no reason why the output should include identifiers for keywords when no such definition exists.
Properties are given here. Schema location is given here. You can look up the dialect if you want.
If I call an API that, somewhere on the back end, uses JSON Schema to evaluate some data and produce annotations, and then sends that annotation data back to me without sending me the schema, no, I can't just look up the dialect.
This is a common problem, and is why resources typically have "self" links even though the self link URI is probably the URI that you just got it from. Because sometimes you end up with a resource representation with no way to tell where it came from. It's also why it's a good idea to have an absolute $id
in a schema even if you don't have any relative URI-references in it.
We do not need the dialect URI in every output unit. We just need a map of the absolute URIs (without fragments) in schemaLocation
to dialect URIs. It can go at the top level, since the schemaLocation
is sufficient to determine which dialect applies to a given output unit.
Can someone please post an example of how they want this fit into the output?
You may have identifiers for keywords in your implementaiton, but JSON Schema defines no such identifiers. I see no reason why the output should include identifiers for keywords when no such definition exists.
Part of the proposal was to define URIs to identify each keyword. That would definitely be a requirement for adding a keyword URI in the output.
I still think this is the best solution in the long run because it would allow output format post-processors to know that one keyword has the same semantics as another. For example, someone could write a dialect that has the same semantics as the standard dialect except all the keywords are translated to Spanish. Tools that can process the standard dialect output should be able to process the Spanish dialect output as well because the semantics of the keywords should be the same. Of course this would require changes to the vocabulary system in addition to just defining keyword URIs, but I think this would be a good path to be on.
schemaLocation
gives the absolute location of the keyword. From that you can trace back to the schema and its meta-schema to get the dialect in use.
I mention in the issue description why I didn't think that was a good enough solution and @handrews reiterated that view point.
Can someone please post an example of how they want this fit into the output?
Honestly I haven't had a chance to fully think through how this would fit in to the new output format. We could figure out how to incorporate a dialect ID into what we have now, but I think that would be a partial solution. I'd like to explore incorporating the idea of keyword URIs to the vocabulary system first and then see what that means for the output format. We can probably just put this issue on hold for now.
This is almost exactly the example in your recent PR, except I changed the output units with an evaluation path prefix of /properties/bar/$ref
so that they jumped to a different schema resource (https://json-schema.org/schemas/other
) on that $ref
, to which I assigned a different dialect/version. Please disregard the fact that the output spec doesn't match the dialects and draft-07 didn't support annotations like this.
{
"valid": true,
"dialects": {
"https://json-schema.org/schemas/example": "https://json-schema.org/draft/2020-12/schema",
"https://json-schema.org/schemas/other": "https://json-schema.org/draft-07/schema"
},
"nested": [
{
"valid": true,
"evaluationPath": "",
"schemaLocation": "https://json-schema.org/schemas/example#",
"instanceLocation": "",
"annotations": {
"title": "root",
"properties": [
"foo",
"bar"
]
}
},
{
"valid": true,
"evaluationPath": "/properties/foo/allOf/1",
"schemaLocation": "https://json-schema.org/schemas/example#/properties/foo/allOf/1",
"instanceLocation": "/foo",
"annotations": {
"title": "foo-title",
"properties": [
"foo-prop"
],
"additionalProperties": [
"unspecified-prop"
]
}
},
{
"valid": true,
"evaluationPath": "/properties/bar/$ref",
"schemaLocation": "https://json-schema.org/schemas/other#/$defs/bar",
"instanceLocation": "/bar",
"annotations": {
"title": "bar-title",
"properties": [
"bar-prop"
]
}
},
{
"valid": true,
"evaluationPath": "/properties/foo/allOf/1/properties/foo-prop",
"schemaLocation": "https://json-schema.org/schemas/example#/properties/foo/allOf/1/properties/foo-prop",
"instanceLocation": "/foo/foo-prop",
"annotations": {
"title": "foo-prop-title"
}
},
{
"valid": true,
"evaluationPath": "/properties/bar/$ref/properties/bar-prop",
"schemaLocation": "https://json-schema.org/schemas/other#/$defs/bar/properties/bar-prop",
"instanceLocation": "/bar/bar-prop",
"annotations": {
"title": "bar-prop-title"
}
}
]
}
@gregsdennis the above example is only for the dialect IRI. I'm not currently concerned with anything more granular (hence me asking if we can either agree to just do this, or split this out to its own issue if folks want to keep working on the more granular thing here).
So now if I want to know the dialect for that last output unit, I look at "schemaLocation": "https://json-schema.org/schemas/other#"
, chop off the fragment, and look it up as something like output["dialects"]["https://json-schema.org/schemas/other"]
🤦 I initially copy-pasted the diff instead of the changed file, so the example two comments above was a mess. It's fixed now.
I think that's doable, but defining how dialects are declared seem to still be an open topic on Keyword for identifying bootstrapping rules #217. We should nail that down before deciding how to put it in the output.
@gregsdennis I am working on a replacement proposal for #217, but that depends on how some things regarding the SDLC play out.
For now, the important part is to get it into the output format. We can update the source of the dialect IRI as needed as we change that. If we change it.
For now, the important part is to get it into the output format. We can update the source of the dialect IRI as needed as we change that. If we change it.
My point is that we don't have the "dialect IRI" defined as a concept. We seem to be using the meta-schema IRI, but is that really the same thing? It doesn't make sense to include something that isn't defined yet.
Yes, it's definitely the same thing. I thought that was made clear in the spec. If not, that's just an oversight.
Okay. I still want to hold this open until process is defined. Less to think about.
Coming back to this, the idea has warmed on me.
I like the idea of the root output unit not actually being an output unit but a host for other evaluation data with the actual evaluation results in the details
property.
From this comment and the one that follows it, I think it makes sense to have keywords identified. This aligns with the ideas to resolve keyword collisions (also from that discussion) in that their output need to be reported somehow.
I imagine output will go through some more refinement soon.
@jdesrosiers would it make sense to only include dialect
in nodes where the dialect changes, or would you have it in every node?
It could work either way, but I would include it in every node. I feel like if I was writing code to process this output, it would be annoying if I had to look in two places to determine the dialect. It would be easier if it was just always in the same place when it's needed.
Yeah, thinking about it again, it would only make any sense at all to omit it in some nodes for the hierarchical format. The list format would still need it in every node.
A schema using dialect-a can
$ref
a schema using dialect-b but, there is nothing in the output format that indicates the dialect that was used to determine each output unit result. This can be important for keywords whose semantics have changed over time, but their name hasn't changed. The ones that come to mind immediately aremaximum
/exclusiveMaximum
,minimum
/exclusiveMinimum
,format
, andlinks
.We could potentially determine the dialect for an output unit by using the
absouteKeywordLocation
to lookup the schema and inspect it's$schema
property, but that has two problems. First,absoluteKeywordLocation
is not required. Second, if the output is passed to a third party for processing, it might not have access to the original schema to look up it's dialect.The simplest approach would be to add a field such as
dialect
to the output unit. However, It would be more useful to have a unique identifier for each keyword. That way, when we introduce a new draft, if a keyword doesn't change, it's ID doesn't have to change. That would make it easier for tools to do things with annotations where they don't care what dialect they are as long as they have the same semantics (such astitle
anddescription
).