Recently, we worked on formalizing the process for embedding schemas of different drafts than the parent schema. If the output format is changed, how would output for embedded schemas work? Currently, it's not a problem to just use 2019-09 for everything because none of the other drafts specify an output format.
I think it should definitely not change output format mid-validation. The output format of the top level schema should be used for the whole validation. I don't think this is controversial, but it is a detail that we'll have to address as we make changes to the output format.
Not sure if I should be piling on here or creating another issue, but this seems like a good time to bring up something that bothers me about the output format.
Currently `keywordLocation` is described as a required property (technically a SHOULD, but described as required) while `absoluteKeywordLocation` is described as optional. I think this is backwards. Since draft 2019-09, the value of a schema with a JSON Pointer that crosses a document boundary is undefined. That means that `keywordLocation` as defined doesn't properly identify a keyword.
I can see the value of this keyword for human debugging, so I'm not proposing throwing it out entirely, but it should at least be optional and not a URI (just a pointer).
In my implementation, I don't include `keywordLocation` in the output. Because it's not a true identifier, it would actually be fairly difficult (awkward and messy at the least) for me to implement.
In trying to pare down the list of nodes for the `detailed` and `basic` formats, I found that a lot of nodes were kept that may not be necessary.
I don't think the output format should trim anything. Nodes that are superfluous in one context might be helpful in another. Producing human readable output should be a process that takes the output as an argument and transforms it into something nice. https://github.com/atlassian/better-ajv-errors is an example of this concept except it uses the ajv proprietary output format rather than the standard one. I'm hoping they will come out with a version that supports the standard output format so I don't have to make my own.
If we trim the output results to improve one way of consuming the output, we might be limiting what can be done for other ways of consuming output. For example, imagine a regex101.com -like online schema validator. It might find some of those "redundant" nodes useful to build visualizations where they are just noise for text-based human reporting.
Therefore, I don't think the spec should be defining an algorithm to reduce returned nodes. We can leave that to third party tools to decide what works in their domain. It's definitely a problem we as implementors need to work on solving, but not at the spec level.
Currently `keywordLocation` is described as a required property (technically a SHOULD, but described as required) while `absoluteKeywordLocation` is described as optional. - @jdesrosiers
This is correct as is. You will always have a location relative to the root schema, which is what `keywordLocation` gives. However, you will only have an absolute location if the root schema defines an absolute URI for `$id`. Since `$id` is optional in the schema, it makes sense for `absoluteKeywordLocation` to be optional in the output.
Additionally, `keywordLocation` gives you how you got to the keyword, including dereferences, which is generally more useful for debugging than just where the keyword is, which is what `absoluteKeywordLocation` gives you.
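To illustrate the difference (a sketch only; the schema URI, pointers, and message are invented), an error unit for a keyword reached through a `$ref` might look like this, where `keywordLocation` records the dynamic path including the `$ref` step and `absoluteKeywordLocation` resolves into the referenced resource:

```json
{
  "valid": false,
  "keywordLocation": "/properties/point/$ref/required",
  "absoluteKeywordLocation": "https://example.com/point#/required",
  "instanceLocation": "/point",
  "error": "Required property 'y' not found."
}
```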
If we trim the output results to improve one way of consuming the output, we might be limiting what can be done for other ways of consuming output. - @jdesrosiers
This is why we're discussing it. A lot of times, you just want the problem. Other times, you'd like the details. We need to isolate the rules around these scenarios. @ssilverman has some logic that he's implemented in his library already that may serve as a starting point.
Additionally, keywordLocation gives you how you got to the keyword, including dereferences, which is generally more useful for debugging than just where the keyword is, which is what absoluteKeywordLocation gives you.
In my implementation, I omit absoluteKeywordLocation from the error/annotation output iff it is identical to keywordLocation (that is, if no $ref was traversed while getting to this location, AND there is no canonical URI provided for the schema) -- I agree that it is helpful to have both pieces of information.
If we trim the output results to improve one way of consuming the output, we might be limiting what can be done for other ways of consuming output.
This is why we're discussing it.
I think you're missing the point. I'm saying that the output format should be like an assembly language. It's a standardized, low level representation of the validation result. Starting from that assembly language, people can build whatever they need that is appropriate for their domain. Since it's impossible for us to predict every way someone might want to use the output format, I think this is the only feasible path.
You will always have a location relative to the root schema, which is what `keywordLocation` gives. However, you will only have an absolute location if the root schema defines an absolute URI for `$id`.
This sounds like a great argument for why an id should not be optional (not necessarily with `$id`). Otherwise, the output format has no way to properly identify the things it's describing. It requires a human to connect the dots and therefore limits the information I can get from the output programmatically. The project I'm working on now would not be possible without an identifier for every schema. The output format should be optimized for machine readability, not human readability. The human readable version can be generated from the machine readable version. (In my implementation I designed the API such that I always have an identifier. The only way to use a schema is to use its identifier. So, I have no issue with leaving out `keywordLocation` because I always have an identifier.)
I'd love to see `keywordLocation` optional or even removed, but I'd be happy with changing it to require at least one of `keywordLocation` or `absoluteKeywordLocation` rather than require just `keywordLocation`.
The project I'm working on now would not be possible without an identifier for every schema
That your scenario requires an ID doesn't imply that every scenario will. Relative location has its importance in how you got to the keyword, not just where the keyword is. This is why `keywordLocation` is required.
I think you're missing the point.
I'm just saying that it needs to be discussed. I'm open to discussion, and some of your points are valid. Let's please keep the tone of this conversation open.
If you have structural suggestions, please share them. Also keep in mind the months of conversational background that led to the current state. There are requirements that we were dealing with that your use case may not have.
It's a standardized, low level representation of the validation result.
This is precisely the approach we took. What is now in the spec evolved from that.
For example, in an error scenario, is it necessary to list passing subschemas of an `allOf`?
Why is `allOf` producing any annotations at all? The spec says: "[These applicator keywords] have no direct impact on annotation collection, although they enable the same annotation keyword to be applied to an instance location with different values. Annotation keywords define their own rules for combining such values." My interpretation of this is that `allOf` creates no annotations - it merely passes along the annotations from the (passing) subschemas that it contains.
PS. @gregsdennis, in your original problem description above, can you clarify which output format you are discussing?
That your scenario requires an ID doesn't imply that every scenario will.
The point is that the output format needs to be a common denominator for any application that someone might want to use it for. If my use case needs it, others will too. If the output format doesn't require it, my use case and many others won't be compatible with standard validators. `keywordLocation`, on the other hand, can be derived from the standard "detailed" output if everything has an identifier. It can be left out and there are no use cases that would be incompatible with a standard validator input. That's why it can be optional.
Let's please keep the tone of this conversation open.
I'm not sure what you mean by that, but if any of what I've written has come across as rude or aggressive or inappropriate in any way, I apologize. It truly was not my intention. I know I can be stubborn, but I only persist when I'm not getting my point across. If we understand each other and still don't agree on a resolution, I'm happy to drop it.
Also keep in mind the months of conversational background that led to the current state. There are requirements that we were dealing with that your use case may not have.
I followed along with the conversation and I have the utmost respect for all the effort everyone put in. At the time, I wasn't in a place to contribute to that particular conversation. I'm possibly the first person to try to do something with the output other than reporting validation results. I bring up my use case not to make this about me, but to show an example of where the current output format is insufficient. I think that's valuable feedback, but if it's not helpful, I'm happy to drop it.
Why is `allOf` producing any annotations at all? - @karenetheridge
This question isn't saying that `allOf` would be generating, or even carrying, annotations. It's merely asking whether the passing subschemas of a failed `allOf` need to be included in the output. In fact, it explicitly would not return any annotations (generated or passed along) because it failed.
I'm asking for an overhaul of the output. I think the structure of the node as well as the tree structure (of each format) of the output should be revisited. Question 1 looks at the node structure, whereas questions 2 & 3 look more at the tree structure.
`keywordLocation`, on the other hand, can be derived from the standard "detailed" output if everything has an identifier.
I don't think this is true in the general case. For instance, in the case of a recursive schema, just knowing the absolute location (definition site) of the keyword which failed doesn't let you know at which depth of recursion it was reached. This could perhaps be inferred from the tree structure, but this seems much more error-prone (and non-serializable, meaning you can't just pass a node somewhere else in the program without losing this information). In this vein you might just as well argue that `absoluteKeywordLocation` can be derived from the other one, if you traverse the schemas and resolve `$ref`s accordingly.
I'd definitely leave `keywordLocation` intact. Perhaps it could be specified that `absoluteKeywordLocation` is also mandatory for nodes stemming from schemas which do contain `$id`?
Perhaps it could be specified that absoluteKeywordLocation is also mandatory for nodes stemming from schemas which do contain $id ?
`absoluteKeywordLocation` is already mandatory if the path contains a `$ref` or `$recursiveRef`.
Throwing thoughts in:
I like the nomenclature of "terminal" and "non-terminal".
Your second point is my first in the original post.
In a Slack discussion with @karenetheridge, we figured out that we also need a way to identify when a `false` schema is the thing causing the validation failure.
Normally, it's keywords that are generating validation errors. So pointers to the error source generally end in keywords, e.g. `#/properties/myProp/maximum`. But suppose we had something like this:
{
"properties": {
"falseProp": false
}
}
There is no keyword here. The source of the error is the `false` schema itself. To say that `falseProp` is the source of the error isn't wrong, but it's not quite right either.
@karenetheridge suggested using a pointer to the property and a message like "subschema is false." It should be fairly clear that it's the `false` subschema that's generating the error, not the property.
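A sketch of what that suggestion might look like as an output unit (field names follow the current spec; the message wording is only an example):

```json
{
  "valid": false,
  "keywordLocation": "/properties/falseProp",
  "instanceLocation": "/falseProp",
  "error": "Subschema is false."
}
```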
In JsonSchema.Net, I've used a special `$false` to indicate this, but this was out of a lack of direction more than anything. I wanted to somehow identify that it's not a property named "false", so I followed the `$xxx` format of the standard keywords. To me this felt more proper since it's technically the `false` subschema that's generating the error, not the applicator that contains it. It's an idea, anyway.
I don't particularly have strong feelings on how, but this is a special case that the spec needs to clarify.
Regarding false schemas ...
In my implementation, validation is implemented as just another keyword. It therefore has a node in the validation output just like any other keyword. The top level node is a "validate" keyword and so is any sub-schema that's involved in the validation.
Therefore, it is a keyword that generates the validation error even when the schema is `false`. It doesn't matter that there are no keywords in the schema because validating the schema itself is a keyword.
Here's the output from the "falseProp" schema from the previous comment. Notice that the thing that fails is validating the schema, but its errors array is empty because there are no keywords in that schema.
{
"keyword": "https://json-schema.org/draft/2019-09/schema#validate",
"absoluteKeywordLocation": "https://json-schema.hyperjump.io/schema1#",
"instanceLocation": "#",
"valid": false,
"errors": [
{
"keyword": "https://json-schema.org/draft/2019-09/schema#properties",
"absoluteKeywordLocation": "https://json-schema.hyperjump.io/schema1#/properties",
"instanceLocation": "#",
"valid": false,
"errors": [
{
"keyword": "https://json-schema.org/draft/2019-09/schema#validate",
"absoluteKeywordLocation": "https://json-schema.hyperjump.io/schema1#/properties/falseProp",
"instanceLocation": "#/falseProp",
"valid": false,
"errors": []
}
]
}
]
}
This approach is also useful with the `not` keyword where the same keyword can be both true in the context of validation, but false in the context of the keyword. Notice in the following example that the same schema location has two nodes. One node is for the `not` keyword and the other is for the `validate` pseudo-keyword.
{ "not": false }
{
"keyword": "https://json-schema.org/draft/2019-09/schema#validate",
"absoluteKeywordLocation": "https://json-schema.hyperjump.io/schema1#",
"instanceLocation": "#",
"valid": true,
"errors": [
{
"keyword": "https://json-schema.org/draft/2019-09/schema#not",
"absoluteKeywordLocation": "https://json-schema.hyperjump.io/schema1#/not",
"instanceLocation": "#",
"valid": true,
"errors": [
{
"keyword": "https://json-schema.org/draft/2019-09/schema#validate",
"absoluteKeywordLocation": "https://json-schema.hyperjump.io/schema1#/not",
"instanceLocation": "#",
"valid": false,
"errors": []
}
]
}
]
}
@jdesrosiers, I think you and @karenetheridge are following the same concept regarding identifying the `false` schema: the path leads up to the JSON property that contains the `false` schema but doesn't include the `false` itself.
I think it's worth defining a practice for the `true` and `false` schemas.
Regarding schemas that are `true` and `false`, why do we need an extra level in the location (re. the "$false" mention above)? The parent is enough to tell the schema user where the problem is. For example, in:
{
"properties": {
"falseProp": false
}
}
The location of that schema is (using JSON pointer notation) "/properties/falseProp". Yes, "falseProp" isn't a keyword, but it's certainly the JSON location of that schema. i.e. it's the "schema container name".
@ssilverman I wasn't saying that we do need it. That's what my implementation does to identify that the error is coming from the `false` schema. More technically, I implement the `false` as an actual `JsonSchema` instance that just rejects everything, so it literally is the source of the error. Otherwise, I'd have to check for `false` in every applicator. It's easier (as the one writing the code, not necessarily the one consuming it) if the error just comes from the `false` schema itself.
It seems that the consensus is that we merely need to report up to but not including the `false`.
Some more food for thought: Instead of calling the error output locations "keywords", why not just use "schemaLocation" and "absoluteSchemaLocation" instead? This way, we won't get stuck on these locations actually pointing to "keywords". For example, we'd have:
"schemaLocation": "/properties/falseProp",
"absoluteSchemaLocation": "scheme://host/example.json#/properties/falseProp"
From:
{
"properties": {
"falseProp": false
}
}
I know I was stuck on restricting my output to only keywords, but in fact, it turns out that's not useful, especially for non-keywords wrapping a schema (in this example; "falseProp" isn't a keyword, yet it's useful to describe the error location).
[Update] Because of the use of the word "keyword", for this case, I would simply say that "properties" failed and my message contained something like, "failed in 'falseProp'", which isn't that helpful if you can't parse error messages and if there's lots of properties.
I suppose my solution is a middle ground. Every node in the output represents a keyword, but in the case of the pseudo-keyword "validate", the name of the keyword doesn't match the location. "falseProp" is treated like an alias for "validate". So, `/properties/falseProp` is pointing to a keyword, you just can't tell from looking at the pointer that the keyword it's pointing to is actually the "validate" keyword.
Some more food for thought: Instead of calling the error output locations "keywords", why not just use "schemaLocation" and "absoluteSchemaLocation" instead?
FWIW, I have a 'normalize_evaluation_result' method in my $work application that does exactly that -- the structure returned to the user in error responses uses the properties "data_location", "schema_location" and "absolute_schema_location" :)
I think it's worth summarizing the tweaks we have so far (or at least what I have in my head):

- `nestedResults` (still open to naming). The purpose of this is to have the same property for both passing and failing result sets.
- `annotations` property (same as current), but it just needs to be an array of values since the locations are provided by the node.
- `instanceLocation` (@karenetheridge suggested `data_location`, but "instance" is canonically used within the spec, and I prefer consistency)
- `schemaLocation`
- `absoluteSchemaLocation` - only if a `$ref`-like keyword is present
- `valid` required at each node
- `error` as required by the keyword (the same way that keywords specify their annotations), but wording is implementation-specific (preferably configurable to allow for custom messages or internationalization; mine supports strings with predefined tokens for value inclusion as well)

Also, the algorithms to determine which nodes need to be kept for the `basic` and `detailed` formats need tweaking.
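For concreteness, a hedged sketch of what a single failing output unit might look like under these tweaks (property names follow the list above; the schema URI, pointers, and message are invented):

```json
{
  "valid": false,
  "instanceLocation": "/1",
  "schemaLocation": "/items/$ref/required",
  "absoluteSchemaLocation": "https://example.com/polygon#/$defs/point/required",
  "error": "Required property 'y' not found."
}
```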
In my implementations, it was easiest to build the full verbose output structure, then pare down using a set of rules. I had tried to summarize those rules in the spec, but I think it's a bit hard to follow outside of the context of reading my code.
- `basic` is essentially a collection of leaf nodes. Intermediate nodes should be removed and the leaves collected into a list nested under the first node. If the nested list consists of a single node, that node is used as the root.
- `detailed` condenses single-node paths to merely the leaf node. Nodes whose `valid` value doesn't match their immediate parent node's are removed*, e.g. false results of passing "any", true results of failing "all".

\* `not` is an interesting case for the second reduction rule since it will always have a single node that doesn't match its `valid` value. If `not` is valid, then the subschema isn't, and it's probably important to know why, meaning we'd want to get the subschema results, which means we don't want to remove its child, even though that violates the second reduction rule. If `not` is invalid, then its subschema is valid and it doesn't make much sense to include that node, but it doesn't hurt, either. In either case, it seems that the best behavior for `not` is to always keep its child.

The fact that we have a special case in the behavior of `not` implies that we need to account for special cases. Perhaps this can be specified by the keyword, like error messages?
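As a rough illustration of the `basic` reduction (a sketch only; schema, pointers, and messages are invented), a failing `allOf` with two failing subschemas would collapse to just the leaf nodes collected under the root:

```json
{
  "valid": false,
  "keywordLocation": "",
  "instanceLocation": "",
  "errors": [
    {
      "valid": false,
      "keywordLocation": "/allOf/0/type",
      "instanceLocation": "",
      "error": "Value is a number but should be a string."
    },
    {
      "valid": false,
      "keywordLocation": "/allOf/1/maximum",
      "instanceLocation": "",
      "error": "6 is greater than the maximum of 5."
    }
  ]
}
```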
It was suggested in Slack by @ssilverman that following the schema structure is not necessarily useful in cases where authoring is complete and the schema is being used to validate instances. He suggests that following the instance structure may be more useful in these cases.
Arguably these cases are more common since authorship ideally occurs once.
The primary benefit of such a structure would be that all errors and annotations that pertain to a specific instance location would be collected together.
I want to review our original decision to follow the schema rather than the instance, but I think it was motivated from a point of view of authorship and debugging of the schema itself rather than repeatedly validating varying instances, as would happen in a live system and as one might expect from an online validator.
In #1065 @jdesrosiers suggests fully identifying keywords in the output format, and I suggest doing so by using the keyword name as a fragment on the URI of the vocabulary. @gregsdennis mentioned that this should get noted here, so... noted!
I'm starting to look into actually updating this section, and one of the first things I'd like to do is work out an instance-based format. This is what I've come up with so far.
For convenience, this is the example in the spec:
// schema
{
"$id": "https://example.com/polygon",
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$defs": {
"point": {
"type": "object",
"properties": {
"x": { "type": "number" },
"y": { "type": "number" }
},
"additionalProperties": false,
"required": [ "x", "y" ]
}
},
"type": "array",
"items": { "$ref": "#/$defs/point" },
"minItems": 3
}
// instance
[
{
"x": 2.5,
"y": 1.3
},
{
"x": 1,
"z": 6.7
}
]
Errors are:

- The second point is missing required property `y`.
- The second point has property `z` but shouldn't.

Instance-based output would look something like this:
[
{ "valid": true },
{
"x": { "valid": true },
"y": {
"keywordLocation": "/items/$ref/required",
"absoluteKeywordLocation": "https://example.com/polygon#/$defs/point/required",
"error": "Required property 'y' not found."
},
"z": {
"keywordLocation": "/items/$ref/additionalProperties",
"absoluteKeywordLocation": "https://example.com/polygon#/$defs/point/additionalProperties",
"error": "Additional property 'z' found but was invalid."
}
},
{
"valid": false,
"keywordLocation": "/minItems",
"instanceLocation": "",
"error": "Expected at least 3 items but found 2"
}
]
So the idea here is that any values will be replaced by the error(s). It's not shown here, but a single location can have multiple errors, and these would be collected into an array, whereas a single error doesn't necessarily need an enclosing array.
Optionally the { "valid": true }
could be replaced by the value given in the instance, but having the result object may be terser in some cases, depending on the value. Personally, I like having it as it seems to collapse an entire valid subtree down to this simple construct.
There's also an added item for minItems
. There's not really much we can do for arrays aside from just adding an item. Again, I think wrapping all of the errors in a single array and adding this to the "instance" array would be best. This way there's only ever a single item added to the array.
I think a format like this could be quite useful to anything that uses schema to generate or validate input from forms.
Thoughts?
@gregsdennis I was working on something similar a while ago. I can't remember why, but I found trying to mimic the structure of the instance to be problematic. What I came up with instead was a simple 2D Map from location pointer to keyword to results. Given an instance location, you can easily get all the annotations that apply to that location or all the annotations for a certain keyword that apply to that location. This approach isn't fully vetted either and might have problems as well, but I'm sharing in case it's useful. Here's what your example would look like.
{
"#": {
"https://json-schema.org/draft/2020-12/schema#validate": [{ "valid": false }],
"https://json-schema.org/draft/2020-12/schema#type": [{ "valid": true }],
"https://json-schema.org/draft/2020-12/schema#items": [{ "valid": false }],
"https://json-schema.org/draft/2020-12/schema#minItems": [{ "valid": false }]
},
"#/0": {
"https://json-schema.org/draft/2020-12/schema#validate": [{ "valid": true }],
"https://json-schema.org/draft/2020-12/schema#type": [{ "valid": true }],
"https://json-schema.org/draft/2020-12/schema#properties": [{ "valid": true }],
"https://json-schema.org/draft/2020-12/schema#additionalProperties": [{ "valid": true }],
"https://json-schema.org/draft/2020-12/schema#required": [{ "valid": true }]
},
"#/0/x": {
"https://json-schema.org/draft/2020-12/schema#validate": [{ "valid": true }],
"https://json-schema.org/draft/2020-12/schema#type": [{ "valid": true }]
},
"#/0/y": {
"https://json-schema.org/draft/2020-12/schema#validate": [{ "valid": true }],
"https://json-schema.org/draft/2020-12/schema#type": [{ "valid": true }]
},
"#/1": {
"https://json-schema.org/draft/2020-12/schema#validate": [{ "valid": false }],
"https://json-schema.org/draft/2020-12/schema#type": [{ "valid": true }],
"https://json-schema.org/draft/2020-12/schema#properties": [{ "valid": true }],
"https://json-schema.org/draft/2020-12/schema#additionalProperties": [{ "valid": false }],
"https://json-schema.org/draft/2020-12/schema#required": [{ "valid": false }]
},
"#/1/x": {
"https://json-schema.org/draft/2020-12/schema#validate": [{ "valid": true }],
"https://json-schema.org/draft/2020-12/schema#type": [{ "valid": true }]
},
"#/1/z": {
"https://json-schema.org/draft/2020-12/schema#validate": [{ "valid": false }]
}
}
I'm using URIs to identify keywords rather than just the name to distinguish between different drafts. I've used a simple { "valid": true|false }
block to make the example easier to read, but these should be full output unit blocks. The leaf is always an array because it is possible for a keyword to apply to one location more than once although that never comes up in this example. I treat validating a schema like an implied keyword called "validate" which is why that appears for every location.
I found trying to mimic the structure of the instance to be problematic
I've found this, too. In mimicking any structure, we have to end up modifying it to inject the error data. It's awkward at best.
I like what you have. I think the keywords as URIs is unnecessary, but I understand where you're coming from and I think we need a separate issue for it. I don't see how a single keyword can generate multiple errors for a single instance location, though. And what is the `#validate` URI?
Also, this seems more of a variation on the current "basic" output, which is a flat list of errors. In conversations with people who wanted an instance-based structure when we were initially considering a formalized output, it was clear they wanted a hierarchical format, just that it followed the instance.
I don't mean to derail the conversation with my extensions. This was not a proposal. I'm just showing what I've done. Feel free to incorporate as little or as much as you like into your proposal.
I think the keywords as URIs is unnecessary, but I understand where you're coming from and I think we need a separate issue for it.
See #1065.
I don't see how a single keyword can generate multiple errors for a single instance location
"allOf": [{ "title": "foo" }, { "title": "bar" }]
. In this case "title": "foo"
and "title": "bar"
both apply to the same location. Different title
keywords, but same location.
And what is the
#validate
URI?
Not sure if you missed the explanation or it wasn't clear ...
I treat validating a schema like an implied keyword called "validate" which is why that appears for every location.
I think you can safely ignore it. It's really useful for me, but not standard.
Oh, no. At this point, I'm open to discussing how else we can represent errors. What we have is the best we could come up with before, and it's received.... criticism... about its usefulness in certain scenarios. I just want to address that.
Keep the ideas coming.
"allOf": [{ "title": "foo" }, { "title": "bar" }]
. In this case"title": "foo"
and"title": "bar"
both apply to the same location. Different title keywords, but same location.
I see your point, but this illustrates to me that maybe it's not necessary to group by keyword within this format. I'd see these two `title` keywords as separate because they're in different subschemas of the `allOf`. Since the keyword (and its specific location within the schema) is part of the error output, I think just a flat list of errors for each instance location is sufficient. Maybe I'm wrong, though, and grouping would be desired.
Not sure if you missed the explanation or it wasn't clear ...
🤦 So this is like a summary for this instance location, then?
So this is like a summary for this instance location, then?
Not entirely. There will be a result for each schema validated against that location. For example,
{
"allOf": [
{ "title": "foo" },
{ "title": "bar" }
]
}
{
"#": {
"https://json-schema.org/draft/2020-12/schema#validate": [{ "valid": true }, { "valid": true }, { "valid": true }],
"https://json-schema.org/draft/2020-12/schema#allOf": [{ "valid": true }],
"https://json-schema.org/draft/2020-12/schema#title": [{ "valid": true }, { "valid": true }]
}
}
This has three `#validate` results: one for the overall schema and one for each of the `allOf` schemas that also applied to this location.
I don't understand the purpose of having that. It seems like a lot of repeated information.
Nothing is repeated, it's just really thorough. Here's a slightly more useful example.
{
"allOf": [
{ "type": "number" },
{ "maximum": 5 }
]
}
6
{
"#": {
"https://json-schema.org/draft/2020-12/schema#validate": [{ "valid": false }, { "valid": true }, { "valid": false }],
"https://json-schema.org/draft/2020-12/schema#allOf": [{ "valid": false }],
"https://json-schema.org/draft/2020-12/schema#type": [{ "valid": true }],
"https://json-schema.org/draft/2020-12/schema#maximum": [{ "valid": false }]
}
}
`#validate` tells us that the schema at `/allOf/1` failed validation. You could extrapolate that from the `#maximum` failure, but this allows tooling to skip the expense and complexity of computing that information. Admittedly, this information isn't always useful. For example, I filter out these nodes when presenting errors to users on https://json-schema.hyperjump.io. Humans can trivially extrapolate that information and including it actually increases the cognitive load. But, if one day I add inline errors to the UI, I expect I would want that data to build the UI.
Here's an example of how I was finding the `#validate` result useful in some unfinished work I was doing a while ago.
{
"type": "object",
"properties": {
"foo": { "type": "string" }
}
}
{ "foo": "bar" }
{
"#": {
"https://json-schema.org/draft/2020-12/schema#validate": [{ "valid": true }],
"https://json-schema.org/draft/2020-12/schema#type": [{ "valid": true }],
"https://json-schema.org/draft/2020-12/schema#properties": [{ "valid": true }]
},
"#/foo": {
"https://json-schema.org/draft/2020-12/schema#validate": [{ "valid": true }],
"https://json-schema.org/draft/2020-12/schema#type": [{ "valid": true }]
}
}
We know the instance is valid. If we then modify the "foo" property to "baz" and want to revalidate, it would be nice if we didn't have to revalidate everything. We just want to revalidate what has changed. I can use the results to identify the sub-schema to run validation against the new value ("baz").
Hey everyone. I've been reading through all of the comments in this and linked issues, compiling a list of topics.
It seems that we have consensus on these things:

- Rename `errors` to `nested`/`subschemas`/`inner`/`details` (other suggestions?)
- `detailed` to be specified as MAY instead of SHOULD (like `verbose` is currently)
- No special handling for `false` schemas (no `$false` in the location)
- `valid` is required for each node
- `error` is not required but is declared as the designated place to put error messaging. Tooling that consumes this output should expect an error message here.
These things still require decisions (but I've proposed some things):

- `absoluteKeywordLocation` instead of `keywordLocation`: `keywordLocation` could be `validationPath` and optional (default off) to the end user; opens the door for `absoluteKeywordLocation` to change to `schemaLocation` (for terseness); implementations SHOULD support an option mechanism which determines the inclusion of `validationPath`.
- `keyword` to support renaming :point_up: (and maybe `dialect` - import from #1065). Only makes sense when a keyword is present. For instance at the root or `#/allOf/3` there is no keyword, so these would be absent.
- `validationPath` and `instanceLocation` can be plain JSON Pointers and do not need to be URI-encoded
- … in the `basic` and `detailed` formats? My users have said no. (`verbose` should show everything.)
- Rename `detailed` and `verbose` to `*-by-schema` and add `*-by-instance` (or some other way to indicate both structure and quantity of info) to support instance-based structures (see below for examples)
  - `basic` is a collection of terminal nodes
  - `detailed-by-schema` follows the current structure and reduction rules (including the special case for `not`)
  - `detailed-by-instance` I'll discuss later.

And this isn't directly about output, but supports it:

- … `absoluteSchemaLocation`/`schemaLocation`. This may be in the spec already, but I figured it's worth mentioning.

Annotations need to be part of the node, not rendered as child validation nodes. I don't think there's any argument here.
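If the renames above were adopted, a single unit might look something like this (a sketch only; the dialect URI, locations, and message are invented, and `validationPath` would be optional):

```json
{
  "valid": false,
  "keyword": "required",
  "dialect": "https://json-schema.org/draft/2020-12/vocab/validation",
  "validationPath": "/items/$ref/required",
  "schemaLocation": "https://example.com/polygon#/$defs/point/required",
  "instanceLocation": "/1",
  "error": "Required property 'y' not found."
}
```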
I've been thinking around this for a couple hours now, and it seems we have three orthogonal attributes that we need to consider.
Not all keywords provide validation. For example, the spec says nothing about validation for `title` and `description`.
My implementation handles this by always passing validation for these keywords, which works to provide the correct pass/fail outcome, but it also means that they produce an output node. So I get something like this as a node:
{
"valid": true,
"keywordLocaation": "#/description",
"schemaLocation": "https://example.com/mySchema#/description",
"instanceLocation": "",
"annotation": "description"
}
Question: should pure-annotation keywords produce an output node or simply be listed in parent nodes?
I can see arguments for both sides, and I think I have a pretty easy tweak to my implementation that would hide these, so I'm not invested either way.
These are actually two different ideas, but I think I often conflate them, so I'm going to consider them distinct here.
A keyword can produce annotations, and it can propagate annotations from its children. Most keywords do one or the other, but some do both.
- `title` and `description` only produce annotations.
- Keywords like `allOf` only propagate child annotations.
- `properties` both produces (which of the listed properties are present) and propagates annotations.

Regardless of how an implementation manages annotations internally, I believe it's worthwhile keeping these distinctions separate when reporting them to the user. It can be useful to know whether a keyword produced an annotation or is just passing a message.
I'd like to propose two more output node properties to cover each of these ideas:

- An `annotation` property. This is just the raw value of the annotation (e.g. a string for `title` or an array of strings for `properties`).
- A `collectedAnnotations` property. This is an object where the key is the schema location (i.e. the value that would be in `schemaLocation`) of the keyword that produced it, and the value is the annotation itself.

The downside to listing collected annotations is duplication of annotation values. But on the upside, you can:
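To make the two proposed properties concrete, a hedged sketch of what a `properties` node might carry (the values are invented, and the `collectedAnnotations` keys are hypothetical schema locations of the keywords that produced each annotation):

```json
{
  "valid": true,
  "instanceLocation": "",
  "annotation": ["x", "y"],
  "collectedAnnotations": {
    "https://example.com/point#/properties/x/description": "x coordinate",
    "https://example.com/point#/properties/y/description": "y coordinate"
  }
}
```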
These examples take the spec scenario and show what the output would look like if we changed all of this, including my suggestions on the open discussion topics.
For reference, here's the spec scenario:
// schema
{
"$id": "https://example.com/polygon",
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$defs": {
"point": {
"type": "object",
"properties": {
"x": { "type": "number", "description": "x coordinate" },
"y": { "type": "number", "description": "y coordinate" }
},
"additionalProperties": false,
"required": [ "x", "y" ]
}
},
"type": "array",
"items": { "$ref": "#/$defs/point" },
"minItems": 3
}
// instance
[
{
"x": 2.5,
"y": 1.3
},
{
"x": 1,
"z": 6.7
}
]
{ "valid": true }
(I never really liked having a top-level node here, but I don't know how else to collect the nodes. I do like consistently having an object at the root, as opposed to an array.)
Note the inclusion of the annotations from the subschemas which passed validation and how they're no longer included once we move up through a subschema that failed validation.
I think I've been able to marry our two approaches. Considerations:
I do need to add a `results` property to contain the validation results that pertain to each location, and `nested` as an object instead of an array made more sense here.
Initially I was only considering the verbose format for instance-based, but in building that example, I realized how to pare down the fluff.
The rules I'm following for this are:

- …
- If a unit's `validationPath` is contained within another unit's `validationPath`, it (the first one) can be removed.

This second rule removes pass-through nodes like the kind you'd get having just resolved a `$ref` and arriving at a new subschema. These nodes are just a summary of what's underneath and don't add much to the result.
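A small sketch of that second rule (paths, fields, and message invented for illustration): the first unit's `validationPath` is a prefix of the second's, so only the second would be kept.

```json
[
  {
    "valid": false,
    "validationPath": "/items/$ref",
    "instanceLocation": "/1"
  },
  {
    "valid": false,
    "validationPath": "/items/$ref/required",
    "instanceLocation": "/1",
    "error": "Required property 'y' not found."
  }
]
```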
@gregsdennis That's a lot! I'm going to avoid pulling on too many threads at once. So, I'll start with the consensus list. Here are a couple of notes. For any of the items I don't mention, you can assume I'm in agreement.
`detailed` to be specified as MAY instead of SHOULD (like `verbose` is currently)
I don't know what this means. Can you provide a little more explanation?
`error` is required as specified by the keyword (the same way that keywords specify their annotations) - wording is implementation-specific (preferably configurable to allow for custom messages or internationalization; token replacement as a possible suggested mechanism?)
I don't think `error` should be a required field. Ideally, I'd like to just provide the output results to some third party library like https://github.com/atlassian/better-ajv-errors or https://github.com/apideck-libraries/better-ajv-errors and let them deal with messaging and translation. Given the schema and the instance, all these libraries should need are pointers to the keyword and instance where the error occurred.
`detailed` to be specified as MAY instead of SHOULD (like `verbose` is currently)
`verbose` is currently specified as a "MAY" requirement, but `detailed` is a "SHOULD". I thought to help support implementers with the change, a little loosening of the spec wasn't a bad thing.
I don't think error should be a required field...
Fair point. I can go with that. I'll update above.
Working branch is draft-next-output.
Unless I'm missing some contradicting wording somewhere else, `detailed` is not a SHOULD by itself. Implementing `flag` or `basic` covers the requirement as well.
An implementation SHOULD provide at least one of the "flag", "basic", or "detailed" format and MAY provide the "verbose" format. If it provides one or more of the "detailed" or "verbose" formats, it MUST also provide the "flag" format.
🤔 yeah, I see what you're getting at. I'll do some digging to see where we decided that. Do you have an opinion on this requirement?
I don't have a strong opinion. I'd suggest that implementations MUST support `flag` and SHOULD support `basic` or `detailed`. Anything beyond that is just nice-to-have.
Given that I'm expanding `detailed` into `detailed-schema` (what we have currently) and `detailed-instance` (an instance-based format), would you say that implementations SHOULD provide just one of them? It seems odd to prefer one over another, whether by the spec or the implementation.
I do like requiring `flag`, though.
I agree that we probably shouldn't prefer one over the other. So, I'd suggest: MUST support `flag` and SHOULD support `basic`, `detailed-schema`, or `detailed-instance`. I would personally (not in the spec) encourage implementations to support all of the output formats because each is useful in different situations, but as far as minimum requirements, at least one format with error details should be sufficient.
I haven't had a chance to go over all of this in detail and respond to all the points that I want to respond to -- but I wonder if it might be worthwhile moving to a community discussion so we can have threaded conversations about the various points? Many of them are not directly related, and some require a deep dive to understand motivations and to examine tradeoffs. (We might need an ADR for these, too!)
Sounds good. I'll copy this to a discussion, making edits for the conversation @jdesrosiers and I have had so far. We can create threads for each item.
We don't have GH discussions on in this repo so I opened one in the Community repo with the General topic: https://github.com/json-schema-org/community/discussions/63
I think this is basically resolved at this point. Closing for now.
While writing Section 10 on output, I was implementing the formatting in my library, Manatee.Json. In that library, I don't collect annotations in the same way as described in the spec. As a result the annotation output of a successful validation was not fully explored.
I have recently been working on a new validator that does follow the recommendations of the spec in regard to annotation collection and have reached the same questions that many other implementors have asked me regarding annotations in the output format, to which I naively responded that annotation output is analogous to error output.
I aim to fix this. This issue is a compilation of questions and issues I've experienced trying to reimplement output formatting as well as potential solutions to some of them.
1. Annotations vs Nested Results
With the current output, annotations are to be represented as their own output unit within the `annotations` collection, including the `valid` property and all JSON pointers indicating where the annotation came from and to what part of the instance it applies. Additionally, any nested results were also to be in this array. This conflates annotations with nested results, and the only way to identify which were annotations and which were nested results was via the location properties.

To resolve this, we should split these collections. Nested results should remain as they are, but the property name needs to change.
The proposed fix adds (or hijacks, rather) a single property, `annotations`, to the output unit. It's only present for `valid = true` and when there are annotations (no empty arrays). The value is an array of all annotations for this node.

For instance, `properties` requires that it adds the names of any validated properties as annotations. To this end, the `annotations` on the `properties` node will be an array of these property names.

I'm not sure how this fits in with collecting unknown keywords as annotations since that would effectively require a node for each unknown keyword, but maybe that's fine (the location pointer would indicate the keyword, and the annotation value would be the unknown keyword's value).
This idea also may invert the intent behind dropping combining annotations, but I haven't thought through that completely, yet.
This is also in line with @ssilverman's comment regarding how his implementation handles annotations internally. (My new validator uses a similar mechanism.)
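For illustration, a sketch of what the proposed `annotations` array might look like on a `properties` node (locations and values invented; other field names follow the current spec):

```json
{
  "valid": true,
  "keywordLocation": "/properties",
  "instanceLocation": "",
  "annotations": ["x", "y"]
}
```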
2. Nested Nodes
Having the nested result nodes under `annotations` for validated results and `errors` for invalidated results was, admittedly, me trying to be clever about things. Practically, it's difficult to deal with.

The proposal here is to have these always under a consistent name, e.g. `nested`, `results`, etc. I'm not too fussed about what the name is, but it should always be that regardless of the value of `valid`.

3. Too Many ~Secrets~ Nodes
In trying to pare down the list of nodes for the `detailed` and `basic` formats, I found that a lot of nodes were kept that may not be necessary.

For example, in an error scenario, is it necessary to list passing subschemas of an `allOf`? To a person just trying to figure out what's wrong, these are just noise to be filtered out in order to pinpoint the one subschema that failed.

The algorithm for reducing returned values could be adjusted, and maybe needs to be dynamic based on the scenario. Maybe if a schema is valid, I don't care about any of the details and I just want a ✔️. But for failures I want to know only those nodes that failed, but I want detailed info about them, and maybe in a format that mimics the schema (as `detailed` is intended to do). The current formats (and the algorithms to generate them) do not support this.

(Fake internet points for anyone who gets my reference in the title of this section.)