Closed by trajano 6 years ago
I've updated PR #610, which covers the annotation collection process and requirements for data collected, but without specifying the exact format. It now also has an example of how some of that information is intended to be used by applications.
@gregsdennis your most recent comment sounds great, except that the resolved schema needs to be a URI, not just a pointer. Schema locations may come from multiple schema documents, so a pointer alone is inadequate.
The schema path can be a pointer, although we'll have to be clear about it being a pointer for the post-resolution data structure (or some such terminology).
I see the goal of the output as providing the most efficient, implementable, minimal, and extensible foundation. That's why hierarchy didn't sit well with me: recursion is more expensive than flat loops.
@gregsdennis given your suggestion of a simple boolean result do you feel that perhaps the result should be within an object to provide the overall output?
It goes against the mantra I just wrote of being minimal, but it could also make it easier to consistently expect the valid property to exist, and to know whether a bare array result contains errors or annotations. The annotation and validation processes still remain quite separate.
Do you see the end array result combined or under separate keywords?
// Minimum
{
"valid": true
}
// Annotation focused:
{
"valid": true,
"annotation": [...]
}
// Errors and/or Annotations:
{
"valid": false,
"errors": [...],
"annotations": [...]
}
// or
{
"valid": false,
"results": [...errors, ...annotations]
}
That seems logical. 🖖
Errors and annotations should never both be populated since failing validation removes annotations. That said, I like your third example, but only one need appear in the object at a time.
// Minimum
{
"valid": true
}
// Verbose success:
{
"valid": true,
"annotations": [...]
}
// Errors:
{
"valid": false,
"errors": [...]
}
I suppose you could also have a "verbose failed" that just returns {"valid": false}, though I doubt its usefulness.
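As a sketch of these envelope rules (a hypothetical helper; field names follow the examples above), a producer might look like:

```python
def make_result(valid, errors=None, annotations=None):
    """Build a result envelope per the shapes discussed above (sketch).

    Success carries annotations (if any); failure carries errors only,
    since failing validation discards annotations.
    """
    result = {"valid": valid}
    if valid and annotations:
        result["annotations"] = annotations
    elif not valid and errors:
        result["errors"] = errors
    return result
```

Callers would then get the minimal {"valid": true} when there is nothing else to report.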
I'm still in favor of a hierarchical structure with a variable/optional level of depth (say, verbosity).
And also, what do you think about using data pointer strings as keys in a map of errors instead of having arrays of errors? That would allow checking whether a particular field in the data has failed or passed without iterating.
@vearutop I explored using pointers as keys. The problem is that a single field can fail with multiple errors. You'd have to array the errors in the end, anyway.
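To illustrate why the map values end up being arrays either way, here is a sketch (hypothetical field names) that groups a flat error list by instance pointer:

```python
from collections import defaultdict

def group_by_instance_pointer(flat_errors):
    """Group a flat error list into {instance pointer: [errors]}.

    A single field can fail multiple keywords at once, so the map
    values must be arrays regardless of the keying scheme.
    """
    grouped = defaultdict(list)
    for err in flat_errors:
        grouped[err["dataPath"]].append(err)
    return dict(grouped)

errors = [
    {"dataPath": "/foo", "keyword": "type"},
    {"dataPath": "/foo", "keyword": "minimum"},  # same field, second failure
]
```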
While I do like the instance-based approach, I'm finding it difficult to figure out how I'd implement it. Also, the majority favors the flat format, and a few existing implementations (mine included) already use it to some extent.
I think we should not choose an approach by ease of implementation only. We should also strive to make the validation result actually useful for the client.
I’m not convinced how a flat format would allow the consumer to track down the exact chain of reasons why validation failed. The example validation result linked in https://github.com/json-schema-org/json-schema-spec/issues/396#issuecomment-400553752 is really unhelpful in that.
@yurikhan "a flat format would allow the consumer to track down the exact chain of reasons why validation failed"
I see where you're coming from, and I think that omitting errors like "A subschema failed validation" (from my invalid flat example) may help clear some of that up. In the flat format those errors admittedly hide the real issue, which is (in this case) "The value '3' is not greater than or equal to '10'." From there, the client can use the instance pointer to determine precisely where the problem is.
That is so only in the simplest cases, namely, logical conjunction of conditions.
Consider a moderately complex schema involving a couple of disjunctions:
{"type": "object", "properties": {
"foo": {"type": "number", "anyOf": [
{"maximum": 5},
{"minimum": 15}
]},
"bar": {"type": "integer", "anyOf": [
{"multipleOf": 3},
{"multipleOf": 5}
]}
}}
The value {"foo": 10, "bar": 2} is invalid against this schema, for 4 atomic reasons.
A hierarchical response will say (paraphrasing from JSON): "it is invalid because #/foo (10) is neither less than 5 nor greater than 15, and #/bar (2) is neither a multiple of 3 nor a multiple of 5". See, in this simple case, I didn't even have to quote schema references.
A flat response will say… what?
- #/properties/foo/anyOf/0/maximum: #/foo (10) should be less than 5;
- #/properties/foo/anyOf/1/minimum: #/foo (10) should be greater than 15;
- #/properties/foo/anyOf: #/foo (10) should satisfy at least one subschema;
- #/properties/bar/anyOf/0/multipleOf: #/bar (2) should be a multiple of 3;
- #/properties/bar/anyOf/1/multipleOf: #/bar (2) should be a multiple of 5;
- #/properties/bar/anyOf: #/bar (2) should satisfy at least one subschema?

Imagine just a few more properties, and, as a human reading that list, I have to resort to pencil and paper to reconstruct the hierarchy. Understanding the exact nature of the violation(s) from the validation response is marginally easier than just manually re-validating against the schema; and I honestly don't see how to write code that turns that list into a message suitable for presenting to an end user (who knows their way around JSON but is not necessarily a Schema wizard).
And we have not even broken out if/then/else yet, which are mostly redundant syntax sugar in terms of "valid/invalid" boolean logical expressive power, but are theoretically capable of getting us a long way towards more understandable "invalid because" violation reports.
The flat response is only flat in regard to listing the JSON instance values. Within each, the annotations/errors are grouped; we don't have an entry for each data/result combination.
So the flat response will say...
- /foo is invalid because …
- /bar is invalid because …

So, yeah, there has to be some hierarchy to account for *Of. But we group by the data location.

Edit: The result object has a collection of annotations/errors to support multiple entries for that data point.
I like this. We're getting closer...
Grouping by data location only makes sense for conditions that are members of the top-level conjunction. What about more complex conditions involving multiple properties?
{"type": "object", "anyOf": [
{"properties": {"foo": {"enum": ["a"]}}},
{"properties": {"bar": {"multipleOf": 2}}}
]}
{"foo": "b", "bar": 3}
is invalid because…?
You're still grouping by data location.
- /foo is invalid because it is not in ["a"] (from schema location /anyOf/0/properties/foo/enum)
- /bar is invalid because of /anyOf/1/properties/bar/multipleOf

If you wanted to include that the entire thing fails because /anyOf didn't match any of the subschemas, then I would expect the hierarchical layout to be better. But if you can infer that that subschema failed from the errors above, then I think the list would be better.
I think grouping by data location makes sense because, naturally, JSON Schema does not allow cross-data references (at least so far), and every violation is tightly coupled with some exact place in the data.
{
"/somewhere/in/data": [<violations>],
"/another/place/in/data": [<violations>]
}
Oh, indeed: https://github.com/json-schema-org/json-schema-spec/issues/396#issuecomment-402051319. In this case violations are caused by several locations.
I'm still not very comfortable with pointers as keys, but it is more terse. It probably just comes from my experience in .Net. I like keys to have valid property names.
And I do like the idea of having an encompassing object that holds the overall result and a list of annotations/errors.
@gregsdennis yes, now I agree; this example (https://github.com/json-schema-org/json-schema-spec/issues/396#issuecomment-402051319) shows very well that you cannot always split violations into a map.
Every violation is coupled with an exact place, yes. But each violated subschema has a different role in the validity of the root instance.
Let me construct an example where a single location participates via two different paths:
{"anyOf": [
{"properties": {
"foo": {"enum": ["a"]},
"bar": {"multipleOf": 3}
}},
{"properties": {
"foo": {"enum": ["b"]},
"bar": {"multipleOf": 5}
}}
]}
No matter how you slice it, grouping that by location is not going to be helpful.
@yurikhan that's a good example.
Instance data of {"foo":"a","bar":10} would fail, but how do you indicate the failure?
- foo passes on /anyOf/0, but then bar fails there
- bar passes on /anyOf/1, but then foo fails there

The full instance only passes on one portion of each of the subschemas. To do this as a list would be:
- / fails (from /anyOf)
- /foo is not in ["b"] (from /anyOf/1/foo)
- /bar fails (from /anyOf/0/bar)

Without the first item, there's no way to indicate that only one of these requirements needs to be met.
Maybe that's how the hierarchy helps. That would result in:
- / fails (from /anyOf)
  - /foo is not in ["b"] (from /anyOf/1/foo)
  - /bar fails (from /anyOf/0/bar)

This at least includes the result from the anyOf in a way that indicates some sort of relationship.
This brings us back to the discussion of "do we base the results on the schema or the instance?" I think this example shows that grouping based on either leaves the other lacking organization. If that's the case, maybe we shouldn't group at all: just have a list of objects where each object represents a single annotation/error of a single data point. Then the client can decide how to group them.
[
{
"data": "/",
"keyword": "/anyOf",
"message": "at least one subschema should pass"
},
{
"data": "/bar",
"keyword": "/anyOf/0/bar",
"message": "value must be a multiple of 3"
},
{
"data": "/foo",
"keyword": "/anyOf/1/foo",
"message": "value must be in the set [\"b\"]"
}
]
This should make everyone equally unhappy.
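As a sketch of the "client decides how to group" idea, using the flat list above (regroup is a hypothetical client-side helper, not part of the proposal):

```python
from itertools import groupby

# The flat result list from the example above.
results = [
    {"data": "/",    "keyword": "/anyOf",       "message": "at least one subschema should pass"},
    {"data": "/bar", "keyword": "/anyOf/0/bar", "message": "value must be a multiple of 3"},
    {"data": "/foo", "keyword": "/anyOf/1/foo", "message": "value must be in the set [\"b\"]"},
]

def regroup(results, key):
    """Regroup the flat list by whichever field the client cares about:
    "data" for an instance-oriented view, "keyword" for a schema-oriented one."""
    ordered = sorted(results, key=lambda r: r[key])
    return {k: list(g) for k, g in groupby(ordered, key=lambda r: r[key])}
```

The same list serves both camps: regroup(results, "data") and regroup(results, "keyword") each take one pass plus a sort.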
Instance data of
{"foo":"a","bar":10}
would fail, but how do you indicate the failure?
For completeness, could you please analyze the {"foo": "c", "bar": 8} case? I think it's even more instructive than the one above, which satisfies half of each path.
This should make everyone equally unhappy.
I will refrain from commenting on this metric of success.
Sure thing, @yurikhan.
List format:
- / fails (from /anyOf)
- /foo is not in ["a"] (from /anyOf/0/foo) (A)
- /foo is not in ["b"] (from /anyOf/1/foo) (B)
- /bar fails (from /anyOf/0/bar) (C)
- /bar fails (from /anyOf/1/bar) (D)

Again, no indication (other than schema paths) that the requirement is A&C or B&D. Help us if it were a oneOf instead of anyOf.
Hierarchy format:
- / fails (from /anyOf)
  - /foo is not in ["a"] (from /anyOf/0/foo) (A)
  - /foo is not in ["b"] (from /anyOf/1/foo) (B)
  - /bar fails (from /anyOf/0/bar) (C)
  - /bar fails (from /anyOf/1/bar) (D)

Here, too, there is no indication of the relationships between A/C and B/D.
Neither really does the job perfectly. It may just be that applications need to know that there will be some level of interpretation of the results.
I think the real goal here is to determine the most ideal solution. There's no guarantee that we can have an "interpretationless" format.
See, this is why I’m opposed to a format that tries to structure itself after the instance: It cannot preserve the essential semantics of many validation results.
On the other hand, in a hierarchical format structured after the schema, it's straightforward. For {"foo": "c", "bar": 8}:
- / does not satisfy / because it should satisfy at least one of two subschemas of /anyOf, but:
  - / does not satisfy /anyOf/0 because:
    - "c" (/foo) is not any of ["a"] (/anyOf/0/properties/foo/enum), and
    - 8 (/bar) is not a multiple of 3 (/anyOf/0/properties/bar/multipleOf)
  - / does not satisfy /anyOf/1 because:
    - "c" (/foo) is not any of ["b"] (/anyOf/1/properties/foo/enum), and
    - 8 (/bar) is not a multiple of 5 (/anyOf/1/properties/bar/multipleOf)

For {"foo": "a", "bar": 10}:
- / does not satisfy / because it should satisfy at least one of two subschemas of /anyOf, but:
  - / does not satisfy /anyOf/0 because:
    - 10 (/bar) is not a multiple of 3 (/anyOf/0/properties/bar/multipleOf)
  - / does not satisfy /anyOf/1 because:
    - "a" (/foo) is not any of ["b"] (/anyOf/1/properties/foo/enum)

This brings up the reason I originally proposed the instance-oriented format. I searched for it, and I fear it's likely in a Slack discussion somewhere, but the idea was that a schema-oriented format would require the client to scan for errors that occurred due to a single instance location.
Like I said before, the application is going to have to do some level of analysis. If we choose an instance-oriented output, the application will have to scan for schema locations. If we choose a schema-oriented output, the application will have to scan for instance locations.
It's a matter of perspective and what the client deems important, I suppose. I don't know that we can make that decision for all clients.
There is an alternative: we could require (or hard-suggest) that implementation support both formats. I'm not terribly excited about that, though.
a schema-oriented format would require the client to scan for errors that occurred due to a single instance location.
What’s the use case there? Do you want to be able to tell the user, “the value of this property here is invalid”? I’m afraid you cannot reliably do that.
With the full expressive power of JSON Schema, the value of a single location in the instance is not invalid by itself. Rather, a specific combination of multiple locations of the instance makes the whole instance invalid, and you cannot point your finger and say which of them did that, because changing any of them could make the instance valid again.
(Well, if you avoid using anyOf, oneOf, not, dependencies, or if/then/else, then you might be able to talk about each specific location being valid or invalid. But that limits the usefulness of schema validation severely.)
TL;DR & leaving some notes here:
@yurikhan I think in the end I was just looking to suggest an alternative approach and provoke discussion.
If we decide to go with the schema-based approach, we will definitely need to figure out how to represent $ref locations within the output structure. The instance-based approach doesn't have that need; instead they are handled via the source pointer and directSource URI in the result object.
@yurikhan @vearutop @gregsdennis I find the discussion to be focusing on the wrong goal. The desired output is not designed for human consumption alone; it should be readable, but human readability is not the prime goal, as that places form over function. I'm not aware of many implementations, if any, that would show the untouched error response data in the GUI to the end user.
Instead we are trying to create the most usable and extensible format.
An array offers the ultimate flexibility for the consumer with the lowest performance impact. If you want hierarchy format, get the library creator to offer that as an option. If you can come up with detail that could be provided to make hierarchy generation and manipulation easier, then we should absolutely look at how to make the content of the EDO easier for everyone to consume. In fact we could even recommend the structure of such an optional output for consistency if there is enough desire for it, but it isn't logical for it to be the primary format of the output that everyone must generate and/or consume.
@erosb I think including instance data as mandatory was killed off a while back.
Thank you @Anthropic. I completely agree that we should focus on applications as the primary consumers. Applications can then transform the results to whatever format is best suitable for their users. If implementations want to offer alternate formats (grouping, etc.), that is their prerogative. But as a minimum, the format should be easily consumable by a computer. I agree that an ungrouped, flat list facilitates that best.
@gregsdennis So then the question becomes, would the following suffice as the minimum EDO?
{
"keyword": "enum",
"dataPath": "/root/zoo",
"schemaURI": "#/cde/anyOf/0/enum",
"schemaPath": "#/properties/root/patternPropertie…Ba-zA-Z0-9_%5D%2B%24/oneOf/2/anyOf/0/enum",
"messageId": "4",
"message": "..."
}
This would then be within either the error or annotation keyword of the root format described previously.
Perhaps message/messageId can be discussed elsewhere to determine what may be the minimum required for computation or producing message templates...
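A sketch of that minimum EDO as a data structure (field names as proposed above; messageId and message are left optional pending that separate discussion):

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class ErrorDescription:
    """Sketch of the minimum Error/Annotation Description Object fields."""
    keyword: str       # the failed keyword, e.g. "enum"
    dataPath: str      # instance pointer, e.g. "/root/zoo"
    schemaURI: str     # resolved schema URI (may span documents)
    schemaPath: str    # pointer into the post-resolution schema structure
    messageId: Optional[str] = None
    message: Optional[str] = None
```

asdict() then yields the JSON-ready object that would sit inside the errors or annotations array of the root envelope.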
@Anthropic
In my experience, flattening a hierarchy is almost always easier to implement than reconstructing it from a flat list. Could you please walk us through an example of such reconstruction? (Especially in the face of $ref in subschemas.)
I agree with @Anthropic. I can only add that: 1) development time/cost is in most cases a higher priority than performance, and it will only be more so going forward; 2) for the majority of use cases an array is sufficient, and given that it's simpler to create and iterate with any language's primitives than trees are, it is a better choice, even if constructing a tree from an array can be marginally more complex.
The list of errors from validation can be seen as something similar to an exception stack trace - a linear representation of the tree of functions that led to an error, from the inner to the outer. I don’t remember anybody complaining about the problem with linear stack traces.
@epoberezkin
You don’t have problems with linear stack traces because they are linear, and stack traces are linear because they reflect the state of a single thread of execution. If you have a function foo trying to call bar, falling back to baz and then to quux, and succeeding if any of those succeed, then you only get the exception and stack trace from quux.
An anyOf schema keyword, on the other hand, can be thought of as forking to execute all of its subschemas in parallel, then joining and reporting all exceptions if no branch succeeded.
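A minimal sketch of that fork/join view (the branch representation is hypothetical, not a real validator): every branch is evaluated, and unlike a stack trace, the errors from all failing branches are kept.

```python
def validate_any_of(value, branches):
    """Evaluate every branch; succeed if any passes, otherwise report
    the errors from all branches. Each branch is a hypothetical
    (schema_path, predicate, message) triple."""
    errors = []
    for schema_path, predicate, message in branches:
        if predicate(value):
            return True, []  # join: one success makes the whole anyOf pass
        errors.append({"schemaPath": schema_path, "message": message})
    return False, errors  # join: no branch passed, so report everything

# Branches modeling {"anyOf": [{"maximum": 5}, {"minimum": 15}]}.
branches = [
    ("/anyOf/0/maximum", lambda v: v <= 5, "should be less than or equal to 5"),
    ("/anyOf/1/minimum", lambda v: v >= 15, "should be greater than or equal to 15"),
]
```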
@yurikhan In my experience, flattening a hierarchy is almost always easier to implement than reconstructing it from a flat list.
I understand that you're speaking from your experience, but you can't generalize this kind of thing.
Consider that if we end up using a schema-based hierarchical format, but the client wants an instance-based hierarchical format, they first have to linearize the result before they can restructure it. The same is true for the reverse.
But if we use a list format, those who want either hierarchy potentially have one fewer step to work out. And it shouldn't be a difficult task for implementations to output a list considering that many already do.
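For comparison, here is a sketch of reconstructing a hierarchy from a flat list by nesting each entry under its longest schema-pointer prefix (hypothetical field names; $ref jumps would break this simple prefix assumption, which is part of yurikhan's point):

```python
def build_tree(flat):
    """Nest each flat result under the entry whose schema pointer is the
    longest proper prefix of its own; entries with no such ancestor
    become roots. Assumes 'keyword' holds a schema pointer."""
    nodes = [dict(r, children=[]) for r in sorted(flat, key=lambda r: len(r["keyword"]))]
    roots = []
    for i, node in enumerate(nodes):
        parent = None
        for candidate in nodes[:i]:  # shorter pointers only
            if node["keyword"].startswith(candidate["keyword"] + "/"):
                if parent is None or len(candidate["keyword"]) > len(parent["keyword"]):
                    parent = candidate
        (parent["children"] if parent else roots).append(node)
    return roots

flat = [
    {"keyword": "/anyOf", "message": "at least one subschema should pass"},
    {"keyword": "/anyOf/0/bar", "message": "value must be a multiple of 3"},
    {"keyword": "/anyOf/1/foo", "message": "value must be in the set [\"b\"]"},
]
```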
In my experience, flattening a hierarchy is almost always easier to implement than reconstructing it from a flat list. Could you please walk us through an example of such reconstruction? (Especially in the face of $ref in subschemas.)
@yurikhan I suspect there is a communication issue here, my previous post already contains enough information to conclude why that is not relevant.
I'd like to ensure you are comfortable with the outcome, can you please message me in Slack so I can work through it with you?
@Anthropic I must disagree, this thread has plenty of reason why hierarchy IS relevant.
I don't think postprocessing is trivial; at the very least you will have to have a JSON Pointer parser utility at hand.
I also don't think performance aspect is important here, as I consider having this response optional upgrade from true/false validation status (it is not a must to create such error response for every validation session).
Also, having a spec for the format does not prevent implementations from having optional optimizations, like "do not collect logical subschema errors into the response", in which case you would end up with a nice flat array. If the consumer is not interested in deeper details, why would you collect subschema errors anyway? And on the opposite side, why would the consumer build the hierarchy themselves rather than just get and use it in the spec format?
@Anthropic (Re: Slack) No, I’d much prefer to keep discussion public.
As a producer of instances that may be invalid and an end user of a JSON Schema validator, I want a readable error message, so that by reading it I immediately know what is wrong with the instance (and thus my code that generated it). In cases when that cannot be decided automatically, I want the error message to contain enough structure that I could quickly dismiss irrelevant violations and focus on the relevant part. (In the "(foo equal to "a" and bar multipleOf 3) or (foo equal to "b" and bar multipleOf 5)" example with instance {"foo": "a", "bar": 10} presented upthread, I probably know I meant the "a" case, but the implementation cannot know that.)
As an application developer, I want to be able to easily generate error messages of the kind described above, so that possible bugs in the validation report to error message transformation code cannot cloud the real issue. Ideally, I would achieve that by doing a node-to-node, structure-preserving transformation from a hierarchical validation report object to a hierarchical error message object; but I am willing to settle for having to reconstruct the hierarchy from a flat list if it is demonstrated to be sufficiently easy.
As a validation library contributor, I know the desired structure is not very hard or expensive to maintain within the implementation. I further know that structure naturally emerges as a consequence of the validation logic.
As a potential specification contributor, I note that, until now, all validation has been defined recursively ("An instance validates successfully against this keyword [anyOf] if it validates successfully against at least one schema defined by this keyword's value"). It will be straightforward to extend that if the validation result format is hierarchical with structure based on the schema, leading to a specification that is not much harder to comprehend than its current version.
@yurikhan @Anthropic It's perfectly fine to have off-issue discussions, and then come back to document the result of the discussion. I personally find this preferable to mega-threads, which become unwieldy to anyone who hasn't kept up (raises hand). Although, I also respect if you don't want to use Slack at all. For longer discussions, summary comments with the other party agreeing on the summary are nicer when there are issues of confusion or disagreement =]
My 2c, the errors presented to the user by https://www.jsonschemavalidator.net are pretty handy for schema debugging, including the attachment to the instance json! I haven't looked into how it's represented in code, but that implementation should be considered as an example which is functionally useful.
If discussion is going to continue here, I would expect that once a general consensus can be reached, a new summary can be created (either at the end or as a new issue). I'll defer to @handrews on this. A reminder: general consensus doesn't mean that everyone must agree, but generally it looks like everyone was understood, and most people seem to agree on a direction to move on so a PR can be started. PR will get feedback to make it tighter, and that's OK; we don't need to nail it down completely here and now.
For quick reference, here is the sample result output of https://www.jsonschemavalidator.net/ for {"foo":"f","bar":12} validated with https://github.com/json-schema-org/json-schema-spec/issues/396#issuecomment-402069476
{
"valid": false,
"schemaErrors": [],
"validationErrors": [
{
"message": "JSON does not match any schemas from 'anyOf'.",
"lineNumber": 1,
"linePosition": 20,
"path": "",
"value": null,
"schemaId": "#",
"schemaBaseUri": null,
"errorType": "anyOf",
"childErrors": [
{
"message": "Value \"f\" is not defined in enum.",
"lineNumber": 1,
"linePosition": 10,
"path": "foo",
"value": "f",
"schemaId": "#/anyOf/1/properties/foo",
"schemaBaseUri": null,
"errorType": "enum",
"childErrors": []
},
{
"message": "Value \"f\" is not defined in enum.",
"lineNumber": 1,
"linePosition": 10,
"path": "foo",
"value": "f",
"schemaId": "#/anyOf/0/properties/foo",
"schemaBaseUri": null,
"errorType": "enum",
"childErrors": []
},
{
"message": "Integer 12 is not a multiple of 5.",
"lineNumber": 1,
"linePosition": 19,
"path": "bar",
"value": 12,
"schemaId": "#/anyOf/1/properties/bar",
"schemaBaseUri": null,
"errorType": "multipleOf",
"childErrors": []
}
]
}
],
"schemaParseError": null,
"jsonParseError": null
}
@vearutop thanks for providing the sample.
I noticed that it's not purely recursive, since at the top level there is validationErrors but lower levels have childErrors; that's nit-picking, though. As a recursive structure, it serves as a good example. I'd even say that there's more info than required (line numbers, etc.).
Personally, I'm still not convinced that one format (flat vs hierarchical) is better than the other, though it seems we have a consensus that schema-based hierarchy is better than instance-based.
We've also pretty much nailed down the data that needs to be contained in each reported error.
At this point, I'm looking for evidence- or theory-based arguments for why the spec should favor a format. It's time to pro/con, people.
It would be nice to come up with a rich example of schema and data, with multiple dereferences, not, and oneOf, to also see the case when a missing failure is invalid. And then to assess how helpful currently implemented error systems are.
Also, that example could contribute to the test suite after we have a consensus.
@vearutop thanks for the example, and your last comment. I was going to suggest something similar: a more analytical approach to what people currently do, and whether there's a superset of those implementations that covers as many requirements as are defined to satisfy this issue... we probably need to define the requirements clearly, too. BUT I realised that was a lot of work for someone to do, and I know I don't have the time.
There was a good chat in Slack so I have added it if you click below.
I've tried to make a sample schema and response output, please check: https://gist.github.com/vearutop/376e2a148e6df044417264722f0d1319
Same schema and test cases with ajv processing: https://runkit.com/embed/w5fpvewlnetl
The idea is to provide a flat response (array of causes) for the top-level schema. Optionally, a subschema can produce a recursive array of causes in causeContext.
A single cause is a structure of
For multiple schema keywords (*Of), each element creates a separate causeContext. References create a new causeContext, allowing you to track $ref jumps.
I don't like having all causes in a single flat list, because non-top-level causes add noise but do not help you analyze the root cause until you rebuild the flat list into a hierarchy.
Hi guys, I am new to json schema, implementing a validation library in Erlang. Very interested in this discussion as a framework developer.
To reach consensus, I suggest we summarize the points made here into an Enhancement Proposal, e.g. in a project wiki page. I have read through this mega-thread and conclusions have already been made, which, detailed concisely, might align the community.
@expelledboy that's a good idea, and it's been suggested before. I'll try to write up a new issue that summarizes where we are with this one. Then we can close this out.
@vearutop I agree with the idea of the default output and the causality output providing the same detail at the root error array, so there is no difference in error counts or overall behaviour between the two.
@gregsdennis @vearutop I feel like we could consider:
causeContext
Then for each level, the Annotation/Error Description Object could optionally be alternated with an Enhanced Annotation/Error Description Object with more detail (message IDs, data, embedded schemas, etc., to be decided). This gives implementations a lot of choice while not requiring anyone to go too far from their comfort zone until enough people request it of them.
I feel an implementation SHOULD provide basic, is RECOMMENDED to provide causality, and OPTIONALLY provides verbose, in so far as being able to claim they "support the standard annotation/error output". I like the idea of having a minimum requirement before someone can make that claim.
@gregsdennis Could I suggest rather just starting a wiki page, and continuing the discussion here? We may just end up with another mega-thread.
@expelledboy I'm not sure that's in line with the organization of the repo. Also, this topic is still very much in discussion. I'll discuss it over on Slack with the powers that be here.
FWIW, for the purposes of troubleshooting what went wrong and providing the right level of information to the developer (balancing simplicity with verbosity) we settled on this:
https://github.com/retrosight/rest/blob/master/guidelines.md#errors-when-4xx
and
https://github.com/retrosight/rest/blob/master/schema/com-example-error-2018-03-01.schema.json
Let's also discuss how we can expose that a required property is missing.
My proposal, based on the previous structure: for schema {"required": ["foo", "bar"]} and data {"baz":1}, to have this response
{
"valid": false,
"causeContext": [
{"keyword": "required", "schemaPath": "#/required", "dataPath": "#", "valid": false, "causeContext": [
{"dataPath": "#/foo", "valid": false},
{"dataPath": "#/bar", "valid": false}
]}
]
}
or this response
{
"valid": false,
"causeContext": [
{"keyword":"required", "schemaPath": "#/required", "dataPath": "#/foo", "valid": false},
{"keyword":"required", "schemaPath": "#/required", "dataPath": "#/bar", "valid": false}
]
}
Not sure which one would be better.
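A sketch of producing the second (flat) shape for required, one cause per missing property (field names follow the proposal above; this is an illustration, not a full validator):

```python
def required_errors(schema, data):
    """Produce the flat causeContext shape for a failing `required` keyword:
    one cause per missing property, each with its own dataPath."""
    causes = [
        {"keyword": "required", "schemaPath": "#/required",
         "dataPath": "#/" + name, "valid": False}
        for name in schema.get("required", []) if name not in data
    ]
    return {"valid": False, "causeContext": causes} if causes else {"valid": True}
```

The first (nested) shape would instead emit a single required cause whose own causeContext lists the missing properties.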
I don't know why you'd treat required differently than any other keyword. It's just a messaging question. "One or more required properties were not found" should be sufficient. If the implementation wished to list the properties, that would be fine, too.
This is about the structure, though, not standardized messaging.
The schema provides a way of validating the input, but there should be a standard for the validation results as well.