Closed by trajano 6 years ago
I've updated PR #610, which covers the annotation collection process and requirements for data collected, but without specifying the exact format. It now also has an example of how some of that information is intended to be used by applications.
@gregsdennis your most recent comment sounds great, except that the resolved schema needs to be a URI, not just a pointer. Schema locations may come from multiple schema documents, so a pointer alone is inadequate.
The schema path can be a pointer, although we'll have to be clear about it being a pointer for the post-resolution data structure (or some such terminology).
I see the goal of the output as providing the most efficient, implementable, minimal, and extensible foundation. That's why hierarchy didn't sit well with me: recursion is more expensive than flat loops.
@gregsdennis given your suggestion of a simple boolean result do you feel that perhaps the result should be within an object to provide the overall output?
It goes against the mantra I just wrote of being minimal, but it could also make it easier to consistently expect the valid property to exist, and to know whether a bare array result contains errors or annotations. The annotation and validation processes still remain quite separate.
Do you see the end array result combined or under separate keywords?
// Minimum
{
"valid": true
}
// Annotation focused:
{
"valid": true,
"annotation": [...]
}
// Errors and/or Annotations:
{
"valid": false,
"errors": [...],
"annotations": [...]
}
// or
{
"valid": false,
"results": [...errors, ...annotations]
}
That seems logical. 🖖
Errors and annotations should never both be populated since failing validation removes annotations. That said, I like your third example, but only one need appear in the object at a time.
// Minimum
{
"valid": true
}
// Verbose success:
{
"valid": true,
"annotations": [...]
}
// Errors:
{
"valid": false,
"errors": [...]
}
I suppose you could also have a "verbose failed" that just returns {"valid": false}, though I doubt its usefulness.
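As a sketch of these envelope rules (a hypothetical helper; field names follow the examples above), a producer might look like:

```python
def make_result(valid, errors=None, annotations=None):
    """Build a result envelope per the shapes discussed above (sketch).

    Success carries annotations (if any); failure carries errors only,
    since failing validation discards annotations.
    """
    result = {"valid": valid}
    if valid and annotations:
        result["annotations"] = annotations
    elif not valid and errors:
        result["errors"] = errors
    return result
```

Callers would then get the minimal {"valid": true} when there is nothing else to report.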
I'm still in favor of a hierarchical structure with a variable/optional level of depth (say, verbosity).
And also, what do you think about using data pointer strings as keys in a map of errors instead of having arrays of errors? That would allow checking whether a particular field in the data has failed or passed without iterating.
@vearutop I explored using pointers as keys. The problem is that a single field can fail with multiple errors. You'd have to array the errors in the end, anyway.
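To illustrate why the map values end up being arrays either way, here is a sketch (hypothetical field names) that groups a flat error list by instance pointer:

```python
from collections import defaultdict

def group_by_instance_pointer(flat_errors):
    """Group a flat error list into {instance pointer: [errors]}.

    A single field can fail multiple keywords at once, so the map
    values must be arrays regardless of the keying scheme.
    """
    grouped = defaultdict(list)
    for err in flat_errors:
        grouped[err["dataPath"]].append(err)
    return dict(grouped)

errors = [
    {"dataPath": "/foo", "keyword": "type"},
    {"dataPath": "/foo", "keyword": "minimum"},  # same field, second failure
]
```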
While I do like the instance-based approach, I'm finding it difficult to figure out how I'd implement it. Also, the majority favors the flat format, and a few existing implementations (mine included) already use it to some extent.
I think we should not choose an approach by ease of implementation only. We should also strive to make the validation result actually useful for the client.
I’m not convinced how a flat format would allow the consumer to track down the exact chain of reasons why validation failed. The example validation result linked in https://github.com/json-schema-org/json-schema-spec/issues/396#issuecomment-400553752 is really unhelpful in that.
@yurikhan "a flat format would allow the consumer to track down the exact chain of reasons why validation failed"
I see where you're coming from, and I think that omitting errors like "A subschema failed validation" (from my invalid flat example) may help clear some of that up. In the flat format those errors admittedly hide the real issue, which is (in this case) "The value '3' is not greater than or equal to '10'." From there, the client can use the instance pointer to determine precisely where the problem is.
That is so only in the simplest cases, namely, logical conjunction of conditions.
Consider a moderately complex schema involving a couple of disjunctions:
{"type": "object", "properties": {
"foo": {"type": "number", "anyOf": [
{"maximum": 5},
{"minimum": 15}
]},
"bar": {"type": "integer", "anyOf": [
{"multipleOf": 3},
{"multipleOf": 5}
]}
}}
The value {"foo": 10, "bar": 2} is invalid against this schema, for 4 atomic reasons.
A hierarchical response will say (paraphrasing from JSON): "it is invalid because #/foo (10) is neither less than 5 nor greater than 15, and #/bar (2) is neither a multiple of 3 nor a multiple of 5". See, in this simple case, I didn't even have to quote schema references.
A flat response will say… what?
- #/properties/foo/anyOf/0/maximum: #/foo (10) should be less than 5;
- #/properties/foo/anyOf/1/minimum: #/foo (10) should be greater than 15;
- #/properties/foo/anyOf: #/foo (10) should satisfy at least one subschema;
- #/properties/bar/anyOf/0/multipleOf: #/bar (2) should be a multiple of 3;
- #/properties/bar/anyOf/1/multipleOf: #/bar (2) should be a multiple of 5;
- #/properties/bar/anyOf: #/bar (2) should satisfy at least one subschema?

Imagine just a few more properties, and, as a human reading that list, I have to resort to pencil and paper to reconstruct the hierarchy. Understanding the exact nature of the violation(s) from the validation response is marginally easier than just manually re-validating against the schema; and I honestly don't see how to write code that turns that list into a message suitable for presenting to an end user (who knows their way around JSON but is not necessarily a Schema wizard).
And we have not even broken out if/then/else yet, which are mostly redundant syntax sugar in terms of "valid/invalid" boolean logical expressive power, but are theoretically capable of getting us a long way towards more understandable "invalid because" violation reports.
The flat response is only flat in regard to listing the JSON instance values. Within each, the annotations/errors are grouped; we don't have an entry for each data/result combination.
So the flat response will say...
- /foo is invalid because …
- /bar is invalid because …

So, yeah, there has to be some hierarchy to account for *Of. But we group by the data location.

Edit: The result object has a collection of annotations/errors to support multiple entries for that data point.
I like this. We're getting closer...
Grouping by data location only makes sense for conditions that are members of the top-level conjunction. What about more complex conditions involving multiple properties?
{"type": "object", "anyOf": [
{"properties": {"foo": {"enum": ["a"]}}},
{"properties": {"bar": {"multipleOf": 2}}}
]}
{"foo": "b", "bar": 3}
is invalid because…?
You're still grouping by data location.
- /foo is invalid because it is not in ["a"] (from schema location /anyOf/0/properties/foo/enum)
- /bar is invalid because of /anyOf/1/properties/bar/multipleOf

If you wanted to include that the entire thing fails because /anyOf didn't match any of the subschemas, then I would expect the hierarchical layout to be better. But if you can infer that that subschema failed from the errors above, then I think the list would be better.
I think grouping by data location makes sense because, naturally, JSON Schema does not allow cross-data references (at least so far), and every violation is tightly coupled with some exact place in the data.
{
"/somewhere/in/data": [<violations>],
"/another/place/in/data": [<violations>]
}
Oh, indeed: https://github.com/json-schema-org/json-schema-spec/issues/396#issuecomment-402051319. In this case violations are caused by several locations.
I'm still not very comfortable with pointers as keys, but it is more terse. It probably just comes from my experience in .Net. I like keys to have valid property names.
And I do like the idea of having an encompassing object that holds the overall result and a list of annotations/errors.
@gregsdennis yes, now I agree; this example (https://github.com/json-schema-org/json-schema-spec/issues/396#issuecomment-402051319) shows very well that you cannot always split violations into a map.
Every violation is coupled with an exact place, yes. But each violated subschema has a different role in the validity of the root instance.
Let me construct an example where a single location participates via two different paths:
{"anyOf": [
{"properties": {
"foo": {"enum": ["a"]},
"bar": {"multipleOf": 3}
}},
{"properties": {
"foo": {"enum": ["b"]},
"bar": {"multipleOf": 5}
}}
]}
No matter how you slice it, grouping that by location is not going to be helpful.
@yurikhan that's a good example.
Instance data of {"foo":"a","bar":10} would fail, but how do you indicate the failure?
- foo passes on /anyOf/0, but then bar fails there
- bar passes on /anyOf/1, but then foo fails there

The full instance only passes on one portion of each of the subschemas. To do this as a list would be:
- / fails (from /anyOf)
- /foo is not in ["b"] (from /anyOf/1/foo)
- /bar fails (from /anyOf/0/bar)

Without the first item, there's no way to indicate that only one of these requirements needs to be met.
Maybe that's how the hierarchy helps. That would result in:
- / fails (from /anyOf)
  - /foo is not in ["b"] (from /anyOf/1/foo)
  - /bar fails (from /anyOf/0/bar)

This at least includes the result from the anyOf in a way that indicates some sort of relationship.
This brings us back to the discussion of "do we base the results on the schema or the instance?" I think this example shows that grouping based on either leaves the other lacking organization. If that's the case, maybe we shouldn't group at all: just have a list of objects where each object represents a single annotation/error of a single data point. Then the client can decide how to group them.
[
{
"data": "/",
"keyword": "/anyOf",
"message": "at least one subschema should pass"
},
{
"data": "/bar",
"keyword": "/anyOf/0/bar",
"message": "value must be a multiple of 3"
},
{
"data": "/foo",
"keyword": "/anyOf/1/foo",
"message": "value must be in the set [\"b\"]"
}
]
This should make everyone equally unhappy.
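As a sketch of the "client decides how to group" idea, using the flat list above (regroup is a hypothetical client-side helper, not part of the proposal):

```python
from itertools import groupby

# The flat result list from the example above.
results = [
    {"data": "/",    "keyword": "/anyOf",       "message": "at least one subschema should pass"},
    {"data": "/bar", "keyword": "/anyOf/0/bar", "message": "value must be a multiple of 3"},
    {"data": "/foo", "keyword": "/anyOf/1/foo", "message": "value must be in the set [\"b\"]"},
]

def regroup(results, key):
    """Regroup the flat list by whichever field the client cares about:
    "data" for an instance-oriented view, "keyword" for a schema-oriented one."""
    ordered = sorted(results, key=lambda r: r[key])
    return {k: list(g) for k, g in groupby(ordered, key=lambda r: r[key])}
```

The same list serves both camps: regroup(results, "data") and regroup(results, "keyword") each take one pass plus a sort.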
Instance data of
{"foo":"a","bar":10}
would fail, but how do you indicate the failure?
For completeness, could you please analyze the {"foo": "c", "bar": 8} case? I think it's even more instructive than the one above, which satisfies half of each path.
This should make everyone equally unhappy.
I will refrain from commenting on this metric of success.
Sure thing, @yurikhan.
List format:
- / fails (from /anyOf)
- /foo is not in ["a"] (from /anyOf/0/foo) (A)
- /foo is not in ["b"] (from /anyOf/1/foo) (B)
- /bar fails (from /anyOf/0/bar) (C)
- /bar fails (from /anyOf/1/bar) (D)

Again, no indication (other than schema paths) that the requirement is A&C or B&D. Help us if it were a oneOf instead of anyOf.
Hierarchy format:
- / fails (from /anyOf)
  - /foo is not in ["a"] (from /anyOf/0/foo) (A)
  - /foo is not in ["b"] (from /anyOf/1/foo) (B)
  - /bar fails (from /anyOf/0/bar) (C)
  - /bar fails (from /anyOf/1/bar) (D)

Here, too, there is no indication of the relationships between A/C and B/D.
Neither really does the job perfectly. It may just be that applications need to know that there will be some level of interpretation of the results.
I think the real goal here is to determine the most ideal solution. There's no guarantee that we can have an "interpretationless" format.
See, this is why I’m opposed to a format that tries to structure itself after the instance: It cannot preserve the essential semantics of many validation results.
On the other hand, in a hierarchical format structured after the schema, it's straightforward. For {"foo": "c", "bar": 8}:
- / does not satisfy / because it should satisfy at least one of two subschemas of /anyOf, but:
  - / does not satisfy /anyOf/0 because:
    - "c" (/foo) is not any of ["a"] (/anyOf/0/properties/foo/enum), and
    - 8 (/bar) is not a multiple of 3 (/anyOf/0/properties/bar/multipleOf)
  - / does not satisfy /anyOf/1 because:
    - "c" (/foo) is not any of ["b"] (/anyOf/1/properties/foo/enum), and
    - 8 (/bar) is not a multiple of 5 (/anyOf/1/properties/bar/multipleOf)

For {"foo": "a", "bar": 10}:
- / does not satisfy / because it should satisfy at least one of two subschemas of /anyOf, but:
  - / does not satisfy /anyOf/0 because:
    - 10 (/bar) is not a multiple of 3 (/anyOf/0/properties/bar/multipleOf)
  - / does not satisfy /anyOf/1 because:
    - "a" (/foo) is not any of ["b"] (/anyOf/1/properties/foo/enum)

This brings up the reason I originally proposed the instance-oriented format. I searched for it, and I fear it's likely in a Slack discussion somewhere, but the idea was that a schema-oriented format would require the client to scan for errors that occurred due to a single instance location.
Like I said before, the application is going to have to do some level of analysis. If we choose an instance-oriented output, the application will have to scan for schema locations. If we choose a schema-oriented output, the application will have to scan for instance locations.
It's a matter of perspective and what the client deems important, I suppose. I don't know that we can make that decision for all clients.
There is an alternative: we could require (or hard-suggest) that implementation support both formats. I'm not terribly excited about that, though.
a schema-oriented format would require the client to scan for errors that occurred due to a single instance location.
What’s the use case there? Do you want to be able to tell the user, “the value of this property here is invalid”? I’m afraid you cannot reliably do that.
With the full expressive power of JSON Schema, the value of a single location in the instance is not invalid by itself. Rather, a specific combination of multiple locations of the instance makes the whole instance invalid, and you cannot point your finger and say which of them did that, because changing any of them could make the instance valid again.
(Well, if you avoid using anyOf, oneOf, not, dependencies, or if/then/else, then you might be able to talk about each specific location being valid or invalid. But that limits the usefulness of schema validation severely.)
TL;DR & leaving some notes here:
@yurikhan I think in the end I was just looking to suggest an alternative approach and provoke discussion.
If we decide to go with the schema-based approach, we will definitely need to figure out how to represent $ref locations within the output structure. The instance-based approach doesn't have that need; instead they are handled via the source pointer and directSource URI in the result object.
@yurikhan @vearutop @gregsdennis I find the discussion to be focusing on the wrong goal. The desired output is not designed for human consumption alone; it should be readable, but human readability is not the prime goal, as that places form over function. I'm not aware of many implementations, if any, that would show the untouched error response data in the GUI to the end user.
Instead we are trying to create the most usable and extensible format.
An array offers the ultimate flexibility for the consumer with the lowest performance impact. If you want hierarchy format, get the library creator to offer that as an option. If you can come up with detail that could be provided to make hierarchy generation and manipulation easier, then we should absolutely look at how to make the content of the EDO easier for everyone to consume. In fact we could even recommend the structure of such an optional output for consistency if there is enough desire for it, but it isn't logical for it to be the primary format of the output that everyone must generate and/or consume.
@erosb I think including instance data as mandatory was killed off a while back.
Thank you @Anthropic. I completely agree that we should focus on applications as the primary consumers. Applications can then transform the results to whatever format is best suitable for their users. If implementations want to offer alternate formats (grouping, etc.), that is their prerogative. But as a minimum, the format should be easily consumable by a computer. I agree that an ungrouped, flat list facilitates that best.
@gregsdennis So then the question becomes, would the following suffice as the minimum EDO?
{
"keyword": "enum",
"dataPath": "/root/zoo",
"schemaURI": "#/cde/anyOf/0/enum",
"schemaPath": "#/properties/root/patternPropertie…Ba-zA-Z0-9_%5D%2B%24/oneOf/2/anyOf/0/enum",
"messageId": "4",
"message": "..."
}
This would then be within either the error or annotation keyword of the root format described previously.
Perhaps message/messageId can be discussed elsewhere to determine what may be the minimum required for computation or producing message templates...
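A sketch of that minimum EDO as a data structure (field names as proposed above; messageId and message are left optional pending that separate discussion):

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class ErrorDescription:
    """Sketch of the minimum Error/Annotation Description Object fields."""
    keyword: str       # the failed keyword, e.g. "enum"
    dataPath: str      # instance pointer, e.g. "/root/zoo"
    schemaURI: str     # resolved schema URI (may span documents)
    schemaPath: str    # pointer into the post-resolution schema structure
    messageId: Optional[str] = None
    message: Optional[str] = None
```

asdict() then yields the JSON-ready object that would sit inside the errors or annotations array of the root envelope.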
@Anthropic
In my experience, flattening a hierarchy is almost always easier to implement than reconstructing it from a flat list. Could you please walk us through an example of such reconstruction? (Especially in the face of $ref in subschemas.)
I agree with @Anthropic. I can only add that: 1) development time/cost is in most cases a higher priority than performance, and it will only be more so going forward; 2) for the majority of use cases an array is sufficient, and given that it's simpler to create and iterate with any language's primitives than trees are, it is a better choice, even if constructing a tree from an array can be marginally more complex.
The list of errors from validation can be seen as something similar to an exception stack trace - a linear representation of the tree of functions that led to an error, from the inner to the outer. I don’t remember anybody complaining about the problem with linear stack traces.
@epoberezkin
You don’t have problems with linear stack traces because they are linear, and stack traces are linear because they reflect the state of a single thread of execution. If you have a function foo trying to call bar, falling back to baz and then to quux, and succeeding if any of those succeed, then you only get the exception and stack trace from quux.
An anyOf schema keyword, on the other hand, can be thought of as forking to execute all of its subschemas in parallel, then joining and reporting all exceptions if no branch succeeded.
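A minimal sketch of that fork/join view (the branch representation is hypothetical, not a real validator): every branch is evaluated, and unlike a stack trace, the errors from all failing branches are kept.

```python
def validate_any_of(value, branches):
    """Evaluate every branch; succeed if any passes, otherwise report
    the errors from all branches. Each branch is a hypothetical
    (schema_path, predicate, message) triple."""
    errors = []
    for schema_path, predicate, message in branches:
        if predicate(value):
            return True, []  # join: one success makes the whole anyOf pass
        errors.append({"schemaPath": schema_path, "message": message})
    return False, errors  # join: no branch passed, so report everything

# Branches modeling {"anyOf": [{"maximum": 5}, {"minimum": 15}]}.
branches = [
    ("/anyOf/0/maximum", lambda v: v <= 5, "should be less than or equal to 5"),
    ("/anyOf/1/minimum", lambda v: v >= 15, "should be greater than or equal to 15"),
]
```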
@yurikhan In my experience, flattening a hierarchy is almost always easier to implement than reconstructing it from a flat list.
I understand that you're speaking from your experience, but you can't generalize this kind of thing.
Consider that if we end up using a schema-based hierarchical format, but the client wants an instance-based hierarchical format, they first have to linearize the result before they can restructure it. The same is true for the reverse.
But if we use a list format, those who want either hierarchy potentially have one fewer step to work out. And it shouldn't be a difficult task for implementations to output a list considering that many already do.
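For comparison, here is a sketch of reconstructing a hierarchy from a flat list by nesting each entry under its longest schema-pointer prefix (hypothetical field names; $ref jumps would break this simple prefix assumption, which is part of yurikhan's point):

```python
def build_tree(flat):
    """Nest each flat result under the entry whose schema pointer is the
    longest proper prefix of its own; entries with no such ancestor
    become roots. Assumes 'keyword' holds a schema pointer."""
    nodes = [dict(r, children=[]) for r in sorted(flat, key=lambda r: len(r["keyword"]))]
    roots = []
    for i, node in enumerate(nodes):
        parent = None
        for candidate in nodes[:i]:  # shorter pointers only
            if node["keyword"].startswith(candidate["keyword"] + "/"):
                if parent is None or len(candidate["keyword"]) > len(parent["keyword"]):
                    parent = candidate
        (parent["children"] if parent else roots).append(node)
    return roots

flat = [
    {"keyword": "/anyOf", "message": "at least one subschema should pass"},
    {"keyword": "/anyOf/0/bar", "message": "value must be a multiple of 3"},
    {"keyword": "/anyOf/1/foo", "message": "value must be in the set [\"b\"]"},
]
```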
In my experience, flattening a hierarchy is almost always easier to implement than reconstructing it from a flat list. Could you please walk us through an example of such reconstruction? (Especially in the face of $ref in subschemas.)
@yurikhan I suspect there is a communication issue here, my previous post already contains enough information to conclude why that is not relevant.
I'd like to ensure you are comfortable with the outcome, can you please message me in Slack so I can work through it with you?
@Anthropic I must disagree, this thread has plenty of reason why hierarchy IS relevant.
I don't think postprocessing is trivial; at the very least you will have to have a JSON Pointer parser utility at hand.
I also don't think performance aspect is important here, as I consider having this response optional upgrade from true/false validation status (it is not a must to create such error response for every validation session).
Also, having a spec for the format does not prevent implementations from having optional optimizations, like "do not collect logical subschema errors into the response", in which case you would end up with a nice flat array. If the consumer is not interested in deeper details, why would you collect subschema errors anyway? And on the opposite side, why would the consumer build the hierarchy themselves rather than just get and use it in the spec format?
@Anthropic (Re: Slack) No, I’d much prefer to keep discussion public.
As a producer of instances that may be invalid and an end user of a JSON Schema validator, I want a readable error message, so that by reading it I immediately know what is wrong with the instance (and thus my code that generated it). In cases when that cannot be decided automatically, I want the error message to contain enough structure that I could quickly dismiss irrelevant violations and focus on the relevant part. (In the "(foo equal to "a" and bar multipleOf 3) or (foo equal to "b" and bar multipleOf 5)" example with instance {"foo": "a", "bar": 10} presented upthread, I probably know I meant the "a" case, but the implementation cannot know that.)
As an application developer, I want to be able to easily generate error messages of the kind described above, so that possible bugs in the validation report to error message transformation code cannot cloud the real issue. Ideally, I would achieve that by doing a node-to-node, structure-preserving transformation from a hierarchical validation report object to a hierarchical error message object; but I am willing to settle for having to reconstruct the hierarchy from a flat list if it is demonstrated to be sufficiently easy.
As a validation library contributor, I know the desired structure is not very hard or expensive to maintain within the implementation. I further know that structure naturally emerges as a consequence of the validation logic.
As a potential specification contributor, I note that, until now, all validation has been defined recursively ("An instance validates successfully against this keyword [anyOf] if it validates successfully against at least one schema defined by this keyword's value"). It will be straightforward to extend that if the validation result format is hierarchical with structure based on the schema, leading to a specification that is not much harder to comprehend than its current version.
@yurikhan @Anthropic It's perfectly fine to have off-issue discussions, and then come back to document the result of the discussion. I personally find this preferable to mega-threads, which become unwieldy to anyone who hasn't kept up (raises hand). Although, I also respect if you don't want to use Slack at all. For longer discussions, summary comments with the other party agreeing on the summary are nicer when there are issues of confusion or disagreement =]
My 2c, the errors presented to the user by https://www.jsonschemavalidator.net are pretty handy for schema debugging, including the attachment to the instance json! I haven't looked into how it's represented in code, but that implementation should be considered as an example which is functionally useful.
If discussion is going to continue here, I would expect that once a general consensus can be reached, a new summary can be created (either at the end or as a new issue). I'll defer to @handrews on this. A reminder: general consensus doesn't mean that everyone must agree, but generally it looks like everyone was understood, and most people seem to agree on a direction to move on so a PR can be started. PR will get feedback to make it tighter, and that's OK; we don't need to nail it down completely here and now.
For quick reference, here is the sample result output of https://www.jsonschemavalidator.net/ for {"foo":"f","bar":12} validated with https://github.com/json-schema-org/json-schema-spec/issues/396#issuecomment-402069476
{
"valid": false,
"schemaErrors": [],
"validationErrors": [
{
"message": "JSON does not match any schemas from 'anyOf'.",
"lineNumber": 1,
"linePosition": 20,
"path": "",
"value": null,
"schemaId": "#",
"schemaBaseUri": null,
"errorType": "anyOf",
"childErrors": [
{
"message": "Value \"f\" is not defined in enum.",
"lineNumber": 1,
"linePosition": 10,
"path": "foo",
"value": "f",
"schemaId": "#/anyOf/1/properties/foo",
"schemaBaseUri": null,
"errorType": "enum",
"childErrors": []
},
{
"message": "Value \"f\" is not defined in enum.",
"lineNumber": 1,
"linePosition": 10,
"path": "foo",
"value": "f",
"schemaId": "#/anyOf/0/properties/foo",
"schemaBaseUri": null,
"errorType": "enum",
"childErrors": []
},
{
"message": "Integer 12 is not a multiple of 5.",
"lineNumber": 1,
"linePosition": 19,
"path": "bar",
"value": 12,
"schemaId": "#/anyOf/1/properties/bar",
"schemaBaseUri": null,
"errorType": "multipleOf",
"childErrors": []
}
]
}
],
"schemaParseError": null,
"jsonParseError": null
}
@vearutop thanks for providing the sample.
I noticed that it's not purely recursive, since at the top level there is validationErrors but lower levels have childErrors; that's nit-picking, though. As a recursive structure, it serves as a good example. I'd even say that there's more info than required (line numbers, etc.).
Personally, I'm still not convinced that one format (flat vs hierarchical) is better than the other, though it seems we have a consensus that schema-based hierarchy is better than instance-based.
We've also pretty much nailed down the data that needs to be contained in each reported error.
At this point, I'm looking for evidence- or theory-based arguments for why the spec should favor a format. It's time to pro/con, people.
It would be nice to come up with a rich example of schema and data, with multiple dereferences, not, and oneOf, to also see the case when a missing failure is invalid. And then to assess how helpful currently implemented error systems are.
Also, that example could contribute to the test suite after we have a consensus.
@vearutop thanks for the example, and your last comment. I was going to suggest something similar: a more analytical approach to what people currently do, and whether there's a superset of those implementations that covers as many requirements as are defined to satisfy this issue... we probably need to define the requirements clearly, too. BUT I realised that was a lot of work for someone to do, and I know I don't have the time.
There was a good chat in Slack so I have added it if you click below.
I've tried to make a sample schema and response output, please check: https://gist.github.com/vearutop/376e2a148e6df044417264722f0d1319
Same schema and test cases with ajv processing: https://runkit.com/embed/w5fpvewlnetl
The idea is to provide a flat response (array of causes) for the top-level schema. Optionally, a subschema can produce a recursive array of causes in causeContext.
A single cause is a structure of
For multiple schema keywords (*Of), each element creates a separate causeContext. References create a new causeContext, allowing you to track $ref jumps.
I don't like having all causes in a single flat list, because non-top-level causes add noise but do not help you analyze the root cause until you rebuild the flat list into a hierarchy.
Hi guys, I am new to json schema, implementing a validation library in Erlang. Very interested in this discussion as a framework developer.
To reach consensus, I suggest we summarize the points made here into an Enhancement Proposal, e.g. in a project wiki page. I have read through this mega-thread and conclusions have already been made, which, detailed concisely, might align the community.
@expelledboy that's a good idea, and it's been suggested before. I'll try to write up a new issue that summarizes where we are with this one. Then we can close this out.
@vearutop I agree with the idea of the default output and the causality output providing the same detail at the root error array, so there is no difference in error counts or overall behaviour between the two.
@gregsdennis @vearutop I feel like we could consider:
causeContext
Then for each level, the Annotation/Error Description Object could optionally be alternated with an Enhanced Annotation/Error Description Object with more detail (message IDs, data, embedded schemas, etc., to be decided). This gives implementations a lot of choice while not requiring anyone to go too far from their comfort zone until enough people request it of them.
I feel an implementation SHOULD provide basic, is RECOMMENDED to provide causality, and OPTIONALLY provides verbose, in so far as being able to claim they "support the standard annotation/error output". I like the idea of having a minimum requirement before someone can make that claim.
@gregsdennis Could I suggest rather just starting a wiki page, and continuing the discussion here? We may just end up with another mega-thread.
@expelledboy I'm not sure that's in line with the organization of the repo. Also, this topic is still very much in discussion. I'll discuss it over on Slack with the powers that be here.
FWIW, for the purposes of troubleshooting what went wrong and providing the right level of information to the developer (balancing simplicity with verbosity) we settled on this:
https://github.com/retrosight/rest/blob/master/guidelines.md#errors-when-4xx
and
https://github.com/retrosight/rest/blob/master/schema/com-example-error-2018-03-01.schema.json
Let's also discuss how we can expose that a required property is missing.
My proposal, based on the previous structure: for schema {"required": ["foo", "bar"]} and data {"baz":1}, to have this response
{
"valid": false,
"causeContext": [
{"keyword": "required", "schemaPath": "#/required", "dataPath": "#", "valid": false, "causeContext": [
{"dataPath": "#/foo", "valid": false},
{"dataPath": "#/bar", "valid": false}
]}
]
}
or this response
{
"valid": false,
"causeContext": [
{"keyword":"required", "schemaPath": "#/required", "dataPath": "#/foo", "valid": false},
{"keyword":"required", "schemaPath": "#/required", "dataPath": "#/bar", "valid": false}
]
}
Not sure which one would be better.
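A sketch of producing the second (flat) shape for required, one cause per missing property (field names follow the proposal above; this is an illustration, not a full validator):

```python
def required_errors(schema, data):
    """Produce the flat causeContext shape for a failing `required` keyword:
    one cause per missing property, each with its own dataPath."""
    causes = [
        {"keyword": "required", "schemaPath": "#/required",
         "dataPath": "#/" + name, "valid": False}
        for name in schema.get("required", []) if name not in data
    ]
    return {"valid": False, "causeContext": causes} if causes else {"valid": True}
```

The first (nested) shape would instead emit a single required cause whose own causeContext lists the missing properties.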
I don't know why you'd treat required differently than any other keyword. It's just a messaging question. "One or more required properties were not found" should be sufficient. If the implementation wished to list the properties, that would be fine, too.
This is about the structure, though, not standardized messaging.
The schema provides a way of validating the input, but there should be a standard for the validation results as well.