Standard schema for results, including errors

trajano commented 6 years ago

The schema provides a way of validating the input, but there should be a standard for the validation results as well.

handrews commented 6 years ago

EDIT: This is now tracking overall output formats, success or failure. See also #530 for more specifics about annotation output.

@trajano could you elaborate on this a bit more? Are you talking about input to a specific validator? Or input/results from some other system that you want to use validation with?

trajano commented 6 years ago

I was thinking that we should have something along the lines of a validation result structure akin to https://docs.oracle.com/javaee/7/api/javax/validation/ConstraintViolation.html

For example:

{
   "valid": true | false,
   "violations": [
      {
         "property-path" : "$/path/to[10]/bad/element",
         "invalid-value": "who's bad" | { "some": [ "complex", "object" ] },
         "rule-path": "$/path/in/schema/that/triggered/violation"
     } ,
      {
         "property-path" : "$/path/to[10]/bad/element",
         "invalid-value": "who's bad" | { "some": [ "complex", "object" ] },
         "rule-path": "$/path/in/schema/that/triggered/violation",
          // these may be optionally added but should not be part of the spec
         "message-template": "{0} is who is bad",
         "message": "who's bad is who is bad",
         "position": { "line": 52, "col": 10 }
     } 
    ]
}
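
A minimal sketch of how a structure like the one above might be modelled in code, assuming Python dataclasses. The class names `Violation` and `ValidationResult` and the `to_json` helper are illustrative only and not part of any proposal; the optional message and position fields are left out.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Any, List


@dataclass
class Violation:
    # Field names mirror the keys in the example (hyphens become underscores).
    property_path: str
    invalid_value: Any
    rule_path: str


@dataclass
class ValidationResult:
    violations: List[Violation] = field(default_factory=list)

    @property
    def valid(self) -> bool:
        # The result is valid exactly when no violations were recorded.
        return not self.violations

    def to_json(self) -> str:
        # Serialize using the hyphenated key names from the example.
        body = {
            "valid": self.valid,
            "violations": [
                {key.replace("_", "-"): value for key, value in asdict(violation).items()}
                for violation in self.violations
            ],
        }
        return json.dumps(body)
```

Deriving `valid` from the list of violations keeps the two from drifting apart; an implementation that stops at the first error would simply record a single violation.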

trajano commented 6 years ago

That way, applications need not each develop their own standard for validation error messages.

handrews commented 6 years ago

@trajano I like the idea, in some form or another.

RFCs generally do not constrain implementations in this way, due to the wide variety of implementation languages, environments, and requirements such as performance vs usability. This idea helps interoperability among users of validators, rather than interoperability in schema processing and outcome. So I don't think this proposal belongs in the spec repo (at least not with its current scope).

I see two possible approaches:

Write a schema for the desired output and perhaps we can discuss it with implementation maintainers and publish it as a recommendation on the web site (which is a separate GitHub repo). Depending on how broadly it is adopted maybe there will be a clear need to formalize it further.
Approach it like the "application/problem+json" specification. That is an error-reporting format that is defined independent of the system producing the errors. You would obviously want specific schema error-oriented fields, and would not be going for the completely generic approach of that spec, but separating it encourages its use in implementations that can support it, without burdening more constrained implementations with support.

erayd commented 6 years ago

I really like this:

Approach it like the "application/problem+json" specification. That is an error-reporting format that is defined independent of the system producing the errors. You would obviously want specific schema error-oriented fields, and would not be going for the completely generic approach of that spec, but separating it encourages its use in implementations that can support it, without burdening more constrained implementations with support.

If it's attractive enough, I would certainly support it in the implementations I'm involved with.

trajano commented 6 years ago

I think application/violation+json as a MIME type sounds better. For those who may not know (like me) the thing before the +json is the MIME subtype.

handrews commented 6 years ago

@trajano @erayd I'm glad to see multiple people taking interest. I do not have time to focus on this right now, but here is what I would recommend:

Don't worry about making it an actual media type right now. You may want to do that, or you may (for instance) just want to define a specific extension of application/problem+json that can be identified by a schema describing the validation error output. Then you do not need to go through RFC, but can instead just apply two existing technologies and see how that goes to start.
File an issue to track this over in the json-schema-org.github.io repo. Storing your proposal there as extended documentation / best practices seems like the best starting point to me.
I'm going to close this as I do not think it belongs in this repo at this time. But that is just me managing the repo scope, not a rejection of the idea at all.
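
For illustration, the first suggestion above (an extension of application/problem+json) might be sketched as follows in Python. The `type` URI, the `status` value, and the `violations` member name are assumptions made up for this sketch; only `type`, `title`, `status`, and `detail` are standard problem-details members.

```python
import json
from typing import Any, Dict, List

PROBLEM_MEDIA_TYPE = "application/problem+json"


def validation_problem(violations: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a problem document that carries validation violations."""
    return {
        # "type", "title", "status" and "detail" are standard problem members.
        "type": "https://example.com/problems/validation-failed",  # hypothetical URI
        "title": "Validation failed",
        "status": 400,  # assumed; an API might prefer another 4xx code
        "detail": f"{len(violations)} violation(s) found",
        # "violations" is an extension member invented for this sketch.
        "violations": violations,
    }


if __name__ == "__main__":
    problem = validation_problem(
        [{"property-path": "$/path/to[10]/bad/element", "invalid-value": "who's bad"}]
    )
    print(json.dumps(problem, indent=2))
```

A schema describing the `violations` member could then identify this particular extension without registering a new media type.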

trajano commented 6 years ago

Closing this issue since it is out of scope of this project.

yurikhan commented 6 years ago

I am interested in this. Subscribing for comments, even if only to know which other issue to track.

handrews commented 6 years ago

There have been a few other issues around error reporting and/or formal definitions of validation results, so I have made an "output" label to track these. I will re-open this and label it, at least until we figure out where all of these concerns should really live.

Somewhat related: I'm defining a "recommended" output format for Hyper-Schema, although it is not a strict requirement. It does seem useful, though, which is why I am re-considering whether a recommendation, if not requirement, might be in-scope. The situation in hyper-schema is a little different, though (the mandatory output of the process is much more complex than a boolean result, even ignoring error reporting).

Anthropic commented 6 years ago

@handrews should validation results be a vocabulary? I think it would be awesome to have a consistent format/structure/schema for responses that validation implementations can work toward, to ensure compatibility and make switching from one validator to another more seamless. Just a week ago I posted this issue on djv to support a similar, more detailed, error message format to ajv. I think if ajv and djv, the two fastest validators, have a similar format, then it could act as a starting point for a predictable validation result schema.

handrews commented 6 years ago

@Anthropic I'm not sure I would consider output formats to be "vocabularies", as to me a vocabulary is a set of keywords to use in schemas to either annotate or assert conditions about an instance document. I think output documents are instances (and not also schemas). They can be described by a schema, but they do not constitute a schema as far as I can tell.

Unless I am misunderstanding what you mean? I don't think we need a new vocabulary to write a schema describing the output format.

When you look at the hyper-schema rewrite, you will see the definition of the output format there. Although it does not include an error reporting format.

I have no particular point here, just acknowledging that there is something to this topic and I'm not sure where it fits.

Anthropic commented 6 years ago

@handrews I see your non-particular point :wink:. What I meant was that the ajv error format, with keyword, dataPath, schemaPath, params and message, seemed to have its own keywords, so I wasn't sure where it belongs; you could say I was thinking out loud. I'm not concerned with the where so much as wanting to figure out, or get ideas on, the best place to define such a format.

gregsdennis commented 6 years ago

I'm already doing something similar in my implementation, though it's merely in an object model. I'd have to update that model and make it serializable, but it's not a stretch for me to support something like this.

I like the idea of standardizing output.

vearutop commented 6 years ago

An important thing is that validation failures can be deeply nested under *Of keywords, and to understand the root error you need to follow the sub-errors. Sample error message:

No valid results for oneOf {
 0: Enum failed, enum: ["a"], data: "f" at #->properties:root->patternProperties[^[a-zA-Z0-9_]+$]:zoo->oneOf[0]
 1: Enum failed, enum: ["b"], data: "f" at #->properties:root->patternProperties[^[a-zA-Z0-9_]+$]:zoo->oneOf[1]
 2: No valid results for anyOf {
   0: Enum failed, enum: ["c"], data: "f" at #->properties:root->patternProperties[^[a-zA-Z0-9_]+$]:zoo->oneOf[2]->$ref[#/cde]->anyOf[0]
   1: Enum failed, enum: ["d"], data: "f" at #->properties:root->patternProperties[^[a-zA-Z0-9_]+$]:zoo->oneOf[2]->$ref[#/cde]->anyOf[1]
   2: Enum failed, enum: ["e"], data: "f" at #->properties:root->patternProperties[^[a-zA-Z0-9_]+$]:zoo->oneOf[2]->$ref[#/cde]->anyOf[2]
 } at #->properties:root->patternProperties[^[a-zA-Z0-9_]+$]:zoo->oneOf[2]->$ref[#/cde]
} at #->properties:root->patternProperties[^[a-zA-Z0-9_]+$]:zoo

I'm not sure a standard way of error reporting is necessary (the only obvious use case to me is switching from one implementation to another while keeping the same error-handling code), but I would implement it.

handrews commented 6 years ago

@vearutop yeah, the *Of keywords are a major concern. I think the key for them is designing a simple raw error format that tools can navigate to provide better feedback. For example, being able to show errors in some sort of hierarchical drill-down display.

This is off the top of my head and not a serious proposal:

If I'm reading your example right you're working with a schema like the following:

{
  "type": "object",
  "properties": {
    "root": {
      "type": "object",
      "patternProperties": {
        "^[a-zA-Z0-9_]+$": {
          "oneOf": [
            {"enum": ["a"]},
            {"enum": ["b"]},
            {"$ref": "#/cde"}
          ]
        }
      }
    }
  },
  "cde": {
    "anyOf": [
      {"enum": ["c"]},
      {"enum": ["d"]},
      {"enum": ["e"]}
    ]
  }
}

with an instance of:

{
  "root": {
    "zoo": "f"
  }
}

An error data structure could look something like:

[
  {
    "instanceLocation": "/root/zoo",
    "instanceData": "f",
    "errors": [
      {
        "schemaLocation": ["#/properties/root/patternProperties/^[a-zA-Z0-9_]+$/oneOf"],
        "validSubschemas": [],
        "subschemaErrors": [
          {
            "schemaLocation": ["#/properties/root/patternProperties/^[a-zA-Z0-9_]+$/oneOf/0/enum"],
            "schemaValue": ["a"]
          },
          {
            "schemaLocation": ["#/properties/root/patternProperties/^[a-zA-Z0-9_]+$/oneOf/1/enum"],
            "schemaValue": ["b"]
          },
          {
            "schemaLocation": ["#/properties/root/patternProperties/^[a-zA-Z0-9_]+$/oneOf/2/$ref", "#/cde/anyOf"],
            "validSubschemas": [],
            "subschemaErrors": [
              {
                "schemaLocation": ["#/properties/root/patternProperties/^[a-zA-Z0-9_]+$/oneOf/2/$ref", "#/cde/anyOf/0/enum"],
                "schemaValue": ["c"]
              },
              {
                "schemaLocation": ["#/properties/root/patternProperties/^[a-zA-Z0-9_]+$/oneOf/2/$ref", "#/cde/anyOf/1/enum"],
                "schemaValue": ["d"]
              },
              {
                "schemaLocation": ["#/properties/root/patternProperties/^[a-zA-Z0-9_]+$/oneOf/2/$ref", "#/cde/anyOf/2/enum"],
                "schemaValue": ["e"]
              }
            ]
          }
        ]
      }
    ]
  }
]

This organizes errors first by the location in the instance that is failing validation, then indicates each schema location against which it fails that may be the cause of the overall failure.

Instance locations are plain (non-URI-fragment) JSON Pointers, while schema locations are arrays of URIs with JSON pointer fragments where a new pointer is added to the array whenever you cross a $ref.

If any of the subschemas were valid, validSubschemas would provide the URIs with JSON Pointer fragments of those subschemas. In this case, none are, so it's an empty array.

The goal here is to provide all of the key data in a standardized way, but not necessarily the natural language phrasing. So there's nothing here that says "Value not present in enum", because there are many ways that that phrasing might be chosen, not to mention I18N/L10N/A11Y issues.

But this approach would let tools display something simple like "No valid oneOf subschemas at..." and only display each individual error when drilling down. In my experience, the biggest problem with *Of error reporting is that it's hard to sort through everything to figure out what the real problem is vs "errors" that are expected (particularly with oneOf where all but one should fail).
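
As a rough illustration of the drill-down idea, here is a minimal Python sketch that walks an error object shaped like the one in the example and produces an indented outline. The function name `summarize` and the wording of the summary lines are assumptions for this sketch; the key names (`schemaLocation`, `subschemaErrors`, `schemaValue`) are taken from the example.

```python
from typing import Any, Dict, List


def summarize(error: Dict[str, Any], depth: int = 0) -> List[str]:
    """Return indented one-line summaries for an error and its sub-errors."""
    indent = "  " * depth
    # The last entry of schemaLocation is the pointer after the final $ref hop.
    location = error["schemaLocation"][-1]
    sub_errors = error.get("subschemaErrors", [])
    if sub_errors:
        lines = [f"{indent}No valid subschemas at {location}"]
        for sub in sub_errors:
            lines.extend(summarize(sub, depth + 1))
        return lines
    return [f"{indent}Failed {location} (schema value: {error.get('schemaValue')!r})"]


# A trimmed-down version of the error structure, with shortened pointers.
example = {
    "schemaLocation": ["#/oneOf"],
    "validSubschemas": [],
    "subschemaErrors": [
        {"schemaLocation": ["#/oneOf/0/enum"], "schemaValue": ["a"]},
        {
            "schemaLocation": ["#/oneOf/1/$ref", "#/cde/anyOf"],
            "validSubschemas": [],
            "subschemaErrors": [
                {
                    "schemaLocation": ["#/oneOf/1/$ref", "#/cde/anyOf/0/enum"],
                    "schemaValue": ["c"],
                }
            ],
        },
    ],
}

for line in summarize(example):
    print(line)
```

A display tool could show only the top-level line by default and reveal the indented lines on request, which keeps the expected oneOf failures out of the way until they are needed.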

handrews commented 6 years ago

Since draft-08 is intended to include annotation collection, which would involve a recommended output format, we should really handle a recommended error format at the same time. That would cover all of this issue, I believe, so I'm adding it to the milestone.

vearutop commented 6 years ago

I think it is worth adding a keyword property to the error data structure. @handrews I've tried to describe your example (with a few changes) in JSON Schema: https://gist.github.com/vearutop/17ef696fe1426844b302e844076400c5

wichert commented 6 years ago

@handrews I like your approach. From a UI perspective it looks like it provides all information necessary to produce useful error message, without leaking any implementation details.

There is a very nice quality to your proposal: it gives a very convenient way to define good error messages, by putting them directly in the schema and looking them up using schemaLocation, so you can do something like this:

{
  "type": "object",
  "properties": {
    "email": {
      "type": "string",
      "format": "email",
      "error": {
        "required": "Please enter your personal email address",
        "format": "Please enter a valid email address."
      }
    }
  }
}

If schemaLocation is #/properties/email/format, the error message is at #/properties/email/error/format. You can of course keep the error messages in a separate structure as well instead of polluting the schema. This is well out of scope for the current discussion, but I can see it being very attractive.
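
A minimal sketch of that lookup in Python, assuming the message map lives under an `error` key next to the failing keyword, as in the schema above. The helper names `resolve_pointer` and `custom_message` are made up for this sketch, and percent-decoding of URI fragments is deliberately not handled.

```python
from typing import Any, Optional


def resolve_pointer(document: Any, pointer: str) -> Any:
    """Resolve a JSON Pointer (optionally written as a '#...' URI fragment)."""
    pointer = pointer.lstrip("#")
    if pointer == "":
        return document
    current = document
    for token in pointer.split("/")[1:]:
        # Undo JSON Pointer escaping: "~1" is "/", "~0" is "~" (RFC 6901).
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]
        else:
            current = current[token]
    return current


def custom_message(schema: Any, schema_location: str) -> Optional[str]:
    """Look up an author-supplied message for the keyword at schema_location."""
    # Split off the last pointer segment, which names the failing keyword.
    parent_pointer, _, keyword = schema_location.rpartition("/")
    subschema = resolve_pointer(schema, parent_pointer)
    # "error" is the extension keyword used in the example schema.
    return subschema.get("error", {}).get(keyword)
```

Returning `None` when no message is defined lets the caller fall back to whatever default text the validator or display layer provides.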

wichert commented 6 years ago

I realise that in my previous comment I made an assumption that you can just remove the last part of the JSON Pointer in schemaLocation to get a pointer to the variable. Is that a valid assumption, or would it be useful to add a variableLocation key?

handrews commented 6 years ago

@wichert

it provides all information necessary to produce useful error message, without leaking any implementation details.

That's a great way to summarize one of the key use cases. We don't want to tell people how to display errors, and we don't want display code to have to understand how any one specific validator (or hyper-schema client, or whatever) works. It may help to understand how they work in a general sense, but it should be easy for someone to write an error display library independent of any validator.

Reporting errors seems to be one of the hardest parts of writing a validator, and this could separate those concerns.

I realise that in my previous comment I made an assumption that you can just remove the last part of the JSON Pointer in schemaLocation to get a pointer to the variable. Is that a valid assumption, or would it be useful to add a variableLocation key?

By "variable" do you mean the name of the object property with the value that was in error? That should be apparent from instanceLocation, which is up at the top of my example. There may be multiple errors for multiple schema locations for a single instance location.

Regarding error strings or other messages, see also #148 (Add "messages" property) and #270 (Support for error "level"). I've long been skeptical of both as a part of the standard, but I suppose having a standard output/error format may make this sort of thing fit better. I think they also work very well as extension properties (even if the implementation does not use them, when you go to look at the schema you'll see them). And it's not clear where they go- core because they're a very general thing? Validation because it's where we have assertions for the most part? And how keyword-specific would message structures end up being?

So I'd probably prefer to leave messages and levels to extensions for now and possibly adopt them into a spec if a clear convention emerges.

handrews commented 6 years ago

@vearutop thanks for writing up the schema! I actually had a keyword property in my example at first, and then had some concern over not all schema locations corresponding directly to a keyword.

But thinking about it more (I wrote the example very quickly) I agree that most if not all are related to a keyword (e.g. ".../oneOf/2" is related to oneOf even if that schema location is only part of the oneOf value), and having to find the most relevant keyword in a JSON Pointer is annoying.

I've only had a chance to skim over the other details of your alterations but I like the direction. We don't want to pin things down too much because the error format should work for extension keywords or future vocabularies without having to update it for the specific additional keywords involved. But I think we can categorize keywords in a way that strikes a balance between listing them out and not being able to indicate different error formats (such as subschema errors) at all.

Julian commented 6 years ago

For reference, this is what the Python implementation uses here:

http://python-jsonschema.readthedocs.io/en/latest/errors/

Though yeah not in love with the names we ended up picking.

Will have to compare that to the proposed schema above though I'm pretty sure we at least cover the same information.

yurikhan commented 6 years ago

For another reference, here’s the description of what I implemented for RapidJSON: http://rapidjson.org/md_doc_schema.html#Reporting

gregsdennis commented 6 years ago

With some direction from @handrews, I think I have a format that addresses all of the primary concerns while also considering the advent of annotation collection.

EDIT This is not the approach I currently use in my implementation. It's a new design that meets the requirements below.

In short, we want a format that

Is easy to navigate
Combines the instance data with the erroring/annotating keyword that pertains to it
Provides paths to keywords that produce errors/annotations to be expressed in two formats
- relative URI to indicate how the keyword was reached
- absolute URI to provide a direct link (so that applications don't have to follow a dereference chain)

To address the first, the format mimics the structure of the JSON instance being validated.

To address the second and third, a recursive output format has been created that contains

the instance data
the overall validation result
any errors in the event of a failure
any annotations in the event of a success

Examples

Schema (for both examples)

{
  "type":"object",
  "title":"root",
  "$defs":{
    "integerMin5":{
      "title":"referenced title",
      "type":"integer",
      "minimum":5
    }
  },
  "properties":{
    "intProp":{
      "title":"it's an int",
      "description":"found the int",
      "type":"integer",
      "minimum":5
    },
    "nested":{
      "type":"object",
      "title":"internal",
      "anyOf":[
        {
          "title":"a passed",
          "required":["a"]
        },
        {
          "title":"b passed",
          "required":["b"]
        }
      ]
    },
    "offsetCoordinate":{
      "type":"object",
      "title":"an offset coordinate",
      "properties":{
        "X":{"$ref":"#/$defs/integerMin5"},
        "Y":{"$ref":"#/$defs/integerMin5"}
      }
    }
  }
}

Instances & Results

A valid instance

{
  "intProp":9,
  "nested":{
    "b":[]
  }
}

produces the output

{
  "instanceData":{
    "intProp":{
      "instanceData":9,
      "result":"passed",
      "annotations":[
        {
          "source":"/properties/intProp/title",
          "value":"it's an int"
        },
        {
          "source":"/properties/intProp/description",
          "value":"found the int"
        }
      ]
    },
    "nested":{
      "instanceData":{
        "b":{
          "instanceData":[],
          "result":"passed",
          "annotations":[
            {
              "source":"/properties/nested/anyOf/1/title",
              "value":"b passed"
            }
          ]
        }
      },
      "result":"passed",
      "annotations":[
        {
          "source":"/properties/nested/title",
          "value":"internal"
        }
      ]
    }
  },
  "result":"passed",
  "annotations":[
    {
      "source":"/title",
      "value":"root"
    }
  ]
}

A failing instance

{
  "intProp":9,
  "offsetCoordinate":{
    "X":10,
    "Y":3
  },
  "otherProp":true
}

produces the output

{
  "instanceData":{
    "intProp":{
      "instanceData":9,
      "result":"passed"
    },
    "offsetCoordinate":{
      "instanceData":{
        "X":{
          "instanceData":10,
          "result":"passed"
        },
        "Y":{
          "instanceData":3,
          "result":"failed",
          "errors":[
            {
              "source":"/properties/offsetCoordinate/properties/Y/$ref/minimum",
              "absoluteSource":"/$defs/integerMin5",
              "message":"The value '3' is not greater than or equal to '5'."
            }
          ]
        }
      },
      "result":"failed",
      "errors":[
        {
          "source":"/properties/offsetCoordinate",
          "message":"A subschema failed validation."
        }
      ]
    },
    "otherProp":{
      "instanceData":true,
      "result":"noRequirements"
    }
  },
  "result":"failed",
  "errors":[
    {
      "source":"/",
      "message":"A subschema failed validation."
    }
  ]
}

Note 1 All annotations are dropped when the overall schema fails, even those annotations for the portions of the instance that passed.

Note 2 The "A subschema failed validation" error may or may not be required. It may be sufficient to merely include errors on those elements that explicitly failed. This may be an implementation option.

Description of the output

The basic result object is recursive and can have the following properties:

instanceData - The value of the instance at the current location. Operates recursively to handle objects and arrays.
result - An enumeration of passed, failed, or noRequirements
annotations - Only present when result is passed. An array of annotation objects. Each annotation object contains:
- source - A JSON Pointer to the schema keyword that produced the annotation, relative from the root of the schema.
- absoluteSource - If source contains a $ref, this is an absolute URI to the source of the annotation.
- value - The annotation value.
errors - Only present when result is failed. An array of error objects. Each error object contains:
- source - A JSON Pointer to the schema keyword that failed validation, relative from the root of the schema.
- absoluteSource - If source contains a $ref, this is an absolute URI to the source of the failed keyword.
- message - The error message.

Whenever a value in the JSON instance does not have any requirements, result will be noRequirements, and any sub-data of the instance is ignored (so there is no recursion into these values). Additionally, neither errors nor annotations will be present in the result object.

For instanceData, when validation takes place or annotations are collected, recursion occurs in the case of arrays and objects. For arrays, each item in the array is encapsulated in the result object. For objects, the keys are maintained and the value for each key is encapsulated in the result object. This results in similar navigation between the original JSON instance and its validation results object. For example, if a path to a value in the original instance is

/a/b/3/c

the corresponding validation results can be found in the results object at the path

/instanceData/a/instanceData/b/instanceData/3/instanceData/c

This is easily transformable by prepending/omitting /instanceData for each segment of the pointer path.
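
The transformation just described can be sketched in a few lines of Python. The function names are made up for this sketch, and the pointers are treated as plain JSON Pointers with no URI-fragment encoding.

```python
def to_results_pointer(instance_pointer: str) -> str:
    """Map a JSON Pointer into the instance to one into the results object."""
    if instance_pointer == "":
        return ""  # the root result object describes the root instance
    segments = instance_pointer.split("/")[1:]
    return "".join(f"/instanceData/{segment}" for segment in segments)


def to_instance_pointer(results_pointer: str) -> str:
    """Inverse mapping: drop the '/instanceData' hop before each segment."""
    tokens = results_pointer.split("/")[1:]
    # Tokens alternate: "instanceData", segment, "instanceData", segment, ...
    if len(tokens) % 2 != 0 or any(t != "instanceData" for t in tokens[0::2]):
        raise ValueError(f"not a results pointer: {results_pointer!r}")
    return "".join(f"/{segment}" for segment in tokens[1::2])
```

Because only the even-positioned tokens are checked and dropped, an instance property that happens to be named instanceData still round-trips correctly.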

absoluteSource only appears when source contains one or more $refs, otherwise these values will be the same. The absolute URI could point to a definition inside the same schema or to an external schema source.

Output schema

As a bonus, here's a schema that describes the results object:

{
  "type":"object",
  "$defs":{
    "base":{
      "type":"object",
      "properties":{
        "instanceData":{
          "oneOf":[
            {"type":["string","integer","number","boolean","null"]},
            {
              "type":"array",
              "items":{"$ref":"#"}
            },
            {
              "type":"object",
              "additionalProperties":{"$ref":"#"}
            }
          ]
        }
      },
      "required":["instanceData"]
    },
    "annotation":{
      "type":"object",
      "properties":{
        "source":{
          "type":"string",
          "format":"uri"
        },
        "absoluteSource":{
          "type":"string",
          "format":"uri"
        },
        "value":true
      },
      "required":["source","value"]
    },
    "passed":{
      "type":"object",
      "properties":{
        "result":{"const":"passed"},
        "annotations":{
          "type":"array",
          "items":{"$ref":"#/$defs/annotation"}
        }
      },
      "required":["result"]
    },
    "error":{
      "type":"object",
      "properties":{
        "source":{
          "type":"string",
          "format":"uri"
        },
        "absoluteSource":{
          "type":"string",
          "format":"uri"
        },
        "message":{"type":"string"}
      },
      "required":["source","message"]
    },
    "failed":{
      "type":"object",
      "properties":{
        "result":{"const":"failed"},
        "errors":{
          "type":"array",
          "items":{"$ref":"#/$defs/error"}
        }
      },
      "required":["result", "errors"]
    },
    "noRequirements":{
      "type":"object",
      "properties":{
        "result":{"const":"noRequirements"}
      }
    }
  },
  "allOf":[
    {"$ref":"#/$defs/base"}
  ],
  "oneOf":[
    {"$ref":"#/$defs/passed"},
    {"$ref":"#/$defs/failed"},
    {"$ref":"#/$defs/noRequirements"}
  ]
}

trajano commented 6 years ago

I both like and dislike the success returning processing results.

I like it because I can see the processing results in case I have a false positive on the schema.

I dislike it because it will use up space where most cases would be happy path.

Perhaps implementers need to make this configurable

gregsdennis commented 6 years ago

@trajano the main purpose of returning the subschema results on success is to support annotations. Maybe if there are no annotations for a subschema, that branch can be collapsed.

I agree that this can be an implementation option.

Edit: Annotations are intended to be an optional result when validating. If the user opts to exclude annotations, a simple true would suffice for successful validations.

gregsdennis commented 6 years ago

@Relequestual @epoberezkin @awwright You guys are notably quiet on this issue. Have ye any thoughts?

awwright commented 6 years ago

I don't think it makes sense to define any sort of implementation details. We're defining a media type, not an API.

gregsdennis commented 6 years ago

@awwright we're trying to make the validation/annotation results part of the spec (or at least a parallel spec).

As far as my implementation goes, I would create an object model to represent the above structure (or whatever structure results from this discussion) that would then serialize appropriately.

The point is that output varies between implementations, and we're trying to standardize it. This issue is less about a media type (yes, it's mentioned above), and more about a standard output format for validation results and annotations.

gregsdennis commented 6 years ago

@handrews would you apply the annotation label to this as well, please?

handrews commented 6 years ago

@gregsdennis label added

@awwright There are two things going on here:

Annotation collection, which has been a major focus for draft-08 for many, many reasons. Hyper-Schema is essentially a complex annotation. Code generation, UI generation, documentation generation- all done with annotations. We need to provide guidance on what annotation output looks like or else applications will not be able to reliably make use of them. We already provide a RECOMMENDED output format for hyper-schema. It is my intention for this output to also be RECOMMENDED rather than required (and it might mean we have to update the hyper-schema output to fit this, but that's fine, there's still only one new hyper-schema implementation that would have to change AFAIK)
One of the concerns raised when we spoke with the IETF JSON working group was that the implementations they had seen had inconsistent and usually unsatisfactory error reporting. This was coming from one of the more helpful people, so I am taking that feedback seriously. There have been many less formal complaints about the difficulties around error reporting, so providing a recommendation on what needs to be reported will help many things. Again, this would be RECOMMENDED rather than MUST.

It should be noted that an implementation is not required to collect annotations, and should not necessarily be required to produce errors (although unless there is a truly compelling reason why the target environment makes error reporting infeasible, I doubt such an implementation would get used much).

All: I am under the weather with a bad cold and will not be contributing much to this discussion for the next few days. Other stuff that I've been posting or commenting on here is stuff that was already mostly complete (e.g. half-finished PRs that I just tidied up and pushed), or that I'd done the hard thinking about already and just needed a simple write-up.

In any event, I think that this is a topic that is best addressed by those who implement the spec, or who are implementing applications on top of spec implementations. So I hope to see more commentary comparing proposals / existing output and error formats from @Julian @yurikhan @erayd @erosb and whoever else I'm forgetting.

Let's set aside the question of whether this output format is within the scope of the spec- it is obviously of great interest to many people, and the error side at least is known to be of interest to influential IETF people. Whether this goes in the spec or as a less formal recommendation on the web site is irrelevant for now, let's just figure out what we want to recommend.

handrews commented 6 years ago

Also paging @mokkabonna @johandorland @korzio @davishmcclurg

handrews commented 6 years ago

@gregsdennis @yurikhan @Julian @trajano @vearutop you have all made or substantially commented on proposals, or shown what your implementation is doing in this area.

I don't have time right now to mediate a discussion, but as someone who does not write an implementation, I don't think I'm the most important person here anyway. If you can agree, or come close to an agreement, on what this should look like, that would be of tremendous use to the project. I'm also asking @philsturgeon if he can help moderate, although I'm not expecting him to add more proposals (unless he really wants to).

As noted earlier, we don't need to worry about whether this is part of the spec or a less formal recommendation on the web site, there is plenty of demand for this either way. I'd just like for the community to settle on something while I work on all of the other things going into this next draft.

Anthropic commented 6 years ago

I started a validator wrapper to ensure it was easier to switch validators in future if needed and @korzio was awesome enough to start changing djv errors in the direction of ajv's current error format for consistency.

My feedback would be:

A minimal output option for performance along with more verbose would be beneficial.

Sometimes the end user wants to "stop on first error", it would be worth considering how that could or should look as well perhaps, if there is a recursive result.
Recursing a tree doesn't give quick access to error counts, whereas an array of EDOs (Error Description Objects) that conform to the requirements @gregsdennis laid out (instanceData etc.) would.

A recursive tree output for a massive schema with hundreds of fields would be quite a performance burden to report back for an error on one field, how can that be addressed? Was it already and I missed it?
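To make the minimal/verbose and "stop on first error" points concrete, here is a rough Python sketch. The `validate` and `iter_errors` functions, the `stop_on_first` and `verbose` flags, and the toy required-property check are all made up for illustration; they are not taken from ajv, djv, or any other implementation.

```python
from typing import Any, Dict, Iterator, List, Union


def iter_errors(instance: Dict[str, Any], required: List[str]) -> Iterator[Dict[str, str]]:
    """Toy check: yield one flat error object per missing required property."""
    for name in required:
        if name not in instance:
            yield {
                "dataPath": "",  # reported at the parent object
                "schemaPath": "/required",
                "message": f"missing required property '{name}'",
            }


def validate(
    instance: Dict[str, Any],
    required: List[str],
    stop_on_first: bool = False,
    verbose: bool = True,
) -> Union[bool, List[Dict[str, str]]]:
    """Return a bare bool in minimal mode, otherwise a flat list of errors."""
    errors = []
    for error in iter_errors(instance, required):
        errors.append(error)
        # Minimal mode only needs to know that at least one error exists.
        if stop_on_first or not verbose:
            break
    if not verbose:
        return not errors
    return errors


all_errors = validate({}, ["a", "b", "c"])
first_only = validate({}, ["a", "b", "c"], stop_on_first=True)
minimal = validate({}, ["a", "b", "c"], verbose=False)
print(len(all_errors), len(first_only), minimal)  # 3 1 False
```

With a flat array the error count is just `len(...)`, and stopping early is only a matter of not consuming the rest of the iterator.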

Thanks @handrews & @gregsdennis for continuing to draw attention to this issue! :)

gregsdennis commented 6 years ago

It looks like the AJV format meets most of the requirements I put forth. It would need to be expanded to include annotations (which don't necessarily have to be populated) and a direct link to the schema source (in the case of $refs).

It wouldn't be too hard for me to implement that. I just want to nail something down before I muck about in my code. I don't want to start changing it only to have to change it again soon.

Edit

The primary difference is that AJV outputs an array of results where each result carries a pointer to both the offending instance data and the schema keyword that deemed it so, whereas my format uses the nested nature of the instance JSON as the output structure.

vearutop commented 6 years ago

https://runkit.com/embed/q7n6kpx6etcv here is a sample error reporting from ajv for ease of consideration.

I see two issues:

schemaPath only keeps last reference (origins are omitted in case of multiple references)
the structure is flat, missing hierarchy does not allow to read subschema impact

I think the error response structure is not to replace error messages, I see it as additional interface for programmable processing (not for end-user in any way), so error message phrase/translation/format is out of scope.

Performance-wise implementation could still return true/false for overall validation. Or it could return error response structure right after first root level schema invalidation. To me both of cases are out of scope of response structure format.

Maybe it would make sense to organize the error response as a map with data pointers as keys. That would make it quick to get a list of invalid properties, for example to paint them red in an HTML form.
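A rough Python sketch of that idea, assuming the validator already produces a flat list of error objects with a `dataPath` key. The helper name `index_by_data_path` and the sample errors are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def index_by_data_path(errors: List[dict]) -> Dict[str, List[dict]]:
    """Group a flat error list into a map keyed by instance (data) pointer."""
    by_path = defaultdict(list)
    for error in errors:
        by_path[error["dataPath"]].append(error)
    return dict(by_path)


# Hypothetical flat errors, as a validator might report them.
flat_errors = [
    {"dataPath": "/name", "schemaPath": "/properties/name/minLength", "message": "too short"},
    {"dataPath": "/name", "schemaPath": "/properties/name/pattern", "message": "bad characters"},
    {"dataPath": "/age", "schemaPath": "/properties/age/minimum", "message": "too small"},
]

error_map = index_by_data_path(flat_errors)
invalid_fields = sorted(error_map)
print(invalid_fields)  # ['/age', '/name']
```

The map can always be derived from a flat list like this, so it may not need to be the wire format itself.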

gregsdennis commented 6 years ago

@vearutop that's a nice example. Some things I'd like to see:

I think the output object needs a dataValue. May not work out well if the data is a large object/array, which is why I opted for the recursive instance-structured output.
How are $refs handled in reporting the schema location? We definitely need both a direct link (for ease of use) and a relative link (to indicate the path that resolved to it).
Need a similar output for annotations on successful validation.

Other than that, I think it's a good start. I imagine that most people have a similar format.

Anthropic commented 6 years ago

@gregsdennis I think dataValue/data should be an optional part of any error object definition, ajv only includes data if verbose output is requested for example and I just use the paths only in my library.

the structure is flat, missing hierarchy does not allow to read subschema impact

@vearutop ajv's format provides errors for each item even in the subschema (which I feel is too verbose for most needs but great when debugging), so you can get the full impact, can't you? What do you feel is missing from the output in your link that should be there?

Many frameworks have an Intermediate Representation of the schema (combined with implementation specifics), most likely flatter for machine processing, which is why I would need some convincing on a hierarchy providing benefit, not to say I can't be :trollface:

I mostly agree with the rest :+1: especially needing both $ref schema paths. Annotations I am not up to speed on their definition, maybe someone can ping me a TL;DR in slack :smile:

vearutop commented 6 years ago

@Anthropic please check https://github.com/json-schema-org/json-schema-spec/issues/396#issuecomment-389734109 for hierarchical error response example.

It lets you see that, at a high level, /root/zoo failed. It failed because #/properties/root/patternProperties/^[a-zA-Z0-9_]+$/oneOf did not match any subschemas.

If that reason is not enough, you can look further into why each branch failed: 0 and 1 directly by enum, 2 for a more complicated reason.

Then you can continue down the error structure and see that #/properties/root/patternProperties/^[a-zA-Z0-9_]+$/oneOf/2 failed because it jumped to another schema by $ref and then failed in anyOf, because all of the inner subschemas failed by enum.

So a hierarchical error response is a kind of story that you can read to get a full understanding of the failure reason, and you can stop reading at whatever level you find deep enough.

With a flat list it is not so apparent how to traverse from a high-level error to its root causes. By looping through the whole list you might be able to build the hierarchy.
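For what it's worth, rebuilding a hierarchy from a flat list could look roughly like the Python sketch below. It assumes each flat error carries a `schemaPath`, and it splits that path on `/` without handling JSON Pointer escapes. The `build_tree` helper and the shortened sample paths are hypothetical, not the output of any real validator.

```python
from typing import Dict, List


def new_node() -> Dict:
    return {"errors": [], "children": {}}


def build_tree(errors: List[dict]) -> Dict:
    """Nest flat errors into a tree keyed by schema path segments."""
    root = new_node()
    for error in errors:
        node = root
        # Naive split: ignores "~0"/"~1" escapes, which is fine for a sketch.
        for segment in [s for s in error["schemaPath"].split("/") if s]:
            node = node["children"].setdefault(segment, new_node())
        node["errors"].append(error["message"])
    return root


# Shortened, made-up schema paths standing in for the long ones above.
flat_errors = [
    {"schemaPath": "/oneOf", "message": "no subschema matched"},
    {"schemaPath": "/oneOf/0/enum", "message": "value not in enum"},
    {"schemaPath": "/oneOf/1/enum", "message": "value not in enum"},
]

tree = build_tree(flat_errors)
one_of = tree["children"]["oneOf"]
print(one_of["errors"], sorted(one_of["children"]))
# ['no subschema matched'] ['0', '1']
```

This only works if the flat list keeps the full path through every reference, which is the first issue I listed for the ajv output.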

gregsdennis commented 6 years ago

I thought it might be beneficial to have an AJV-style output example for the samples I presented earlier, so here it is. Note that this isn't AJV's output, it just uses that style. I've added the additional data that should be there, and consolidated the message and params properties for readability.

Example 1 (valid)

I added 'otherProp' to illustrate a point.

Instance

{
  "intProp":9,
  "nested":{
    "b":[]
  },
  "otherProp":true
}

Output

[
  {
    "dataPath":"/",
    "data":{
      "intProp":9,
      "nested":{
        "b":[]
      }
    },
    "result":"passed",
    "annotations":[
      {
        "source":"/title",
        "value":"root"
      }
    ]
  },
  {
    "dataPath":"/intProp",
    "data":9,
    "result":"passed",
    "annotations":[
      {
        "source":"/properties/intProp/title",
        "value":"it's an int"
      },
      {
        "source":"/properties/intProp/description",
        "value":"found the int"
      }
    ]
  },
  {
    "dataPath":"/nested",
    "data":{
      "b":[]
    },
    "result":"passed",
    "annotations":[
      {
        "source":"/properties/nested/title",
        "value":"internal"
      }
    ]
  },
  {
    "dataPath":"/nested/b",
    "data":[],
    "result":"passed",
    "annotations":[
      {
        "source":"/properties/intProp/title",
        "value":"it's an int"
      },
      {
        "source":"/properties/intProp/description",
        "value":"found the int"
      }
    ]
  }
]

Notes

Each element that has an annotation needs to have an element in the output array. This means that the output array can get rather large.
I don't mind the idea that the result and data properties be considered optional so that implementations can in-/exclude this based on a "verbose" flag or similar. But it's important that the spec defines it so that the output is consistent across implementations. Also, this format, when containing the data value, would contain multiple copies of portions of the instance, which is where the instance-based organization has a distinct advantage. (@Anthropic)
Only those nodes with annotations have entries. As a result, there's no indication that otherProp has no requirements.
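As a rough illustration of the "verbose" idea in the second note, the Python sketch below drops the optional `data` and `result` properties unless verbose output is requested. The `render` helper and the `OPTIONAL_KEYS` tuple are hypothetical names; only the entry shape is taken from the output above.

```python
from typing import List

# Properties an implementation might treat as optional (assumption).
OPTIONAL_KEYS = ("data", "result")


def render(entries: List[dict], verbose: bool = False) -> List[dict]:
    """Return copies of the entries, without optional keys unless verbose."""
    if verbose:
        return [dict(entry) for entry in entries]
    return [
        {key: value for key, value in entry.items() if key not in OPTIONAL_KEYS}
        for entry in entries
    ]


entries = [
    {
        "dataPath": "/intProp",
        "data": 9,
        "result": "passed",
        "annotations": [
            {"source": "/properties/intProp/title", "value": "it's an int"}
        ],
    }
]

compact = render(entries)
print(sorted(compact[0]))  # ['annotations', 'dataPath']
```

Either way the consumer sees the same property names, which is the consistency the spec would need to guarantee.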

Example 2 (invalid)

Instance

{
  "intProp":9,
  "offsetCoordinate":{
    "X":10,
    "Y":3
  },
  "otherProp":true
}

Output

[
  {
    "dataPath":"/",
    "data":{
      "intProp":9,
      "offsetCoordinate":{
        "X":10,
        "Y":3
      },
      "otherProp":true
    },
    "result":"failed",
    "errors":[
      {
        "source":"/properties/offsetCoordinate",
        "message":"A subschema failed validation."
      }
    ]
  },
  {
    "dataPath":"/offsetCoordinate",
    "data":{
      "X":10,
      "Y":3
    },
    "result":"failed",
    "errors":[
      {
        "source":"/properties/offsetCoordinate",
        "message":"A subschema failed validation."
      }
    ]
  },
  {
    "dataPath":"/offsetCoordinate/Y",
    "data":3,
    "result":"failed",
    "errors":[
      {
        "source":"/properties/offsetCoordinate/Y/$ref/minimum",
        "absoluteSource":"/$defs/integerMin5",
        "message":"The value '3' is not greater than or equal to '10'."
      }
    ]
  }
]

Notes

The first two entries in the output may be optional as their failure is implied. Technically, they should be present because a failure did occur at those nodes. Additionally, those elements allow for error messages to be logged for those locations.
Only those nodes with failures have entries. As a result, there's no indication that otherProp has no requirements.
All annotations are dropped, per annotation requirements.

philsturgeon commented 6 years ago

This thread looks like it's really making some progress, great job to everyone for working together on this.

I wonder if we can skip having the data field in there, as so far all proposed formats have a JSON Pointer. If folks want to look up the value, they can grab it from there.

gregsdennis commented 6 years ago

@philsturgeon one of the things that @handrews is pushing for is to have both the $ref path pointer and a direct pointer that needs no resolution. This is for convenience of the consumer.

It's following that logic that I keep putting the data value in there. I think having the data present is important. In that regard, the instance-based output I posted first prevents having duplicated instance data in the output.

handrews commented 6 years ago

@philsturgeon @gregsdennis I have not caught up on this thread, and have other things I'm focusing on right now, but I wanted to address the idea of putting the data in the output.

The reason for adding both the direct absolute schema pointer and a pointer-ish schema path is that both require handling all of the $id/$ref resolution in the context of performing validation. It would be so difficult for applications to do this that the feature would be worthless to any application that needed this information.

You need the path to decide which annotation to choose if you have application-specific conventions for such things (e.g. title adjacent to the reference overrides title in the referred schema). You need the absolute URI (and it needs to be a URI, not just a pointer) in case you want to look at what else was in that schema for whatever reason.

Applications do not have the same problem with instance data. Given a JSON Pointer and an instance, it is trivial to resolve the pointer (as in, anyone with access to a JSON Schema library should have a JSON Pointer library available as well, and JSON Pointer isn't exactly hard to implement if you really need to do it yourself).
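To show how little is involved, here is a minimal Python sketch of RFC 6901 pointer resolution. The function name `resolve_pointer` is just for illustration, and the instance is the one from Example 2 above. Note that in RFC 6901 the whole document is addressed by the empty string, while "/" addresses a member whose name is empty.

```python
from typing import Any


def resolve_pointer(document: Any, pointer: str) -> Any:
    """Resolve an RFC 6901 JSON Pointer against parsed JSON data."""
    if pointer == "":
        return document  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError(f"not a JSON Pointer: {pointer!r}")
    current = document
    for token in pointer[1:].split("/"):
        # Unescape "~1" before "~0", as RFC 6901 requires.
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]
        else:
            current = current[token]
    return current


instance = {
    "intProp": 9,
    "offsetCoordinate": {"X": 10, "Y": 3},
    "otherProp": True,
}

print(resolve_pointer(instance, "/offsetCoordinate/Y"))  # 3
```

Resolving schema locations is a different matter, because $id and $ref have to be followed first.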

Julian commented 6 years ago

I am as usual unable to stay up with the level of activity here (which is great!), but the one thing I caught in passing here while reading some of these comments I think might be important to point out –

In my implementation, and I suspect many many implementations, I would not implement this schema "directly" as the native error format – i.e., the result of validating a schema (in success or error) would be more "native" to my language and its norms. So for instance, it is super inconvenient to deal with JSON refs. Way better is to deal with runtime methods that you can call to get the value that the ref points to, and never deal with the ref ever, until you need to say, write it down for later.

What would happen, though, is that in my implementation, where something called ValidationError is the mechanism here and uses exceptions (the norm for errors in my language), terminology on some attributes would start to converge toward the sort of language used here (a slow process, of course, because I have to maintain backwards compatibility for quite a while). Perhaps more usefully, I'd add a ValidationError.as_spec or some better-named thing, which would return the JSON object here.
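Very roughly, and purely as a sketch, that could look like the Python below. This `ValidationError` class, its attributes, and the `dataPath`/`schemaPath`/`message` key names are assumptions made for illustration; none of it is the actual API of any released library, and the interchange keys would be whatever this thread settles on.

```python
from typing import Dict, Iterable, Union

PathPart = Union[str, int]


def to_pointer(parts: Iterable[PathPart]) -> str:
    """Encode path parts as a JSON Pointer, escaping "~" before "/"."""
    return "".join(
        "/" + str(part).replace("~", "~0").replace("/", "~1") for part in parts
    )


class ValidationError(Exception):
    """Language-native error that can also render an interchange object."""

    def __init__(self, message: str, instance_path=(), schema_path=()):
        super().__init__(message)
        self.message = message
        self.instance_path = list(instance_path)
        self.schema_path = list(schema_path)

    def as_spec(self) -> Dict[str, str]:
        # Only written down as pointers when leaving the runtime.
        return {
            "dataPath": to_pointer(self.instance_path),
            "schemaPath": to_pointer(self.schema_path),
            "message": self.message,
        }


try:
    raise ValidationError(
        "3 is less than the minimum",
        instance_path=["offsetCoordinate", "Y"],
        schema_path=["properties", "offsetCoordinate", "properties", "Y", "minimum"],
    )
except ValidationError as error:
    spec = error.as_spec()

print(spec["dataPath"])  # /offsetCoordinate/Y
```

Inside the process you would keep using the exception and its attributes; the pointers only appear when you need to write the error down.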

Just pointing that out because I think it's important for a comment or two here on convenience -- convenience for runtime I'd provide no matter what's in this schema, so I'd be focused more on necessity than runtime convenience.

handrews commented 6 years ago

@Julian you're thinking primarily in terms of people using your library's programmatic interface within a single Python process. That's not really what this issue is about- the specification cannot address the idioms of all programming languages, and of course implementations should be tailored to suit their languages and environments.

But consider a system with an API that is used by the company's web interface using HTML forms with JavaScript, and is also used by customers who call the API directly.

The JavaScript on the web page uses the schema to do client-side validation with a JavaScript implementation. The API back end could be implemented in Python with your library. It also needs to do validation because not all requests come from the web site.

Let's also assume that the schema uses at least one custom keyword that does some sort of complicated back-end verification. It doesn't really matter what it is, it just can't be done in the browser. That means that the browser might also get validation errors back from the server.

So now the code in the browser has to handle processing and displaying errors from two different implementations in two different programming languages. One of which is coming over the network. That means that without an interoperable format, that client code needs to understand multiple ways of presenting the results and errors. Or the back-end code needs to understand how to translate to whatever the front-end code can use.

Let's further complicate this by assuming a microservice environment, so some API calls are implemented in Go. Now there are at least three implementations that the front end has to deal with. And really, the front end and back end should not care which implementations are being used elsewhere. It should be possible to change languages and implementations in one place without having to update everything else.

To build a real, broadly interoperable ecosystem around JSON Schema, we need to think beyond the idioms and conveniences of individual languages and environments. They are still important, but they also do not need help from the spec (in fact, the spec trying to mandate idiomatic use would not scale and would generally be a mess). But the broader ecosystem does need help from the spec, at least at the level of a recommendation.

Implementors should not have to guess about whose output format to mimic, or have to convince each other to support particular formats or concepts out in the wild. We can have that discussion here and give everyone an interoperability target.

Julian commented 6 years ago

That was quite a wide-reaching response for quite a narrow comment :). It of course all makes sense, but I'm not sure what it has to do with my above -- will try and clarify.

I'm in fact very much not thinking of just my own implementation or environment, I was coming from exactly the opposite side of "I wouldn't here be aiming for things like 'convenience' as the primary concern when designing this format". I'd aim for standardization and terseness. Convenience can come from something using this format to do something.

To be more specific (I'm not sure I even agree with the following, but IIRC it was one of the things that triggered me at least making the above point) -- why should this format include both a pointer and the data at that pointer (which from my skim is apparently being proposed)? It's true that's something e.g. my library does, but that's for convenience of a runtime user -- why does the 'standard schema for results' want to be anything more than a minimal description of the success or failure?

TL;DR: ISTM smaller is better.

Addendum, to try and be useful here too: the kinds of questions that have come up in my own experience here are things like "for the required keyword, where is the error? At the missing not-present property, or at the parent?". I go with the latter, and therefore claim to guarantee "all paths you find in errors are resolvable against the instance", but I suspect others might have made other choices. Same goes for "where is the error for additional unexpected properties that were disallowed by additionalProperties?"

And on another route, "what info is provided for errors from the format validator, and where does specific information about them go"

handrews commented 6 years ago

@Julian my apologies, I misread your comment as something of a "putting stuff in a standard form like this isn't that useful because I'm just going to make a different interface for Python anyway." I see that that was incorrect. I will follow up on your other points later.

Julian commented 6 years ago

Ah yes! Very much not that -- definitely more:

If you make a small standard thing, we can all provide that as an interchange, but need not worry about making that small standard thing all-encompassing, because convenience is, well, in the eye of the beholder and very context-driven.

(And no worries! Definitely didn't assume it was hostile :)

gregsdennis commented 6 years ago

It sounds like we don't want to include the instance data in the results. In that case, the flat list format with instance pointers for

the instance location
the schema $ref path
the resolved schema

will be fine for the spec. Implementations can augment the results as they see fit, but I agree that the spec should indicate a minimum set of requirements.

Furthermore, when performing a simple validation, and the instance passes, it should be sufficient to respond with true. However, when annotations are requested, I suggest we follow a similar format as with error output, if for no other reason than consistency.
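As a strawman, the minimum described above might be sketched in Python like this. The `ErrorEntry` class, its field names, and the `to_output` helper are placeholders invented for illustration; the pointer values are copied from Example 2.

```python
from dataclasses import asdict, dataclass
from typing import List, Union


@dataclass
class ErrorEntry:
    instance_location: str  # pointer into the instance
    schema_path: str        # path through the schema, including $ref hops
    resolved_schema: str    # direct location of the schema that failed
    message: str


def to_output(errors: List[ErrorEntry]) -> Union[bool, List[dict]]:
    """Respond with plain true on success, else a flat list of entries."""
    if not errors:
        return True
    return [asdict(error) for error in errors]


failure = ErrorEntry(
    instance_location="/offsetCoordinate/Y",
    schema_path="/properties/offsetCoordinate/Y/$ref/minimum",
    resolved_schema="/$defs/integerMin5",
    message="A subschema failed validation.",
)

print(to_output([]))  # True
print(to_output([failure])[0]["instance_location"])  # /offsetCoordinate/Y
```

Anything beyond these fields would be an implementation-specific extension.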
