Is there a standard notion of "best" error?

Julian commented 6 years ago

So, following up on #396, one thing I'd also love to see discussed or addressed is what my implementation calls best_match.

The purpose of this function is to heuristically answer the question "given a schema and instance with multiple issues, what is the most fundamentally wrong with the instance?".

An example to illustrate:

having the wrong type is a pretty big deal, and usually indicates that something is very very wrong with the instance
having everything correct but being slightly too short or long is way less fundamentally wrong with the instance

So if you imagine an instance with both issues, we'd select the first error over the second one.

So, question: how well defined is this notion, and can we standardize an algorithm here?

Here's my implementation (which by the way is not promised to return the same values over time, in case I come up with a better implementation, but that hasn't happened yet):

https://github.com/Julian/jsonschema/blob/master/jsonschema/exceptions.py#L274-L278

You'll find that this touches quickly on a notion of "descent" into anyOf / oneOf / allOf sub-errors, which I find pretty interesting to think about too, but I'll refrain from elaborating on that until someone else confirms this is interesting to discuss.

Anthropic commented 6 years ago

Do you mean something like a severity index for reported errors? Is that something that can relate to #270?

Julian commented 6 years ago

It's only tangentially related. That ticket it looks like is about being able to explicitly define an error level. I.e., it's for a new spec feature. This one isn't, it's about whether there is an existing well-defined heuristic we can agree on to help answer "imagine you are only allowed to show one and only one validation error. Which one should you show?"

handrews commented 6 years ago

@Julian I really like this idea. In one possible example I gave in #396 I tried a little bit to organize the errors, but "error with oneOf" is the simplest error to report but not really the most useful. That's a hard one, but I like your simpler example of type vs restrictions within type.

It would need to account for custom keywords somehow (which is more a question for #602 and the whole custom keyword registration requirements, so we needn't get into it here).

Julian commented 6 years ago

It would need to account for custom keywords somehow (which is more a question for #602 and the whole custom keyword registration requirements, so we needn't get into it here).

Yeah -- so, my implementation only defines a partial ordering, where anything that hasn't been explicitly deemed "low importance" or "high importance" essentially gets arbitrarily ordered, but certainly looks like it'd be useful to allow custom keywords to hint at where they belong (which maybe goes back to @Anthropic's point).

Anthropic commented 6 years ago

@Julian yeah I read your implementation and felt like a severity index covered that, but understood from your response that you wanted to go beyond that with this issue. I think having an error index which can then correspond to a severity index could be of real benefit for consistency of error prioritisation across implementation. It would need to factor in if the basic error is within an *Of as @handrews mentioned which goes back to the defined algorithm I thought you were suggesting, is that about right-ish?

awwright commented 6 years ago

I would venture to guess this is covered by several other issues. Maybe #31 ?

First, keywords are designed to be non-overlapping. If you have a schema that describes how a string is supposed to be formatted, but you provide a number, that number doesn't trip any of the assertions except "type".

Really the only case where there's a problem is with schemas that define multiple disjoint & nonadjacent sets of valid values, like oneOf can do.

https://stackoverflow.com/questions/49823500/how-to-validate-a-json-object-against-a-json-schema-based-on-objects-type-descr/49996397#49996397

I visualize the problem like this: Imagine we can plot valid values on a 2D axis (imagine, say, JSON Schema only describes numbers). The vast majority of schemas describe a valid region that's just a circle. If you're outside the circle region, you know which direction to travel to make the instance valid again.

But sometimes, you have a complex schema. Maybe it's a checkerboard pattern, or two circles, each on opposite sides, and far away from, the Y-axis. Which one did the user intend to target?

Does this sound on track?

handrews commented 6 years ago

@awwright this is a more general issue than #31 (I'd actually much rather close out #31 as it kind of wandered off into a proposal for select which I don't think is necessary or even viable).

I'm not sure I follow your analogy. Yes, a lot of schemas have straightforward error reporting, but the difficult ones are of interest here. And also optimizing the error experience, which people have complained about a lot across multiple implementations, including in the IETF JSON working group.

While I can sort of see why you're talking about overlapping keywords and type, I think @Julian's examples can be tweaked to avoid that problem.

json-schema-org / json-schema-spec

Is there a standard notion of "best" error? #632