jcrist / msgspec

A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML
https://jcristharif.com/msgspec/
BSD 3-Clause "New" or "Revised" License
2.39k stars 72 forks source link

Pass raw error data to ValidationError #264

Open IAmTomahawkx opened 1 year ago

IAmTomahawkx commented 1 year ago

It would be great if we could get the actual values of validation errors passed to ValidationError as attributes. Currently the best way to do this is to pass a regex over the error string ("Expected str, got bool - at $.files[0].content") to get the 3 values, expected_type, received_type, and location. From my uneducated view, it shouldn't be too hard to pass those through, as they're already available when the error string is assembled.

jcrist commented 1 year ago

Thanks for opening this issue.

Can you explain your use case here? What do you hope to do with this information? What are the types of expected_type/received_type/location here? Strings? Something else?

Note that not all ValidationError messages have an expected_type/received_type, but all of them do have a location.

jcrist commented 1 year ago

@IAmTomahawkx any reply on the above?

IAmTomahawkx commented 1 year ago

I did realize after making the issue that not all ValidationErrors have the same messages. Personally my intended usecase would be to return a structured response from an API endpoint instead of an error string.

I suppose this could in theory be solved by making a subclass of ValidationError for type errors (TypeValidationError?), with added attributes. This should keep compatibility to existing codebases using except ValidationError: while allowing for more toned in error handling.

As for the types, personally I don't care (as I'd be stringing them to send out as json anyways), but i suppose it would make most sense as actual types instead of strings. How feasible that is, i ain't familiar with the codebase so that's not my call.

jcrist commented 1 year ago

Personally my intended usecase would be to return a structured response from an API endpoint instead of an error string.

Can you comment with an example of the kind of structured message you'd like to be able to send?

IAmTomahawkx commented 1 year ago

The idea was to match our other validation responses, eg

{
    "errType": "InvalidType",
    "errZone": "body", // "query", "path", etc
    "errLoc": "$.files[0].content",
    "received": "integer",
    "expected": "string"
}

Whereas currently we simply have a separate type for body errors:

{
    "errType": "InvalidBodyType",
    "errMsg": "Expected `str`, got `bool` - at `$.files[0].content`"
}
jcrist commented 1 year ago

Ok. I think we can support this use case.

To do this we'd only include the stringified received/expected components, since it's hard to backtrack out the actual python type instances from our own internal representation.

Are the current string versions of expected/received/path sufficient for your use case?

For errors that don't have an expected/received type included in the message, we could do one of 3 things:

  1. Add a new error type that only occurs for errors of this form, as you suggested.
  2. Set these fields as None when no expected/received type is included in the message
  3. Add a best approximation of these fields. Errors that don't include an expected/received are generally of one of two forms:
    • The proper type was received, but it didn't match a constraint (in this case expected and received would be equal)
    • The value is parsed from a str value, and some parsing issue occurred (in this case expected would be the expected type, and received would be str).

I don't have a strong preference towards any of these options in particular.

Thoughts?

IAmTomahawkx commented 1 year ago

I think the first option would be the best, as it allows the most flexibility for changes to other errors (although location may be able to go on the base ValidationError). its also quite easy to identify the exact error being presented to you, as opposed to checking the values to determine what's going on. As for the stringified stuff, seems fine to me!

jcrist commented 1 year ago

Sounds good to me! Thanks for the helpful conversation here, I think we've arrived at a good solution. Should hopefully be able to get this resolved before the next release.

iodbh commented 1 year ago

Hi ! Just pitching in to describe my use case since this issue is exactly the one I was looking for :)

I have 2 for structured errors:

As an example, I use a tagged union to deserialize messages and i want to handle the error differently based on whether the tag field is invalid ("unhandled message type") or any other fields have errors ("invalid payload").

The solution you have been discussing here works well since all we need is some way to identify the field. A couple of related wishes for you consideration:

peterschutt commented 1 year ago

This would also allow Litestar to customize our client errors with less work (i.e., not having to rely on regex to parse the error message).

This wouldn't require any more than you have already detailed here.

guacs commented 1 year ago

I think the above discussion covers the issues I've presented in #468.

Also, I think something similar to pydantic's ErrorDetails class would be great. This has an attribute type which specifies whether the field is missing or invalid etc. Having something like that in the exception would be great as well. That is, details regarding whether the validation failed due to the field being missing, field has incorrect type or field has valid types, but the validation logic fails.

One thing, I wanted to add which I only realized now is that I think the way that msgspec does validation, it'll stop immediately when the first invalid input is reached correct? Or is my understanding incorrect? Would it be possible to keep going to be able to catch any other validation errors? This would help when returning an API response with a list of all the fixes to make instead of the user having to make the API request each time and fix errors one by one.

peterschutt commented 1 year ago

One thing, I wanted to add which I only realized now is that I think the way that msgspec does validation, it'll stop immediately when the first invalid input is reached correct? Or is my understanding incorrect? Would it be possible to keep going to be able to catch any other validation errors? This would help when returning an API response with a list of all the fixes to make instead of the user having to make the API request each time and fix errors one by one.

Hi @guacs - I've previously asked a similar question for which @jcrist went to some effort to explain his rationale for the "fail fast" behavior - I'll quote it here as it deserves visibility (ref):

No, this is not something I intend on supporting. For prior art, most typed JSON serialization libraries (across language ecosystems) don't support raising multiple errors. Neither golang's json nor rust's serde support this, and people generally love those tools.

When parsing JSON into a specified object type, there are multiple ways an error can occur:

  • The JSON is invalid
  • The JSON is valid, but the JSON type is incorrect (expected a number, got an array)
  • The JSON type is correct, but the value is invalid (e.g. unknown Enum value, int out of range, invalid datetime str etc...)
  • A JSON object is missing a field (or in certain situations has an extra field)

When parsing and validating, a failure in one location could be one error, or several. There's no clearly defined behavior for what should be expected. pydantic might group a few errors together and say "these are the things that are invalid". cattrs might find a different set of errors. But there's no clear definition of what "all the errors" means. Say for example you're parsing the value "2022-03-" into a type datetime | date | None. This string is not a valid datetime or date, and it isn't None. Is that one, two, or three errors?

Further, what if you receive a list of 1000 strings as an input for a field, but expected a list of ints. Is that 1000 errors (one for each item), or one for the list?

Maintaining a running structure of errors like this can get expensive, and make the failure mode in parsing significantly more expensive than the success mode. I'm skeptical of its general utility, and think if providing errors for multiple fields is important, you'd be better served by doing that validation yourself after parsing, so you can ensure the kinds of user-facing errors you wish to provide are uniformly handled, rather than whatever errors the parsing framework happens to collect.

Raising the first error found is easier to make consistent, easier for a user to build intuition around (you can look at a given message and a given type and know for sure what error will be raised), and much much more performant in the case of failures.

guacs commented 1 year ago

@peterschutt thank you for pointing this out. If this is something that's not being planned to be supported, I'm completely fine with that actually. Even though there a few points I'm not sure I agree with completely, I get the rationale and think it's reasonable.

IAmTomahawkx commented 11 months ago

Sounds good to me! Thanks for the helpful conversation here, I think we've arrived at a good solution. Should hopefully be able to get this resolved before the next release.

@jcrist do you have a timeline in mind for this nowadays?