pydantic / pydantic-core

Core validation logic for pydantic, written in Rust
MIT License

Feature Request: 3rd party non-JSON serialization/deserialization #79

Open NowanIlfideme opened 2 years ago

NowanIlfideme commented 2 years ago

Hi, author of pydantic-yaml here. I have no idea about anything Rust-related, unfortunately, but hopefully this feature request will make sense in Python land.

I'm going off this slide in this presentation by @samuelcolvin, specifically:

We could add support for other formats (e.g. yaml, toml); the only side effect would be bigger binaries.

Here's a relevant discussion about "3rd party" deserialization from v1: https://github.com/samuelcolvin/pydantic/discussions/3025

It would be great if pydantic-core were built in a way where non-JSON formats could be added "on top" rather than necessarily being built into the core. I understand performance is a big question in this rewrite, so ideally these would be high-level interfaces that can be hacked in Python (or implemented in Rust/etc. for better performance).

From the examples available already, it's possible that such a feature could be quite simple on the pydantic-core side: the 3rd party would create their own function à la validate_json, possibly just calling validate_python. However, care would be needed over how format-specific details are passed between pydantic and the implementation. In V1 this is done with the Config class and special json_encoder/decoder attributes, which have been a pain to re-implement properly for YAML (without way too much hackery).
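For illustration, here is a minimal sketch of the kind of v1 hack I mean, assuming PyYAML and piggybacking on Config.json_loads, the parser hook that parse_raw uses (the model and fields are made up):

```python
import yaml  # assumes PyYAML is installed
from pydantic import BaseModel  # pydantic v1


class Settings(BaseModel):
    name: str
    retries: int = 3

    class Config:
        # v1 only exposes a "JSON" loader hook, so YAML support means
        # pretending YAML is JSON and swapping the parser here
        json_loads = yaml.safe_load


settings = Settings.parse_raw("name: demo\nretries: '5'")
print(settings)  # name='demo' retries=5
```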

Ideally for V2, this would be something more easily addable and configurable. The alternative would be to just implement TOML, YAML etc. directly in the binary (and I wouldn't have to keep supporting my project, ha!)

Thanks again for Pydantic!

samuelcolvin commented 2 years ago

Thanks for the question.

I think this is just calling validate_python, or indeed initializing the pydantic model as I guess you do now.

No changes should be required to pydantic-core to allow this.
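For illustration, a minimal sketch of that approach, assuming PyYAML and the typed-dict core-schema layout shown in the pydantic-core README (the schema and values are made up):

```python
import yaml  # assumes PyYAML is installed
from pydantic_core import SchemaValidator

# core schema layout follows the typed-dict example in the pydantic-core README
validator = SchemaValidator(
    {
        'type': 'typed-dict',
        'fields': {
            'name': {'type': 'typed-dict-field', 'schema': {'type': 'str'}},
            'port': {'type': 'typed-dict-field', 'schema': {'type': 'int'}},
        },
    }
)


def validate_yaml(raw: str) -> dict:
    # parse the YAML in Python land, then hand plain Python objects to pydantic-core
    return validator.validate_python(yaml.safe_load(raw))


print(validate_yaml("name: server\nport: '8080'"))  # {'name': 'server', 'port': 8080}
```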

I want to add support for line numbers in errors, but that requires a new Rust JSON parser, so it won't be added until v2.1 or later.

samuelcolvin commented 2 years ago

Closing, but feel free to ask if you have more questions.

samuelcolvin commented 2 years ago

To be clear, I would love this to be possible, but I don't want to have to add the capability to parse more formats to pydantic-core itself, so the only route would be to achieve runtime linking of pydantic-core and the third-party libraries that perform the parsing.

This is the only way I can think of that it might work:

(Note 1: there's probably a better way to do this, I'm not an expert at this stuff.)
(Note 2: this might not work.)
(Note 3: I'm not even convinced this is a good idea, and I don't promise to add this functionality.)

With that out of the way, here's a very rough idea:

  1. A new Rust crate (crates.io package) which basically just exports:
    • JsonInput (new name required)
    • A thin (but opaque) wrapper for JsonInput which makes it available as a Python object (not an actual conversion to the dict etc. from JsonInput, just a way to return a pointer to the JsonInput back to Python land)
  2. pydantic-core uses this new package to parse JSON, as it does currently
  3. pydantic-core also provides a new way to pass the thin wrapper to SchemaValidator; pydantic-core then extracts the JsonInput and validates it with the same logic it uses now for JSON data
  4. 3rd party packages (written in Rust) use the above pydantic-core-json-input crate, perform the logic of building JsonInputs in Rust, then return them to Python world to in turn be passed to pydantic-core

With this approach, while we go "via Python", we never have to do the hard work of converting the JsonInput to a Python object.
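From the Python side, the flow might end up looking something like the sketch below. Every name here is hypothetical (pydantic_core_yaml, parse_to_json_input and validate_json_input don't exist); the point is only that the parsed JsonInput crosses the Python boundary as an opaque pointer, never as Python objects.

```python
from pydantic_core import SchemaValidator

# Hypothetical 3rd party Rust extension built on the shared json-input crate (step 1):
# it parses YAML in Rust and returns an opaque handle wrapping the resulting JsonInput.
from pydantic_core_yaml import parse_to_json_input  # hypothetical package and function

validator = SchemaValidator({'type': 'int'})

opaque = parse_to_json_input('123')             # step 4: JsonInput built in Rust
result = validator.validate_json_input(opaque)  # step 3: hypothetical new entry point
assert result == 123
```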

07pepa commented 2 years ago

I am coming here from here

and I want to share my ideas about how runtime plugins of pydantic-core should work...

My idea is to do it kind of similar to DLLs.

You would have to have your pydantic-core plug-ins in one pre-agreed location (compiled) (or registered somehow), so we know where to load them from and can avoid "DLL hell". On startup of pydantic-core you would say: hey, load this and that of that version (no multiple versions allowed for one instance of pydantic-core; globally you could have them, but within one app I think it would create confusion).

During deserialization you would just say which deserializer to use (the format alone is not enough, since there is more than one serializer/deserializer available per format, like simdjson).

I am also not an expert in this, but I think there should be the following requirements:

1) More than one serializer allowed, and allow deserializing to one class from multiple formats.
2) More than one serializer per format (chosen by name?)
3) The serializer can be chosen dynamically.
4) (Not mandatory) only chosen serializers are loaded (to limit the library's load time).
5) Hard fail if a serializer is missing or incompatible.

As for why I am suggesting decoupling...

If this is too hard, I would just suggest doing it at compile time, but that may be too complicated...
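To make that concrete, here is a purely hypothetical sketch of what the registration/selection API could look like; nothing below exists in pydantic-core, and all names are invented:

```python
from pydantic_core import SchemaValidator
from pydantic_core import plugins  # hypothetical module, does not exist today

# load only the chosen deserializers from the pre-agreed location,
# hard-failing if one is missing or incompatible (requirements 4 and 5)
plugins.load('yaml', version='1')
plugins.load('simdjson', version='2')

validator = SchemaValidator({'type': 'int'})

# choose a deserializer by name at call time, not just by format (requirements 2 and 3)
value = validator.validate_bytes(b'123', deserializer='simdjson')  # hypothetical method
assert value == 123
```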

samuelcolvin commented 2 years ago

This sounds very hard to do in a reliable cross-platform way, given the problems we already experience (at the scale of pydantic or watchfiles) with very subtle OS differences and wheels. I'm very unwilling to enter into this mess.

You're effectively proposing another way to link libraries that sidesteps both crates and Python imports. Are there any other examples of Python libraries that use DLLs/shared libraries to access functionality in other packages without going through the Python runtime?

(Perhaps worth remembering that I'll probably be maintaining pydantic for the next decade at least; one "clever" idea proposed now which relies on shaky cross-platform behaviour could cost me hundreds of hours of support over the years - hence my caution.)

The real question is how much faster this would be than my proposal above.

To proceed with the conversation, we need to benchmark that. Really, someone needs to build both solutions (and any third solution proposed) and see how they perform.

@PrettyWood @tiangolo do you have any feedback on any of this?

07pepa commented 2 years ago

Well, loading crates may be fine as well... but as I said, I am not an expert...

samuelcolvin commented 2 years ago

Crates would need to be a compile-time dependency, so distributed wheels couldn't be used.

07pepa commented 2 years ago

Ah... yeah... I forgot about that... because I would just force you to compile when you install the library...

However, if there is little to no performance impact with @samuelcolvin's solution, I would also be fine with that.

That said, there are people "needing" simdjson... and in extreme cases performance may degrade.

samuelcolvin commented 2 years ago

If you care about "extreme performance", don't use Python; build the whole thing in Rust, Go or C.

NowanIlfideme commented 2 years ago

Sorry for missing this discussion 2 weeks ago...

I need to check out and play with the current (v0.3.1) version of pydantic-core before I can really give an informed opinion, but from a cursory glance it seems that validate_python() should be enough to implement this in Python land.

Regarding a Rust-side implementation, I think it all sounds too messy for a Python-facing library. "Config parsing" use cases don't require cutting-edge performance anyway - you generally parse a single YAML file at the beginning of a script (vs. many JSON API requests per second). And YAML files aren't usually passed between (performance-critical) applications, since parsing YAML is slower anyway. There are similar considerations with TOML. I guess the most JSON-like thing would be XML derivatives, but I don't have much experience there, and haven't encountered anyone using Pydantic for XML yet 😉

samuelcolvin commented 2 years ago

I agree, validate_python is enough for everything except performance-critical applications.

The only other thing you might need is line numbers; that's one of the main drivers (for me) of #10.

We need to think about how to make this possible without adding complexity or damaging performance.