Pydantic should not be used as internal data container

minvws / nl-kat-coordination

OpenKAT scans networks, finds vulnerabilities and creates accessible reports. It integrates the most widely used network tools and scanning software into a modular framework, accesses external databases such as shodan, and combines the information from all these sources into clear reports. It also includes lots of cat hair.

https://openkat.nl

European Union Public License 1.2

126 stars 58 forks source link

Pydantic should not be used as internal data container #738

Open dekkers opened 1 year ago

dekkers commented 1 year ago

Pydantic primary purpose is data validation. This is great for things like parsing and validating JSON, but we don't need to do data validation for internal data structures and pydantic will have an unnecessary overhead.

This is a good post that describes different Python data containers and shows the overhead of pydantic:

https://towardsdatascience.com/battle-of-the-data-containers-which-python-typed-structure-is-the-best-6d28fde824e

For mutable internal data containers the best option would be to use dataclasses with slots. Given that the slots are only supported since Python 3.10 we should probably define our own dataclass decorator that will use slots on Python 3.10 and higher and fallback to dataclass without slots on earlier Python versions.

If the data container can be immutable then namedtuple is the best option.

Darwinkel commented 1 year ago

Interesting, that blog post makes a persuasive case for dataclasses. Their syntax (both with and without slots) is pretty clean as well - better than Pydantic or regular classes. It's another reason why we should consider upgrading to a higher version (https://github.com/minvws/nl-kat-coordination/issues/627). Dropping support for <3.10 altogether would save us work, and I think it's becoming more and more interesting to consider providing/requiring a native Python 3.11 environment for Debian and Ubuntu.

Donnype commented 1 year ago

I have experienced the performance hits of pydantic as well. I must say that we were working with a large amount of objects though and we experienced the overhead in writing these to a large json file.

We concluded the same thing at the time: why use Pydantic after initial validation has been performed? So yes I would be all for such a convention.

Having said that, regarding slots I would first want to see that using them actually has significant impact on performance before deciding to implement them everywhere.

praseodym commented 1 year ago

+1 for switching to data classes. For our container images this is another reason to update to Python 3.10+ (#627).

Having said that, regarding slots I would first want to see that using them actually has significant impact on performance before deciding to implement them everywhere.

I’d say that the tests in the linked post make a convincing argument that slots make a big difference, primarily in memory usage.

Donnype commented 1 year ago

@praseodym I see and I must say that my opinion also depends on the impact/complexity of the implementation as well. (Quick wins are always great.) But every time I do optimisations I learn not to assume I know the bottlenecks of my workload or impact of certain changes without profiling.

praseodym commented 1 year ago

“Pydantic V2 is 5-50x faster than Pydantic V1”, according to https://docs.pydantic.dev/latest/blog/pydantic-v2-alpha/. But I’m not sure if we should wait for it, or switch to data classes before the V2 release.

Darwinkel commented 1 year ago

Meeting notes (23-05-2023):

In the future, we will upgrade our codebase to the highest possible version (e.g. pyupgrade) to eliminate tech debt.

We will leave the data classes optimization for later, as we have easier performance improvements we can make. Pydantic should not be used as internal data container #738