jcrist / msgspec

A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML
https://jcristharif.com/msgspec/
BSD 3-Clause "New" or "Revised" License

Validation on serialization #615

Open fungs opened 6 months ago

fungs commented 6 months ago

Description

I want to be sure that data which is serialized and transferred is really valid. Currently, constraints are only checked when decoding. To achieve this, one can of course decode the data again on the sender side before submission. While this works for small objects, it creates a computational burden for very large objects.

Wouldn't it be possible to run the very same constraint checks at serialization time (on demand)? In my understanding, this would add the same small overhead that validation currently adds on the receiver side. Otherwise, is there a way to manually run validation on the data?
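A minimal sketch of the situation and the workaround described above; the `Point` struct and its constraint are invented for illustration:

```python
from typing import Annotated

import msgspec

# illustrative constraint: values must be >= 0
NonNegative = Annotated[int, msgspec.Meta(ge=0)]

class Point(msgspec.Struct):
    x: NonNegative
    y: NonNegative

p = Point(x=1, y=-2)           # constructs fine: no validation on init
data = msgspec.json.encode(p)  # encodes fine: no validation on encode

# the sender-side workaround: decode the freshly encoded payload to
# surface constraint violations before transmitting it
try:
    msgspec.json.decode(data, type=Point)
except msgspec.ValidationError as err:
    print(f"refusing to send invalid payload: {err}")
```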

FHU-yezi commented 6 months ago

Maybe related to #513?

We can validate the data when we create the struct object.

fungs commented 6 months ago

Thanks @FHU-yezi for the linked issue. I've read through it, and it is definitely related.

Let me try to explain a little further so everyone can understand this request.

IMO there are architectural and practical differences depending on who does the validation, and when. The goal should be to guarantee that a data structure was validated and not modified before serialization.

Strategy 1: In-type validation (validate-and-freeze-on-construction variant)

I totally like this concept because it merges the notions of type and constraint. The distinction between the two is, in my eyes, just an artifact of how computer systems commonly define and handle data types, mostly related to hardware architecture. However, to guarantee that the data stays valid all the way until serialization, we must either write-protect it effectively (aka frozen objects), or we must revalidate after each possible modification. The former is difficult in Python due to its dynamic nature. The latter requires rewriting or wrapping a type with all its write-enabled methods, and even its accessible members.

An example of this approach is Pydantic's NonNegativeInt type. If the type invariant says "I cannot be invalid", all is fine. I'd go for this approach in programming languages suited to it, but not in Python, where it would be really hard for anyone to write custom types.
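For reference, a rough sketch of how this strategy can be approximated with pieces msgspec already provides (frozen structs plus a `__post_init__` check); the `NonNegativePoint` struct is made up for the example:

```python
import msgspec

class NonNegativePoint(msgspec.Struct, frozen=True):
    """Validate on construction, then freeze.

    frozen=True blocks attribute assignment after __init__, so the
    invariant checked in __post_init__ holds until serialization.
    """

    x: int
    y: int

    def __post_init__(self):
        # msgspec calls this after __init__ and after decoding
        if self.x < 0 or self.y < 0:
            raise ValueError("coordinates must be non-negative")
```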

Strategy 2: Lazy validation (serialization)

If we cannot guarantee a validated state or safeguard the object from modification during processing, the logical option is to defer the validation to serialization time, thus circumventing the problem. To me, this also makes sense because the serialization routine usually needs to touch and re-encode every single item in the data structure anyway, which would guarantee that we spend only linear time on validation. It's important to note that the validation needs to be type-informed, just like the serialization: both require deep knowledge about the semantics and structure of the type being processed.

In msgspec, validation is only applied on the back-transform (decoding). There it doesn't really matter how it is done, because the full pipeline is implemented in msgspec itself. I assume that, for efficiency reasons, msgspec performs validation during deserialization in C code, once the final data type objects are constructed in the chain.

Architecture

So why don't we just validate on instantiation and protect the data by code ownership until serialization?

The answer is software architecture. The data types in these kinds of frameworks (see attrs, pydantic, dataclasses, etc.) serve two different purposes: defining data models and interfaces, and creating and working with objects easily and efficiently. So when building a standalone serialization layer for specific data with a matching interface, the objects are constructed elsewhere: maybe in a mutable version, maybe much earlier in the data processing pipeline, in custom code or in a different Python package, but relying on the very same interface definition. Thus, we cannot assume that every object passed in complies with the definition the receiver expects.

That being said, if the struct constructor mentioned in #513 accepts an object of the same type with zero copy and can validate all the members, this would be equivalent to a simple validate(data) call to be run right before serialization (although probably less efficient than validation and serialization in the same procedure).
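Until such a constructor exists, a stand-in for that `validate(data)` call can be composed from existing msgspec primitives, for example by round-tripping through builtin containers. This is a hypothetical helper, not msgspec API, and it pays for an extra copy instead of sharing work with the encoder:

```python
import msgspec

def validate(obj: msgspec.Struct) -> None:
    # hypothetical helper approximating the proposed validate(data):
    # dump the struct to builtin containers, then convert it back,
    # which re-runs the same type and constraint checks as decoding
    msgspec.convert(msgspec.to_builtins(obj), type=type(obj))

# usage, right before serialization:
#   validate(point)
#   payload = msgspec.json.encode(point)
```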

fungs commented 6 months ago

#614 is inspired by the same architectural considerations.

FHU-yezi commented 6 months ago

@fungs makes some really meaningful points.

For strategy 2 that he mentioned, we also have another use case: what if the struct will never be serialized?

In my case, the struct object is used directly by the user's code, purely for autocompletion and type checking; the user will never serialize it unless they want to store it somewhere else.

In that case, if we don't support validation on init, the struct definition may differ from the real data, which will lead to misunderstandings.
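A tiny illustration of that divergence, with a made-up `User` struct: the annotation documents a constraint that nothing enforces as long as the object never passes through a decoder:

```python
from typing import Annotated

import msgspec

Username = Annotated[str, msgspec.Meta(min_length=3)]

class User(msgspec.Struct):
    name: Username

# accepted today, even though it contradicts the definition; the
# mismatch only surfaces if the object ever passes through a decoder
u = User(name="")
```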

fungs commented 6 months ago

This seems to be a well-structured approach to strategy 1: https://smarie.github.io/python-vtypes/

It might be compatible with msgspec; I need to test it.