jcrist / msgspec

A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML
https://jcristharif.com/msgspec/
BSD 3-Clause "New" or "Revised" License
2.01k stars, 59 forks

Accept hook functions in Meta for more ergonomic support of custom data types #707

Open 00dani opened 4 days ago

00dani commented 4 days ago

Description

Currently, to enable support for a custom type in msgspec, you write hook functions that look like this:

from typing import Any, Type

def enc_hook(obj: Any) -> Any: ...
def dec_hook(type: Type, obj: Any) -> Any: ...
def ext_hook(code: int, data: memoryview) -> Any: ...

These functions must handle all custom types you want to support, and they must be manually passed into the relevant msgspec functions, such as msgspec.json.decode() and msgspec.convert(), at the point of use. This is inconvenient if you need to support more than one custom type, since every type has to be handled inside the same hook function. Additionally, some msgspec APIs don't accept the necessary hooks as arguments and therefore don't work properly with custom types (see #679 for example).
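To make the "one hook for everything" problem concrete, here is a minimal sketch of the kind of dispatch you end up writing by hand today. It uses only the standard library; the `_CODECS` registry and the `register()` helper are hypothetical names made up for illustration, not part of msgspec, and `complex` and `Fraction` are just stand-in custom types.

```python
from fractions import Fraction
from typing import Any, Callable, Dict, Tuple, Type

# Hypothetical registry: maps a custom type to its (encode, decode) pair.
_CODECS: Dict[type, Tuple[Callable[[Any], Any], Callable[[Any], Any]]] = {}


def register(typ: type, encode: Callable[[Any], Any], decode: Callable[[Any], Any]) -> None:
    """Record how to encode and decode one custom type."""
    _CODECS[typ] = (encode, decode)


def enc_hook(obj: Any) -> Any:
    """Single encoding hook that dispatches on the runtime type of obj."""
    for typ, (encode, _) in _CODECS.items():
        if isinstance(obj, typ):
            return encode(obj)
    raise NotImplementedError(f"Objects of type {type(obj)} are not supported")


def dec_hook(typ: Type, obj: Any) -> Any:
    """Single decoding hook that dispatches on the requested target type."""
    try:
        _, decode = _CODECS[typ]
    except KeyError:
        raise NotImplementedError(f"Objects of type {typ} are not supported") from None
    return decode(obj)


# Two stand-in custom types, each registered once.
register(complex, lambda c: [c.real, c.imag], lambda v: complex(*v))
register(Fraction, str, Fraction)
```

Even with a registry like this, the resulting `enc_hook` and `dec_hook` still have to be passed as keyword arguments to every call such as msgspec.json.encode(), msgspec.json.decode() or msgspec.convert(), which is the part this proposal is trying to remove.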

For instance, I'm using the URL type from the yarl package, so I wrote appropriate hooks to support it:

from msgspec import Meta
from yarl import URL as Yarl
from typing import Annotated, TypeVar, cast

URL = Annotated[Yarl, Meta(extra_json_schema={"type": "string", "format": "uri"})]

def enc_hook(obj: object) -> object:
    # Encode yarl URLs as their human-readable string form.
    if isinstance(obj, Yarl):
        return obj.human_repr()
    raise NotImplementedError(f"Objects of type {type(obj)} are not supported")

T = TypeVar('T')

def dec_hook(typ: type[T], value: object) -> T:
    # Only str input can be turned into a URL; cast() is there to satisfy Mypy.
    if typ is Yarl and isinstance(value, str):
        return cast(T, Yarl(value))
    raise NotImplementedError(f"Objects of type {typ} are not supported")

This works fine for encoding and decoding, but not for generating a JSON Schema, because you can't currently pass an encoding hook into schema generation. It's also rather unwieldy - Mypy has trouble understanding it and requires a cast for reasons I don't quite understand, and if I want to support any additional custom types it's going to get more complicated.

To remedy these gripes, I propose an extension to the typing.Annotated[T, msgspec.Meta()] syntax that's already used for applying constraints. Optionally, a Meta() structure should be able to contain each of the above hooks, as well as a schema_hook. Whenever msgspec encounters a type it doesn't understand, it should consult that type's metadata for suitable hooks, falling back on the existing behaviour if there aren't any. For example, to support yarl URLs as I did above, I would write something like:

def to_yarl(value: object) -> Yarl:
    if isinstance(value, str):
        return Yarl(value)
    raise NotImplementedError(f"Cannot convert {type(value)} to URL")

def from_yarl(url: Yarl) -> str:
    return url.human_repr()

# Proposed syntax: Meta() does not accept dec_hook or enc_hook today.
URL = Annotated[Yarl, Meta(
    dec_hook=to_yarl,
    enc_hook=from_yarl,
    extra_json_schema={"type": "string", "format": "uri"},
)]

This setup means that wherever the URL type is encountered throughout your codebase, msgspec will always have access to the appropriate hooks to deal with it, since they're bundled along with the type itself. The hook functions themselves are much simpler too, and msgspec shouldn't have to do much more work to retrieve them, since it's already regularly using the metadata it gathers through type inspection for other purposes.
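As a rough illustration of why retrieving the hooks should be cheap, here is a standard-library sketch of looking them up from Annotated metadata. This is not how msgspec works internally: `HookMeta`, `find_hooks()` and `decode_fields()` are hypothetical names standing in for the proposed Meta() fields and for msgspec's own type inspection, and `complex` stands in for a custom type.

```python
from dataclasses import dataclass
from typing import Annotated, Any, Callable, Optional, get_args, get_origin, get_type_hints


@dataclass(frozen=True)
class HookMeta:
    """Hypothetical stand-in for a Meta() that carries per-type hooks."""
    enc_hook: Optional[Callable[[Any], Any]] = None
    dec_hook: Optional[Callable[[Any], Any]] = None


def find_hooks(annotation: Any) -> Optional[HookMeta]:
    """Return the first HookMeta attached to an Annotated type, if any."""
    if get_origin(annotation) is Annotated:
        for extra in get_args(annotation)[1:]:
            if isinstance(extra, HookMeta):
                return extra
    return None


# Stand-in custom type: complex numbers stored as [real, imag] pairs.
Complex = Annotated[
    complex,
    HookMeta(enc_hook=lambda c: [c.real, c.imag], dec_hook=lambda v: complex(*v)),
]


class Point:
    label: str
    position: Complex


def decode_fields(cls: type, raw: dict) -> dict:
    """Convert raw values field by field, using bundled hooks where present."""
    hints = get_type_hints(cls, include_extras=True)
    out = {}
    for name, annotation in hints.items():
        hooks = find_hooks(annotation)
        value = raw[name]
        out[name] = hooks.dec_hook(value) if hooks and hooks.dec_hook else value
    return out


decoded = decode_fields(Point, {"label": "origin", "position": [0.0, 1.0]})
```

The point of the sketch is that the hook lookup happens in the same place the annotation is already being inspected, so no extra argument has to be threaded through the call site.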

The only caveat with this approach is that you have to use the annotated version of the custom type for this to work - msgspec still won't know what to do with the unadorned type from the upstream package. This isn't a problem if you're defining your own msgspec.Struct types and have full control over the types you want to decode and encode, but if you need to decode to a dataclass from a third-party package you can't use this technique and will need to keep writing "global" hook functions instead.
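A small standard-library sketch of that caveat, under the assumption that bundled hooks are tried first and the existing "global" hook is the fallback. `DecHook`, `decode_as()` and `global_dec_hook()` are hypothetical names for illustration only, and `Fraction` stands in for a third-party type.

```python
from fractions import Fraction
from typing import Annotated, Any, Callable, get_args, get_origin


class DecHook:
    """Hypothetical marker carrying a per-type decoding hook in Annotated metadata."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func


def global_dec_hook(typ: type, value: Any) -> Any:
    """Fallback "global" hook, in the same shape as today's dec_hook."""
    if typ is Fraction and isinstance(value, str):
        return Fraction(value)
    raise NotImplementedError(f"Objects of type {typ} are not supported")


def decode_as(annotation: Any, value: Any) -> Any:
    """Prefer a hook bundled with the annotation, else use the global hook."""
    if get_origin(annotation) is Annotated:
        base, *extras = get_args(annotation)
        for extra in extras:
            if isinstance(extra, DecHook):
                return extra.func(value)
        annotation = base  # Annotated, but without a bundled hook
    return global_dec_hook(annotation, value)


# Annotated alias: the hook travels with the type.
Ratio = Annotated[Fraction, DecHook(Fraction)]

from_alias = decode_as(Ratio, "1/3")    # uses the bundled hook
from_bare = decode_as(Fraction, "1/3")  # bare type: only the global hook can help
```

The bare `Fraction` annotation carries no metadata at all, so without `global_dec_hook` the second call would have nothing to go on. That is the situation you'd be in with a third-party dataclass whose fields use the unadorned type.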

Thoughts?