python-attrs / cattrs

Composable custom class converters for attrs, dataclasses and friends.
https://catt.rs
MIT License
823 stars 113 forks

[Feature] Use typing.Annotated for customization #571

Open diego-oncoramedical opened 2 months ago

diego-oncoramedical commented 2 months ago

Description

Creating separate converters involves a decent amount of overhead, which quickly gets cumbersome for a large number of classes. I was thinking it'd be possible to use typing.Annotated to reduce this work in a backwards-compatible way, without messing up type checking for those who need it.

Here's an example that could solve #352 and #552, if I understand them correctly:

from typing import Annotated
import cattrs.annotations as ca   # hypothetically

class SomeKVPair:
    key: Annotated[str, ca.Rename("Key")]
    data_size: Annotated[int, ca.FromNested("Data.Size")]
    value: Annotated[bytes, ca.AsBase64, ca.Rename("Value")] = b""

There are a few benefits to this:

  1. Type checkers will still understand the annotations.
  2. No need to rely on metadata inserted using attrs.field() or similar.
  3. It's clear to anyone looking at just the class what transforms are made. If they want to do their own serialization (e.g. create an XML serializer) it'll be clear what transforms they need to do.
  4. Getting the extra annotations is opt-in, so it'll be invisible to older versions of cattrs that don't support it.
  5. Other libraries can also see these and use it for their own purposes.
  6. typing.Annotated has been available since Python 3.9. Python 3.8 goes EOL in October, so there's no need for compatibility shims.

Presumably this will be a non-trivial change, but I think it would be really valuable.
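For reference, the machinery needed to read these markers already exists in the stdlib via typing.get_type_hints(include_extras=True). A minimal sketch (Rename here is a hypothetical stand-in for the proposed cattrs.annotations API):

```python
from dataclasses import dataclass
from typing import Annotated, get_args, get_origin, get_type_hints

# Hypothetical marker, standing in for the proposed ca.Rename.
@dataclass(frozen=True)
class Rename:
    name: str

class SomeKVPair:
    key: Annotated[str, Rename("Key")]
    data_size: int

def collect_renames(cls: type) -> dict[str, str]:
    """Map attribute names to serialized names by inspecting Rename markers."""
    renames = {}
    for attr, hint in get_type_hints(cls, include_extras=True).items():
        if get_origin(hint) is Annotated:
            # get_args() yields the underlying type first, then the metadata.
            for meta in get_args(hint)[1:]:
                if isinstance(meta, Rename):
                    renames[attr] = meta.name
    return renames

print(collect_renames(SomeKVPair))  # {'key': 'Key'}
```

A converter (or a hook factory) could call something like collect_renames when generating its structure/unstructure functions, so supporting this should not require new language machinery.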

salotz commented 2 months ago

I like this. I've found myself reaching for attrs metadata a lot to do particular customizations, but that doesn't work for basic types like dict etc., where I've found it would also be useful.

jonathanberthias commented 2 months ago

While interesting, I'm afraid this goes against the principles of cattrs. You might be interested in the motivation page, especially the "important" note after the first example. If you still want this feature, I guess pydantic supports it, but you'll have to change your attrs classes to pydantic.BaseModel.

salotz commented 2 months ago

If you still want this feature, I guess pydantic supports it, but you'll have to change your attrs classes to pydantic.BaseModel.

There is a big difference between the proposed approach and the basics of pydantic. In pydantic you have to actually opt into their special type system if you want to customize things, e.g. types like PositiveInt. The attrs/cattrs approach is to have the normal int type, and to control when and how validation of "positive integer" is enforced through annotations that do not affect the core class definition itself.

The test for me is this: if you wanted to stop using pydantic and switch to another system, you would have to go through all your classes/models and change all the types. With (c)attrs you would only need to delete the annotations; the coupling does not extend into the actual data types. For me this separation is what is important in the Motivation of cattrs, not which file or class definition the policy actually lives in.

In practice there is no way to avoid attaching specific annotations to your classes describing how to customize serialization in special cases and contexts. How and where you write this behavior down is only a question of style, IMO.

For instance do you have a central place where you register hooks for a specific type? Or do you have that at the location in which the type is defined?

The way I do this currently is through attrs field metadata. By choosing this, it's really up to the converter to be able (and willing) to understand and honor that annotation. It in no way affects the business logic code that works with the constructed object, like it potentially could in Pydantic.

The alternative would be something like a large centralized lookup table, in each converter, holding the policy for every class in your code base. Maybe this works for your team, but for many it's much better to have it all in one file alongside the definition.
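To illustrate the field-metadata style with stdlib pieces only: dataclasses.field(metadata=...) behaves the same way as attrs field metadata here, and the "rename" key is purely my own convention that the converter opts into honoring, not a cattrs API:

```python
from dataclasses import dataclass, field, fields

@dataclass
class Record:
    # "rename" is an ad-hoc metadata key; only converters that choose to
    # look for it are affected.
    function_name: str = field(metadata={"rename": "funcName"})
    line_number: int = field(metadata={"rename": "lineno"})
    message: str = ""

def unstructure(obj) -> dict:
    """Dump to a dict, applying renames only because this converter honors them."""
    out = {}
    for f in fields(obj):
        key = f.metadata.get("rename", f.name)
        out[key] = getattr(obj, f.name)
    return out

rec = Record(function_name="main", line_number=3, message="hi")
print(unstructure(rec))  # {'funcName': 'main', 'lineno': 3, 'message': 'hi'}
```

The point is that the metadata is inert: a second converter that ignores "rename" would emit the plain attribute names, which is exactly the policy/mechanism split.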

diego-oncoramedical commented 2 months ago

While interesting, I'm afraid this goes against the principles of cattrs.

I understand that philosophy, and generally agree with it. I too try to avoid putting logic into my dataclasses where possible.

However, having the ability to do so as an opt-in feature would be incredibly valuable in speeding up production, reducing code, and documenting field behavior in a way that is both human- and machine-readable. I'm not saying by any means we should make this the default, just that I think it could be immensely useful.

For example, here's something I've used at work to deserialize structured logs. (Fields irrelevant to this example have been removed.)

Currently

147 lines

from __future__ import annotations

import uuid
from collections.abc import Mapping
from typing import Any
from typing import Callable
from typing import Final
from typing import TypeVar

import attrs
import cattrs.gen
import cattrs.preconf.json as cattrs_json

@attrs.frozen(kw_only=True, slots=True, eq=False, order=False)
class LogRecord:
    json: JSONSubRecord
    kubernetes: KubernetesSubRecord
    kubernetes_cluster: str

@attrs.frozen(kw_only=True, slots=True)
class JSONSubRecord:
    function_name: str | None = None
    log_level: str | None = None
    line_number: int | None = None
    extras: Mapping[str, Any] = attrs.field(factory=dict)

@attrs.frozen(kw_only=True, slots=True)
class KubernetesLabels:
    extras: Mapping[str, str] = attrs.field(factory=dict)

@attrs.frozen(kw_only=True, slots=True)
class KubernetesSubRecord:
    labels: KubernetesLabels = attrs.field(factory=KubernetesLabels)

T = TypeVar("T")

type Unstructurer = Callable[[object], dict[str, Any]]
type Structurer = Callable[..., Any]  # (mapping, *args, **kwargs) -> instance
type UnstructurerFactory = Callable[[type], Unstructurer]
type StructurerFactory = Callable[[type], Structurer]

def _make_structurer_factory(**unstructure_fn_kwargs: object) -> StructurerFactory:
    """Create a structurer factory that moves unrecognized keys into an "extras" dict
    on the model."""
    def _factory(cls: type[T]) -> Structurer:
        # First we generate what cattrs would have used by default.
        default_structure = cattrs.gen.make_dict_structure_fn(
            cls,
            _LOG_CONVERTER,
            **unstructure_fn_kwargs,
        )

        # Get the set of attribute names on the dataclass. Anything that's not in this
        # set will get put into the "extras" dict.
        attribute_names = {a.name for a in attrs.fields(cls)}

        def structure(
            source_mapping: Mapping[str, Any], *_args: object, **_kwargs: object
        ) -> T:
            copy = dict(source_mapping)
            copy["extras"] = {}

            # Move keys that aren't model attributes into the "extras" dict, which *is*
            # an attribute.
            for extra_key in source_mapping.keys() - attribute_names:
                copy["extras"][extra_key] = copy.pop(extra_key)

            # Now that all the unrecognized keys have been moved, we can go ahead with
            # normal structuring.
            return default_structure(copy, cls)

        return structure

    return _factory

def _make_unstructurer_factory(**unstructure_fn_kwargs: object) -> UnstructurerFactory:
    """Create an unstructurer factory that hoists all keys in "extras" into the output
    dict, and removes "extras".
    """
    def _factory(cls: type[T]) -> Unstructurer:
        # First we generate what cattrs would have used by default.
        default_unstructure = cattrs.gen.make_dict_unstructure_fn(
            cls,
            _LOG_CONVERTER,
            **unstructure_fn_kwargs,
        )

        def unstructure(value: T) -> dict[str, Any]:
            result_dict = default_unstructure(value)
            extras = result_dict.pop("extras", {})
            result_dict.update(extras)
            return result_dict

        return unstructure

    return _factory

_LOG_CONVERTER: Final = cattrs_json.make_converter()
_LOG_CONVERTER.register_structure_hook(uuid.UUID, lambda value, _: uuid.UUID(value))
_LOG_CONVERTER.register_unstructure_hook(uuid.UUID, str)

_JSON_STRUCTURER_FACTORY_KWARGS = {
    "function_name": cattrs.override(rename="funcName"),
    "line_number": cattrs.override(rename="lineno"),
    "log_level": cattrs.override(rename="levelname"),
}

_LOG_CONVERTER.register_structure_hook_factory(
    lambda cls: issubclass(cls, JSONSubRecord),
    _make_structurer_factory(**_JSON_STRUCTURER_FACTORY_KWARGS),
)
_LOG_CONVERTER.register_unstructure_hook_factory(
    lambda cls: issubclass(cls, JSONSubRecord),
    _make_unstructurer_factory(**_JSON_STRUCTURER_FACTORY_KWARGS),
)
_LOG_CONVERTER.register_structure_hook_factory(
    lambda cls: issubclass(cls, KubernetesLabels), _make_structurer_factory()
)
_LOG_CONVERTER.register_unstructure_hook_factory(
    lambda cls: issubclass(cls, KubernetesLabels), _make_unstructurer_factory()
)
_LOG_CONVERTER.register_structure_hook(
    LogRecord,
    cattrs.gen.make_dict_structure_fn(
        LogRecord,
        _LOG_CONVERTER,
        kubernetes_cluster=cattrs.gen.override(rename="kubernetes.cluster"),
    ),
)
_LOG_CONVERTER.register_unstructure_hook(
    LogRecord,
    cattrs.gen.make_dict_unstructure_fn(
        LogRecord,
        _LOG_CONVERTER,
        kubernetes_cluster=cattrs.gen.override(rename="kubernetes.cluster"),
    ),
)

Using Annotated

Here's the equivalent using a hypothetical implementation of this feature (43 lines, a reduction of 71%):

from __future__ import annotations

import uuid
from collections.abc import Mapping
from typing import Annotated
from typing import Any
from typing import Final

import attrs
import cattrs.annotations as ca
import cattrs.preconf.json as cattrs_json

@attrs.frozen(kw_only=True, slots=True, eq=False, order=False)
class LogRecord:
    json: JSONSubRecord
    kubernetes: KubernetesSubRecord
    kubernetes_cluster: Annotated[str, ca.Rename("kubernetes.cluster")]

@attrs.frozen(kw_only=True, slots=True)
class JSONSubRecord:
    function_name: Annotated[str | None, ca.Rename("funcName")] = None
    log_level: Annotated[str | None, ca.Rename("levelname")] = None
    line_number: Annotated[int | None, ca.Rename("lineno")] = None
    extras: Annotated[Mapping[str, Any], ca.CatchAll] = attrs.field(factory=dict)

@attrs.frozen(kw_only=True, slots=True)
class KubernetesLabels:
    extras: Annotated[Mapping[str, str], ca.CatchAll] = attrs.field(factory=dict)

@attrs.frozen(kw_only=True, slots=True)
class KubernetesSubRecord:
    labels: KubernetesLabels = attrs.field(factory=KubernetesLabels)

_LOG_CONVERTER: Final = cattrs_json.make_converter()
_LOG_CONVERTER.register_structure_hook(uuid.UUID, lambda value, _: uuid.UUID(value))
_LOG_CONVERTER.register_unstructure_hook(uuid.UUID, str)

salotz commented 2 months ago

@diego-oncoramedical The problem I see with your example is that there is no specialization for different converters. Should you always rename a field? Or only for JSON output?

For instance I have multiple JSON converters: one for the user-facing API and one for putting things into a database. The database one probably doesn't need to have things renamed, since there are no ergonomics to worry about when converting it. The same goes for consuming external APIs that require some renaming.

So I like the idea of using annotations, but they need to be specified as policy rather than mechanism, and it should be the converters that decide if they are in scope of the policy and need to perform it. In your example: Which converters should do renaming? That is not currently handled in your example.
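One way to express that scoping would be to let the annotation itself carry a set of converter scopes and leave it to each converter to decide whether it is in scope. A sketch, with Rename and its scopes field both hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rename:
    name: str
    # Empty set means the rename applies to every converter.
    scopes: frozenset = frozenset()

def applies(marker: Rename, converter_scope: str) -> bool:
    """A converter checks whether it is in scope before honoring the marker."""
    return not marker.scopes or converter_scope in marker.scopes

api_rename = Rename("funcName", frozenset({"api"}))
print(applies(api_rename, "api"), applies(api_rename, "db"))  # True False
```

The annotation stays policy ("rename this, but only for API-facing output"); each converter supplies its own scope string and decides whether to perform the mechanism.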

diego-oncoramedical commented 2 months ago

Should you always rename a field? Or only for JSON output? [...] Which converters should do renaming? That is not currently handled in your example.

For what I need to do, JSON is the only permissible format, so I didn't bother separating out the logic that way.

For instance I have multiple JSON converters for the user facing API

Agh yeah I didn't think about that. That would certainly complicate things.

they need to be specified as policy rather than mechanism

I've thought about this for maybe two seconds so take it with a pound of salt, but you just gave me an idea for how to do this in a similar declarative way:

from __future__ import annotations
import cattrs.preconf.json as cattrs_json
from cattrs.policies import *

class LogRecordPolicy:
    json: JSONSubRecordPolicy
    kubernetes: KubernetesSubRecordPolicy
    kubernetes_cluster = Rename("kubernetes.cluster")

class KubernetesSubRecordPolicy:
    labels: KubernetesLabelsPolicy

class JSONSubRecordPolicy:
    function_name = Rename("funcName")
    log_level = Rename("levelname")
    line_number = Rename("lineno")
    extras = FieldPolicy(CatchAll, NoDump)  # Needed for fields w/ multiple things

class KubernetesLabelsPolicy:
    extras = FieldPolicy(CatchAll, NoDump)

_LOG_CONVERTER = cattrs_json.make_converter()
# Only the top level *needs* to be registered; policies are recursively extracted and
# matched by field name.
_LOG_CONVERTER.register_policy(LogRecord, LogRecordPolicy)

I'm not a huge fan of the duplication, but to reduce the burden on the programmer, fields that don't need special handling could be omitted and unrecognized fields ignored.
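The extraction step itself is trivial and wouldn't need deep cattrs integration. A sketch, where Rename and the policy class are hypothetical as in the example above:

```python
from dataclasses import dataclass

# Hypothetical directive, standing in for cattrs.policies.Rename.
@dataclass(frozen=True)
class Rename:
    name: str

class JSONSubRecordPolicy:
    function_name = Rename("funcName")
    log_level = Rename("levelname")
    line_number = Rename("lineno")

def extract_renames(policy: type) -> dict[str, str]:
    """Walk the policy class's attributes and collect Rename directives."""
    return {
        attr: value.name
        for attr, value in vars(policy).items()
        if isinstance(value, Rename)
    }

print(extract_renames(JSONSubRecordPolicy))
# {'function_name': 'funcName', 'log_level': 'levelname', 'line_number': 'lineno'}
```

A register_policy implementation could feed such a mapping into the overrides of the generated structure/unstructure functions, one policy class per model class.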

diego-oncoramedical commented 2 months ago

This could probably be done as a separate library that generates factory functions for converters, instead of needing to be integrated into the main cattrs library.

salotz commented 2 months ago

I've thought about this for maybe two seconds so take it with a pound of salt, but you just gave me an idea for how to do this in a similar declarative way:

Not a fan of that approach :) I think annotations or attrs field metadata would be clearer and less boilerplate.

This could probably be done as a separate library that generates factory functions for converters, instead of needing to be integrated into the main cattrs library.

Absolutely. Even if it was in cattrs it would be an optional "strategy" (which is just a function that registers a set of hooks/factories).
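To make the "strategy" shape concrete, here's a sketch against a stub converter, so nothing below depends on actual cattrs APIs; a real strategy would call the same register_* methods on a cattrs Converter:

```python
class StubConverter:
    """Stand-in exposing only the registration surface a strategy needs."""
    def __init__(self):
        self.structure_factories = []
        self.unstructure_factories = []

    def register_structure_hook_factory(self, predicate, factory):
        self.structure_factories.append((predicate, factory))

    def register_unstructure_hook_factory(self, predicate, factory):
        self.unstructure_factories.append((predicate, factory))

def use_annotations(converter) -> None:
    """Hypothetical strategy: a plain function that registers hooks/factories."""
    def has_annotations(cls) -> bool:
        return hasattr(cls, "__annotations__")

    def make_hook(cls):
        # A real implementation would inspect Annotated metadata on cls here;
        # this placeholder just passes values through.
        return lambda value, _type=cls: value

    converter.register_structure_hook_factory(has_annotations, make_hook)
    converter.register_unstructure_hook_factory(has_annotations, make_hook)

conv = StubConverter()
use_annotations(conv)
print(len(conv.structure_factories), len(conv.unstructure_factories))  # 1 1
```

Since the strategy is just a function, users opt in per converter, which also answers the scoping question: only converters you call use_annotations on would honor the markers.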

diego-oncoramedical commented 2 months ago

Not a fan of that approach :) I think annotations or attrs field metadata would be clearer and less boilerplate.

Yeah this is one of those things where you wake up the next day and are like "wait what".

Even if it was in cattrs it would be an optional "strategy" (which is just a function that registers a set of hooks/factories).

@jonathanberthias what do you think? It does go against that note in the docs, but the flexibility and separation of concerns advocated there do come at a high price for data that is solely intended to be round-tripped.

Tinche commented 2 months ago

Hi.

I'm on paternity leave so availability is limited. However, I just wanted to chime in that Annotated is too interesting and powerful for cattrs to completely ignore. Yeah, the point of cattrs is to keep serialization logic separate from models, but another point of cattrs is to be very configurable and a useful tool for its users, so we cannot discount Annotated use. Also, when we add better validation, it will probably be based on Annotated under the hood.

I would be 100% on board with us shipping a strategy that enables complex Annotated use. First though, maybe we should grow a small SDK for easier Annotated handling inside cattrs. I feel like this would give a lot of value to folks writing their own annotations, and to us at the same time.

I just don't know about the roadmap. I encourage you folks to play around with this on your own in the meantime; I'm happy to give guidance.