jcrist / msgspec

A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML
https://jcristharif.com/msgspec/
BSD 3-Clause "New" or "Revised" License
2.4k stars 72 forks source link

Integrate with the attrs ecosystem #316

Open Tinche opened 1 year ago

Tinche commented 1 year ago

Description

Hello! I think this is a cool library, and I think you should consider integrating with the attrs ecosystem.

A small disclaimer first: I'm a major contributor to attrs (behind @hynek), the author of cattrs, and at this point I'm slowly getting comfortable starting to say I'm one of the mypy attrs plugin maintainers.

I haven't had the chance to use msgspec for real yet. You've done some very impressive things efficiency-wise, but I feel there's a lot of overlap between some work here and some work in attrs/cattrs. So now I'm thinking of ways to integrate.

Strategically speaking, I think msgspec is cool but limited in a number of ways. From what I understand, you don't support generics and PyPy (maybe other constraints I'm not aware of, not sure if you support dict classes?), which would limit my personal use of the library. Also I'm attached to the way cattrs does customization of de/serialization.

However, here's the thing: if we integrate the way I'm thinking, there would be no need for you to actually do these things (you could if you wanted to, or you could focus on other aspects). We could integrate msgspec classes (er, structs) into the attrs system and msgspec de/serialization into cattrs.

A simple idea would be having msgspec as an attrs backend that would be opt-in - so if your class isn't generic, and you don't use PyPy, you'd choose to use msgspec, otherwise you'd use the default attrs backend and have an attrs class. The resulting class would contain __attrs_attrs__, and so would essentially be an attrs class.

What I mean by the attrs ecosystem: for example, Rich supports attrs classes, although I know you support Rich in msgspec. Another example is an OpenAPI generator I'm working on for attrs classes, which would just work for msgspec classes too. A third example is work I'm doing for better typing of attrs.fields for use in ORMs/ODMs; this would also just work.

The first and most important question, however, is if you are even open to doing this? If yes, we can bounce ideas.

jcrist commented 1 year ago

Hi! Thanks for opening this. A few thoughts inline:

From what I understand, you don't support generics and PyPy (maybe other constraints I'm not aware of, not sure if you support dict classes?), which would limit my personal use of the library.

Generic structs (#193) are currently unsupported, but are at the top of my list to implement. This is a solvable problem, and is something I'd expect to be resolved in the next 3 months (hopefully sooner).

By "dict classes" I assume you mean Structs with a __dict__ defined? Those are unplanned and unlikely to happen without a strong use case. Do you have a use case for those over relying on __slots__?

PyPy support is also not planned. It's doable, but it's a lot of work and isn't something I think is worth it.

Also I'm attached to the way cattrs does customization of de/serialization.

I need to write up an issue for this, but before 1.0 I plan to redo our extension mechanism. A single hook for encoding and decoding (enc_hook/dec_hook) doesn't provide enough granularity, and is limiting support for fancier types. This won't exactly match how cattrs does things, but will be closer than what we have now. Would you be open to giving some feedback on the design there once I write things up?

if your class isn't generic, and you don't use PyPy, you'd choose to use msgspec, otherwise you'd use the default attrs backend and have an attrs class. The resulting class would contain __attrs_attrs__, and so would essentially be an attrs class.

IIUC you're saying that you'd use msgspec to accelerate cattrs (or attrs?) for cases where msgspec would provide drop-in functionality? I think this may be doable, but I'm not sure what the benefit would be here? If you're not using msgspec to encode the entire message (which you wouldn't if you want to use cattrs throughout) then the benefit of msgspec would be less noticeable. Faster __init__/__eq__ for ordered types, but that may be it. That said, depending on what's needed here we may be able to make something work. I'd be against adding __attrs_attrs__ to Struct types internally, but if we can provide the proper hooks that you can plumb things together then sure!

Another approach would be adding native support to msgspec for encoding/decoding attrs types, similar to what we have for dataclasses. I was able to get serialization working in about 10 minutes of hacking tonight (deserialization is always harder), but this would also be doable.

In [1]: import msgspec

In [2]: import attrs

In [3]: @attrs.define
   ...: class Example:
   ...:     x: int
   ...:     y: list[str]
   ...:     z: dict[str, int]
   ...: 

In [4]: msg = Example(1, ["two", "three"], {"four": 5})

In [5]: msgspec.json.encode(msg)
Out[5]: b'{"x":1,"y":["two","three"],"z":{"four":5}}'

If we went this route I'd definitely want advice from you on how this should be handled - you are after all the expert on serializing attrs things :).

Anyway, I guess in summary:

Tinche commented 1 year ago

My initial thoughts were not focused on serialization, although that's a part of it. Even if serialization wasn't part of the story, I also wouldn't discount the benefits of faster __init__/__eq__ so quickly ;) (Also I assume the __repr__ can also be accelerated? For example, we repr() for our logs on every request.)

Let's talk vision first.

Let me describe my personal work situation to give you a better idea of where I'm coming from. Our stack is heavily based on attrs:

Due to the nature of attrs, all of these work together (a class can be shared between the databases and all the interfaces).

Just to be clear, this isn't about solving my problems at work. (If it was about that, I'd just write a msgspec-attrs adapter internally and use it. Also I would absolutely love open sourcing all of this stuff, so that's one of my long-term goals.) This is about joining efforts for the benefit of both open-source ecosystems.

For some of these use cases, I'm doing work in the broader open source community to take it a step further. For example, my next step for the Mypy plugin will be to implement support for attrs.fields(cl) to allow statically-checked database projections (don't want to segue too much here). Once certain other features land in Mypy, I'll pick up work on uapi (think FastAPI but cleaner and for attrs). Other people are building their own stuff on attrs.

So what I'd like is this: we create a situation where I can drop msgspec into this stack and it just works, and is faster where that's possible. It's definitely not just about cattrs. It's a shame if we make users pick between super fast JSON serialization and using the model in a database too.

If we figure out a way to integrate:

To answer some of your comments:

By "dict classes" I assume you mean Structs with a dict defined?

Yeah, there are some use cases, for example @cached_property and Exceptions (it's loosely-assumed arbitrary attributes can be attached to Exceptions in certain context, I remember this coming up when I was adding ExceptionGroups to cattrs and the __note__ PEP was being discussed).

Would you be open to giving some feedback on the design there once I write things up?

Sure, of course.

Another approach would be adding native support to msgspec for encoding/decoding attrs types

This would be sweet. It'd let me integrate msgspec into cattrs fairly easily, for example.

To finish up this overly long post, some thoughts on integration. This section is brainstormy, just me spitballing ideas.

You wouldn't need to change how msgspec works by default.

It would be awesome if it was possible to just do:

from msgspec import define, Factory

@define
class Test:
    a: int = 5
    b: list[int] = Factory(list)

Because it would just be a drop-in replacement for vanilla attrs. This might not be possible because msgspec.field is already something different than attrs.field so it might confuse typecheckers, I don't know, maybe this stuff could live in msgspec.attrs instead. This interface doesn't have to support all of the attrs features, although I think you have most of them anyway. (The Mypy plugin would have to be changed to take this into effect.) Also you wouldn't need to depend on attrs, these could be conditionally created if attrs is already present.

These classes, being both attrs and msgspec classes, would work with both cattrs and msgspec serialization. The step after that would be me integrating msgspec into cattrs by implementing msgspec converters for json and msgpack.

jcrist commented 1 year ago

This would be sweet. It'd let me integrate msgspec into cattrs fairly easily, for example.

I spent some time hacking together full attrs support tonight. If you could take a look at/try out #323 that would be much appreciated.

I promise I'll do a longer writeup response to your comment above sometime tomorrow, it's already late here :(.

jcrist commented 1 year ago

Also I assume the repr can also be accelerated? For example, we repr() for our logs on every request.)

That's true, a quick benchmark shows ~2x faster for a simple message (from external conversations it sounds like you also might like #322, which should also be faster).

Repr benchmark Code ```python In [1]: import attrs, msgspec [0/300] In [2]: @attrs.define ...: class Test: ...: field_one: int ...: field_two: int ...: field_three: int ...: field_four: int ...: In [3]: attrs_obj = Test(1, 2, 3, 4) In [4]: class Test(msgspec.Struct): ...: field_one: int ...: field_two: int ...: field_three: int ...: field_four: int ...: In [5]: msgspec_obj = Test(1, 2, 3, 4) In [6]: %timeit repr(attrs_obj) 740 ns ± 2.34 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each) In [7]: %timeit repr(msgspec_obj) 331 ns ± 0.586 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each) ```

Yeah, there are some use cases, for example @cached_property and Exceptions (it's loosely-assumed arbitrary attributes can be attached to Exceptions in certain context, I remember this coming up when I was adding ExceptionGroups to cattrs and the __note__ PEP was being discussed).

Wait, are you saying you use attrs when defining custom exception subclasses? That's a use case I'd never thought of.

In case there's any confusion, let me be clear on what msgspec supports here:

It would be awesome if it was possible to just do: ...

I don't know, maybe this stuff could live in msgspec.attrs instead. This interface doesn't have to support all of the attrs features, although I think you have most of them anyway.

I was picturing attrs (or cattrs) optionally using msgspec behind the scenes here, not msgspec providing an attrs-like interface. We could do that, but I'd worry that any inconsistencies in implementation would lead to subtle issues on the user's end. What requirements are there for the output of define being an attrs object? Just matching the define kwargs and behaviors (for things we support)? Or do you also need the generated __attrs_attrs__ attribute? And if so, how often does that interface change?

Also note that msgspec.Struct types work by subclassing; any magic define decorator made here would have to inject the Struct base class into the class hierarchy to work properly.

The Mypy plugin would have to be changed to take this into effect

I've been hoping that we'll never have to write a mypy plugin (#19) with the new typing.dataclass_transform work, but mypy support for this is still lagging. If you're ever feeling motivated to write a new mypy plugin, I'd love some help here :).


Stepping back to look places where attrs might interact with msgspec:

Are there other interaction points missing here?

Tinche commented 1 year ago

Hi, I caught a cold so my replies here will be somewhat delayed, sorry ;)

jcrist commented 1 year ago

No worries, hope you feel better soon!

I'm hoping to cut another release in the next week or two - do you think you'll have time to review #323 before then? We can always extend our attrs support, but changing existing behavior after a release is harder.

jcrist commented 1 year ago

Hope you're feeling better @Tinche - any chance you could provide your thoughts on #323 sometime in the coming week? If possible I'd like to include a first pass at attrs support in the next release.

Tinche commented 1 year ago

Yessir, putting this on my shortlist

jcrist commented 1 year ago

@Tinche, do you think you'll have time to review #323 soon (as well as respond to my comment above)? If you don't have time to continue with this issue in the next week I'm going to close this as stale (we can always pick it up again later).

Tinche commented 1 year ago

Hello,

I've kinda had to declare open source bankruptcy recently since I'm way behind on everything, hence all the delays.

So let me try answering some of your questions.

I think it's perfectly ok if the hypothetical msgspec.define isn't 100% compatible with attrs.define since it's an opt-in thing, users will know what they're getting into. So slot classes only perfectly fine.

We'd need the __attrs_attrs__ attribute, since tooling requires it. I'd say the attribute is half of the point here. __attrs_attrs__ is an instance of a tuple subclass (it's not a NamedTuple, since those have restrictions on their attribute naming that are too severe for this use case), with properties that return attrs.Attribute objects. attrs.Attribute changes infrequently but it does change sometimes. We've never had other users of it though. I'm guessing we set up a communication channel so we let you know of upcoming changes (maybe I can contribute PRs when it happens?).

Also note that msgspec.Struct types work by subclassing; any magic define decorator made here would have to inject the Struct base class into the class hierarchy to work properly.

I don't think this is a problem, attrs.define already does crimes to add __slots__ to classes that don't have them. So you'd recreate the class with msgspec.Struct in the MRO too. There are some tricky that need to be done with rewriting closure cells in methods so things like bare super() works that we can help you with (dataclasses have the same problem, but we try harder than them to solve it).

I think your summary hits the nail on the head. But like I mentioned, I think speed (in __init__, __eq__, __repr__ is very valuable for some use cases (my use cases). If I can get significantly faster class construction by switching my classes to msgspec.attrs.define while the rest of my tooling continues just working, I'm going to do it, and I won't be the only one probably. I think you'd just need to provide a good subset, not the entire subset.

jcrist commented 1 year ago

I've kinda had to declare open source bankruptcy recently since I'm way behind on everything, hence all the delays.

No worries, I've definitely been there. Thanks for the reply!

I think I have enough info to hack something together given your answers above. No promises on when I might get around to this, but an attrs-compatible shim seems potentially useful, and at worst would be a fun thing to hack on.

nrbnlulu commented 1 year ago

Are there any conclusions out of this issue?