jcrist / msgspec

A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML
https://jcristharif.com/msgspec/
BSD 3-Clause "New" or "Revised" License
2.28k stars 67 forks source link

Support omitting Struct attributes from encoding/decoding #199

Open jcrist opened 1 year ago

jcrist commented 1 year ago

Since Struct types force users to declare all Struct attributes, there's currently no way to hold extra state on a Struct without having it also be encoded/decoded.

Two ways to handle this:

class Point(Struct):
    x: int
    y: int
    _private: Omit[some_type]
IcyMidnight commented 1 year ago

Yes please! I would like to cache some (expensive) computed values in my objects, but not serialize them.

jcrist commented 1 year ago

Thanks for the motivating use case (and apologies for the delayed response)! I hope to tackle this issue in the next few weeks before issuing a new release.

Since you commented here - as a user, would you want omitted attributes to also be excluded from __init__? Giving a concrete example:

class Point(Struct):
    x: int
    y: int
    _private_1: Omit[int]
    _private_2: Omit[int] = 0

Should __init__ include _private_1 and _private_2 arguments? Or should they be excluded from __init__, and only default values filled in? If they're excluded, that would mean that _private_1 would raise an AttributeError until it's explicitly set, and _private_2 would have an initial value of 0 after __init__.

IcyMidnight commented 1 year ago

Honestly I haven't been using the constructors much (decoding from a web request, encoding to a disk cache, decoding from the disk cache) so I don't have a strong opinion. I guess given that the object is valid without those values, the constructor shouldn't require them, but it would be nice if you could include them.

Also I would lean towards excluding attributes with leading underscores by default (without the Omit type), but I could see it being more annoying to have to include them if some API you're using has names like that.

jcrist commented 1 year ago

Ok, we finally have the internals setup to make this fairly straightforward to implement.

First, I plan to call these "private" attributes, since that's effectively what they are. A private attribute won't be encoded or decoded, but will have space allocated on its respective struct.

There are some open questions though.

How should this be spelled?

There are two places that field configuration like this can go:

In the type annotation

class Example(Struct):
    x: int
    y: Private[int]
    z: Private[float] = 0.0

An option in msgspec.field

class Example(Struct):
    x: int
    y: int = msgspec.field(private=True)
    z: int = msgspec.field(default=0.0, private=True)

The former I find a bit easier to read, but it adds some complications:

Right now I'm leaning towards the msgspec.field approach. Many users will already be familiar with dataclasses.field, which includes support for omitting fields from the generated __init__ via dataclasses.field(init=False). This is similar enough to private=True that it shouldn't be that wild of a concept.

Should fields with leading underscores be private by default?

Since we're providing a way to set whether a field is public/private explicitly, users can always override any default behavior chosen here. Since fields with leading underscores are usually treated as private in python by convention, I'm tempted to do so here as well. So the following would be equivalent:

class Example1(Struct):
    _field: int = 0  # private, since the leading underscore

class Example2(Struct):
    _field: int = field(default=0, private=True)  # explicitly private

On the other hand, I'd expect most users to never need to define private fields, so maybe inferring private/public status from the field name itself is overreaching.

Are private attributes excluded from __init__?

Right now I'm planning on excluding private attributes from basically everything in msgspec. They'll have space allocated in the __slots__ of the generated Struct type (and corresponding attributes defined), but will be excluded from:

This makes them most useful for including extra state during runtime that is initialized after the Struct instance is already created. cached_property usage is related here (#204), since you'd need a slot allocated to store the cached value.

If you have a use case for including private attributes in __init__, please let comment below. If needed, worst case we can add another init parameter for including/excluding a field from __init__. private=True will imply init=False by default, but you can manually re-include it as field(private=True, init=True).

Should private fields require a default value/factory?

Right now all msgspec.Struct types require all fields be initialized as part of the generated __init__. The only way to get an AttributeError for a field on a struct is if you manually del the attribute off an existing struct. I think this makes sense. mypy/pyright type annotations are designed to handle types (or unions of types), but not the presence/absence of an attribute.

Keeping with this design (and assuming private fields are excluded from __init__), it may make sense to enforce that all private fields have a default value (since there's no other way a value could be set).

class Example(Struct):
    private1: int | None = field(private=True, default=None)  # this is ok, there's a default value
    private2: int | None = field(private=True)  # this is invalid, there's no default value

Summary

In summary, my current leanings on these questions are:

I don't plan on adding this before the next release (hopefully out next week), but after that whatever choices made here should (hopefully) be fairly straightforward to implement :crossed_fingers: .

Solumin commented 1 year ago

This would be of great use to my current project --- like IcyMidnight, we've got a cache that would be really pointless to encode. The current plan is to clear the cache before encoding, which will work as long as we remember to do it.

  • Use msgspec.field for configuring if a field is private.
  • Make fields with a leading underscore private by default.

What happens if you do _something = field(private=False)? I think this should be allowed: a private field (in the Python convention sense) that's not private (in the msgspec sense).

Additionally, automatically treating fields with a leading underscore as private would break backwards compatibility. I found a few examples of Structs with _private fields which makes me think that making private fields an explicit, opt-in choice is a better idea.

Really looking forward to this feature, I hope you have time to work on it soon.

illeatmyhat commented 1 year ago

Adding a private=True parameter to field() seems fine to me. Private omitted fields are the only way to have self-referential Structs. e.g. a nested and occasionally recursive data structure with access to the _root node.

We can also use it to compute and store the path in that data structure because Structs aren't aware of their parent nodes.

FeryET commented 1 year ago

Hi. Any update on this? I'm confused whether the library has this attribute.

ahopkins commented 1 year ago

Right now I'm leaning towards the msgspec.field approach. Many users will already be familiar with dataclasses.field, which includes support for omitting fields from the generated init via dataclasses.field(init=False). This is similar enough to private=True that it shouldn't be that wild of a concept.

If this is still up for debate, I concur that this is the right approach.

If you have a use case for including private attributes in init, please let comment below. If needed, worst case we can add another init parameter for including/excluding a field from init. private=True will imply init=False by default, but you can manually re-include it as field(private=True, init=True).

class Primitive(Struct):
    iid: int = field(private=True)    # internal ID, think auto-incrementing sequence meant to be private
    eid: str = field(private=False)   # external ID, some sort of public-exposed string identifier

In this use case, the object when fetched from the DB would have both IDs so the application layer has them available. But, when serialized as an HTTP response, we strip out the private iid.

Use msgspec.field for configuring if a field is private.

Agreed.

Make fields with a leading underscore private by default.

This is a nice API :sunglasses:

Exclude private fields from init. We can add a way to override this later if needed.

See my comment above. I am not sold on this one. The two paramater approach makes more sense to me as they are different use cases. I am fine if private=True implies init=False. But, I want to have the flexibility for private=True, init=True.

Require all private fields set a default value.

I guess I am okay with this, as long as I can still require it as an __init__. But once that is allowed, it seems to me this requirement falls off.

Solumin commented 10 months ago

Would private fields also be excluded from __struct_fields__?

dhirschfeld commented 10 months ago

I came here looking for this functionality so I can store some extra metadata on a Struct that I don't want to be serialized or part of the __init__.

I'm envisaging an api where I can have an alternative constructer which can take optional (metadata) kwargs and store them on the struct without having them be included in the serialized representation - e.g.

class MyStruct(msgspec.Struct, tag=True):
    metadata: dict = field(default={}, private=True)

    @classmethod
    def create(cls, **kwargs) -> MyStruct:
        instance = MyStruct()
        instance.metadata.update(**kwargs)
        return instance
>>> obj = MyStruct.create(key='value')

>>> obj
MyStruct(metadata={'key': 'value'})

>>> msgspec.json.encode(obj)
b'{"type":"MyStruct"}'

>>> msgspec.json.decode(msgspec.json.encode(obj), type=MyStruct)
MyStruct(metadata={})

My actual use-case is a bit more complex but it boils down to wanting metadata associated with my objects but only for their lifetime in the current interpreter.

dhirschfeld commented 10 months ago

Another example, of something I'd like to do is to attach an fsspec.AbstractFileSystem to the class which would then be used in a save method to serialize itself and write to disk - e.g.

class MyStruct(msgspec.Struct, tag=True):
    metadata: dict = field(default={})
    fs: fsspec.AbstractFileSystem = field(private=True)

    @classmethod
    def create(cls, fs: fsspec.AbstractFileSystem, **kwargs) -> MyStruct:
        instance = MyStruct()
        instance.metadata.update(**kwargs)
        instance.fs = fs
        return instance

    def save(self, filepath: str) -> None:
        self.fs.write_bytes(
            path=filepath,
            value=msgspec.msgpack.encode(self),
        )

msgspec doesn't know how to serialize an AbstractFileSystem, and I wouldn't want it to be serialized along with my actual data, however it would be convenient to be able to store it on the instances.

c0x65o commented 5 months ago

Any updates on this?