jcrist / msgspec

A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML
https://jcristharif.com/msgspec/
BSD 3-Clause "New" or "Revised" License

Flattening Structs #315

Open davfsa opened 1 year ago

davfsa commented 1 year ago

Description

This is more of a question than a feature request, but could turn into one.

One of my uses when it comes to deserialising something similar to:

{
    "version": 1,
    "data": {
        "options": [{}, {}]
    }
}

into

class Option(msgspec.Struct):
    ...

class Foo(msgspec.Struct):
    version: int
    options: list[Option]

I have scoured the documentation and can't find an easy way to do this. My current approach is to deserialise the JSON to a dict and then parse that dict into classes (using attrs), but I would like to move away from it to reduce the amount of code to maintain (the reason I have been looking at msgspec, apart from the obvious speed gains!)

Thanks!

luochen1990 commented 1 year ago

Is msgspec.json.decode(msg, type=Foo) what you want?

davfsa commented 1 year ago

> Is msgspec.json.decode(msg, type=Foo) what you want?

Yeah, it would be nice to be able to call msgspec.json.decode(msg, type=Foo) and have it be aware that options can be found inside the data field and extracted from there.

jcrist commented 1 year ago

Hi! Support for flattening structs would be hard. It's doable, but not easily - there's a bunch of edge cases that can pop up as features are mixed together. I'd be happy to write up what makes this hard if you're interested, but in short I don't have plans to add this feature.

That said, I'm curious about your use case. Why do you want to flatten the runtime structure here? Why not write out the full structure of Foo matching how it's serialized?

In [8]: class Option(msgspec.Struct):
   ...:     x: int  # made up some fields for here
   ...:

In [9]: class Data(msgspec.Struct):
   ...:     options: list[Option]
   ...:

In [10]: class Foo(msgspec.Struct):
    ...:     version: int
    ...:     data: Data                        
    ...:

In [11]: msg = """                             
    ...: {                                     
    ...:     "version": 1,
    ...:     "data": {                         
    ...:         "options": [{"x": 1}, {"x": 2}]
    ...:     }                                 
    ...: }                                     
    ...: """                                   

In [12]: msgspec.json.decode(msg, type=Foo)
Out[12]: Foo(version=1, data=Data(options=[Option(x=1), Option(x=2)]))

davfsa commented 1 year ago

Thanks for the answer!

The reason for this is mostly an opinionated approach to an API wrapper I am working on. The data field for this payload feels a bit clunky and useless, as it doesn't really contain much and just makes things harder to access, especially because Options contain more Options:

obj.data.options.data.options
# vs
obj.options.options

It was a choice we went with when implementing this part of the API, for simplicity's sake.

When I opened the issue my idea for this was something along the lines of:

class Foo(msgspec.Struct):
    version: int
    option: Option = msgspec.field(location="data__option")

A little side effect here would be that this also allows a syntax for renaming attributes.


As a quick info dump, since this idea has been coming and going in my head, the syntax would go something like this:

Which I believe should cover all use cases for this.

A tricky case I also thought about would be:

{
    "data": {
        "option": {}
    },
    "data__": {
        "option": {}
    }
}

class Obj(msgspec.Struct):
    data: Data
    data__: MoreData
    data_option: Option = msgspec.field(location="data__option")
    more_data_option: Option = msgspec.field(location="data____option")
    # or (which would be equivalent)
    some_data: Data = msgspec.field(location="data")
    some_more_data: Data = msgspec.field(location="data__")
    data_option: Option = msgspec.field(location="data__option")
    more_data_option: Option = msgspec.field(location="data____option")

In this case, the data fields will resolve properly: whether a location is treated as a plain key or as a path to flatten is decided by whether the key exists, with the exact key taking priority.

For extreme cases that I don't believe can really be found in the wild, an extra argument could be added to force a location to be treated as a flattener.


I understand this could be a lot more work than it is actually useful, but I just wanted to dump the idea. I unfortunately don't have the C skills to try and implement this myself, but would love to try.

I am also interested in the limitations that you mentioned, as they might render my whole idea useless, since I lack information on the internals of msgspec :sweat_smile:
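
One possible reading of the lookup rule proposed above, sketched in plain Python over an already-decoded dict. Nothing here is msgspec API: `resolve` is a hypothetical helper, `location` is only the keyword suggested in this comment, and the `source` values are made up so the two branches can be told apart. An exact key match is tried first, and only then is `__` treated as a nesting separator.

```python
from typing import Any

_MISSING = object()  # sentinel, so that None stays a legal payload value


def resolve(payload: dict, location: str) -> Any:
    """Look up `location`, treating "__" as a nesting separator.

    An exact key match always wins, so a literal key such as "data__"
    is never mistaken for a path.
    """
    found = _resolve(payload, location)
    if found is _MISSING:
        raise KeyError(location)
    return found


def _resolve(node: Any, location: str) -> Any:
    if not isinstance(node, dict):
        return _MISSING
    if location in node:
        return node[location]
    # Try every position of the separator, left to right.
    start = location.find("__")
    while start != -1:
        head, tail = location[:start], location[start + 2:]
        if head in node:
            found = _resolve(node[head], tail)
            if found is not _MISSING:
                return found
        start = location.find("__", start + 1)
    return _MISSING


# The tricky payload from above, with made-up markers inside each option.
payload = {
    "data": {"option": {"source": "data"}},
    "data__": {"option": {"source": "data__"}},
}

print(resolve(payload, "data__"))          # {'option': {'source': 'data__'}}
print(resolve(payload, "data__option"))    # {'source': 'data'}
print(resolve(payload, "data____option"))  # {'source': 'data__'}
```

Trying every split position is what lets `data____option` fall through to the `data__` key once the `data` branch fails. It also hints at why this gets complicated: the cost and the ambiguity both grow with the number of separators in a location.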

Rogdham commented 1 year ago

The rename mechanism could probably be used for this (from the point of view of the library's user), something like this:

class Option(msgspec.Struct):
    x: int

foo_names = {
    "options": ["data", "options"],  # for example, TBD
}

class Foo(msgspec.Struct, rename=foo_names):
    version: int
    options: list[Option]

AFAIK Pydantic will support flattening in V2:

class Foo(BaseModel):
    bar: str = Field(aliases=[['baz', 2, 'qux']])

They have probably thought about edge cases, so it might be worth looking into as a good starting point.
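
To make the path-style mapping concrete, here is a minimal sketch of what such a lookup could do on a decoded dict. It is plain Python, not msgspec's `rename` (which, per the comment above, would need extending to accept paths); `get_path` and `flatten` are hypothetical helper names.

```python
from functools import reduce
from typing import Any, Union

Path = list[Union[str, int]]


def get_path(payload: Any, path: Path) -> Any:
    """Follow string keys and integer indices into nested dicts and lists."""
    return reduce(lambda node, key: node[key], path, payload)


def flatten(payload: dict, field_names: list[str], paths: dict[str, Path]) -> dict:
    """Build a flat dict with one entry per field.

    Fields without an entry in `paths` are read from the top level.
    """
    return {name: get_path(payload, paths.get(name, [name])) for name in field_names}


payload = {"version": 1, "data": {"options": [{"x": 1}, {"x": 2}]}}
foo_names = {"options": ["data", "options"]}

flat = flatten(payload, ["version", "options"], foo_names)
print(flat)  # {'version': 1, 'options': [{'x': 1}, {'x': 2}]}

# Integer steps index into lists, as in the ['baz', 2, 'qux'] alias above.
print(get_path({"baz": [None, None, {"qux": "hello"}]}, ["baz", 2, "qux"]))  # hello
```

The flat dict could then be passed to `msgspec.convert(flat, type=Foo)` for validation, at the cost of decoding to untyped Python objects first.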

ml31415 commented 1 year ago

@davfsa If it's only about making the parsed objects more usable, what about simply:

class Foo(msgspec.Struct):
    version: int
    data: ...

    @property
    def options(self):
        return self.data.options

You might even hide the original data field, having it renamed to e.g. _data.

mjkanji commented 1 year ago

I have a similar use case. This is what the data looks like:

{
    "username": "jcrist",
    "attributes": [
        {"Name": "first_name", "Value": "Jim"},
        {"Name": "last_name", "Value": "Crist"},
        ...
    ]
    ...
}

I'd like to model it such that the attribute keys (like first_name) and the corresponding Values are attributes of the Struct and also type validated. That is,

class User(Struct):
    username: str
    first_name: str
    last_name: str

msgspec.json.decode(data, type=User)
# > User(username='jcrist', first_name='Jim', last_name='Crist')

Even if I created a new Attribute struct and set attributes: list[Attribute], there's no (obvious) way to validate the type of the Value based on what the Name is.

(PS: Not sure if this is the right issue to ask this; it seemed very similar to mine, but also slightly different because there's a level of...indirection(?), where the relevant key-value pairs are 'hidden' under the Name and Value keys of the list of dicts. Let me know if I should create a new issue instead.)


For reference, I found a solution to a similar problem using Pydantic's @root_validator(pre=True) decorator. [Stack Overflow comment, example code]
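
A rough sketch of the same pre-processing idea without Pydantic: pivot the Name/Value pairs into top-level keys before building the object. This is plain Python, with a dataclass and an `isinstance` check standing in for a msgspec Struct and its validation; `pivot_attributes` and `decode_user` are hypothetical names, and the elided parts of the payload are left out.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class User:
    username: str
    first_name: str
    last_name: str


def pivot_attributes(payload: dict) -> dict:
    """Merge the Name/Value pairs under "attributes" into the top level."""
    flat = {key: value for key, value in payload.items() if key != "attributes"}
    for item in payload.get("attributes", []):
        flat[item["Name"]] = item["Value"]
    return flat


def decode_user(raw: str) -> User:
    flat = pivot_attributes(json.loads(raw))
    kwargs = {}
    for f in fields(User):
        value = flat[f.name]  # KeyError if a declared field is missing
        if not isinstance(value, f.type):
            raise TypeError(
                f"{f.name}: expected {f.type.__name__}, got {type(value).__name__}"
            )
        kwargs[f.name] = value
    return User(**kwargs)


raw = """
{
    "username": "jcrist",
    "attributes": [
        {"Name": "first_name", "Value": "Jim"},
        {"Name": "last_name", "Value": "Crist"}
    ]
}
"""
print(decode_user(raw))  # User(username='jcrist', first_name='Jim', last_name='Crist')
```

With msgspec, the pivoted dict could instead be handed to `msgspec.convert(flat, type=User)`, which would then validate each value against the declared field type.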

cutecutecat commented 11 months ago

I also ran into a similar case. I think this kind of data schema occurs frequently with GraphQL APIs.

{
  "data":{
    "issues":{
      "nodes":[
        {
          "id":"12345"
        },
        {
          "id":"67890"
        }
      ]
    }
  }
}

Thanks to @ml31415, https://github.com/jcrist/msgspec/issues/315#issuecomment-1572238202 helps a lot, but I still need to define 4 one-line structs to express it. I would be really grateful if there were native support.

ml31415 commented 11 months ago

@mjkanji

What you could do is create tagged attribute objects. Then msgspec can distinguish them and you can add some verification.


class Attribute(msgspec.Struct, tag_field="Name"):
    pass

class Firstname(Attribute, tag="first_name"):
    Value: str  # add validation for first_name here as required

class Lastname(Attribute, tag="last_name"):
    Value: str  # separate validation for last_name goes here

Attribute = Firstname | Lastname

class User(msgspec.Struct):
    username: str
    attributes: list[Attribute]

Otherwise, if it's just about making the object easier to access rather than modifying the data, again just use property-like access. Roughly like this:

class User(msgspec.Struct):
    username: str
    attributes: list[Attribute]

    def _attribute_dict(self):
        return {attr.Name.lower(): attr.Value for attr in self.attributes}

    def __getattr__(self, attr):
        try:
            return self._attribute_dict()[attr]
        except KeyError:
            raise AttributeError(attr)

ml31415 commented 11 months ago

Hi @cutecutecat, if you don't care about further fields of "data" and "issues", just go with ordinary dictionaries and happily nest the type definition:

from typing import Literal

class Node(msgspec.Struct):
    id: int

class Container(msgspec.Struct):
    data: dict[Literal["issues"], dict[Literal["nodes"], list[Node]]]

    @property
    def nodes(self):
        return self.data["issues"]["nodes"]

>>> container = msgspec.json.decode(data, type=Container, strict=False)
>>> container.nodes
[Node(id=12345), Node(id=67890)]

notpushkin commented 2 months ago

I'm currently working on a Docker API client and flattening would be really useful.

For example, we have a struct like this:

class ServiceSpec(Struct):
    name: str
    labels: dict[str, str]
    image: str
    environment: list[str]

And Docker expects something like this:

{
  "Name": "web",
  "Labels": {"com.docker.example": "string"},
  "TaskTemplate": {
    "ContainerSpec": {
      "Image": "nginx:alpine",
      "Env": ["SECRET_KEY=123"]
    }
  }
}

To achieve this, I currently use the following hack:

```py
class DockerContainerSpec(Struct):
    image: str = field(name="Image")
    environment: list[str] = field(name="Env")

    @classmethod
    def from_spec(cls, spec: ServiceSpec):
        obj = msgspec.convert(spec, cls, from_attributes=True)
        return obj


class DockerTaskTemplate(Struct):
    _container_spec: DockerContainerSpec = field(default=None, name="ContainerSpec")

    @classmethod
    def from_spec(cls, spec: ServiceSpec):
        obj = msgspec.convert(spec, cls, from_attributes=True)
        obj._container_spec = DockerContainerSpec.from_spec(spec)
        return obj


class DockerService(Struct):
    name: str = field(name="Name")
    labels: dict[str, str] = field(name="Labels")
    _task_template: DockerTaskTemplate = field(default=None, name="TaskTemplate")

    @classmethod
    def from_spec(cls, spec: ServiceSpec):
        obj = msgspec.convert(spec, cls, from_attributes=True)
        obj._task_template = DockerTaskTemplate.from_spec(spec)
        return obj
```

This is a bit clumsy, but works out fairly well:

>>> spec = ServiceSpec(
...     name="app",
...     labels={},
...     image="nginx:alpine",
...     environment=["HELLO=world"]
... )
>>> msgspec.json.encode(DockerService.from_spec(spec))
b'{"Name":"app","Labels":{},"TaskTemplate":{"ContainerSpec":{"Image":"nginx:alpine","Env":["HELLO=world"]}}}'

UPD: this can be refactored as a wrapper for msgspec.convert: https://gist.github.com/notpushkin/3639f45acd2aa053b9d2416375135045 (see example at the bottom)
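
For comparison, the encoding direction can also be sketched as a single path table applied to a flat object. This is a plain-Python illustration and not the gist's implementation: a dataclass stands in for the `ServiceSpec` Struct, and `SERVICE_PATHS` and `unflatten` are hypothetical names.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ServiceSpec:
    name: str
    labels: dict[str, str]
    image: str
    environment: list[str]


# Hypothetical mapping from flat attribute names to nested wire paths.
SERVICE_PATHS = {
    "name": ["Name"],
    "labels": ["Labels"],
    "image": ["TaskTemplate", "ContainerSpec", "Image"],
    "environment": ["TaskTemplate", "ContainerSpec", "Env"],
}


def unflatten(flat: dict, paths: dict[str, list[str]]) -> dict:
    """Place each flat value at its nested path, creating dicts as needed."""
    nested: dict = {}
    for name, path in paths.items():
        node = nested
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = flat[name]
    return nested


spec = ServiceSpec(
    name="app",
    labels={},
    image="nginx:alpine",
    environment=["HELLO=world"],
)
body = unflatten(asdict(spec), SERVICE_PATHS)
print(json.dumps(body, separators=(",", ":")))
# {"Name":"app","Labels":{},"TaskTemplate":{"ContainerSpec":{"Image":"nginx:alpine","Env":["HELLO=world"]}}}
```

The trade-off against the Struct-based hack is that the intermediate dict is untyped, so any validation of the nested shape has to happen elsewhere.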