ijl / orjson

Fast, correct Python JSON library supporting dataclasses, datetimes, and numpy
Apache License 2.0
6.07k stars 211 forks source link

Dataclass serialization includes non-fields. #83

Closed beanaroo closed 2 years ago

beanaroo commented 4 years ago

Thank you for the amazing project. I'm working on a dataclasses based library that processes very large and nested json output and I have seen amazing results.

While working on porting, I've noticed an inconsistency, in comparison to using asdict, where non-field attributes are being serialized.

Is this expected behaviour? It makes using dataclass fields for schema validation quite a bit harder.

import orjson
from dataclasses import dataclass, InitVar, asdict, fields

@dataclass
class Person:
    first_name: InitVar[str]
    last_name: InitVar[str]
    full_name: str = None

    def __post_init__(self, first_name: str, last_name: str):
        self._idx_name = f'{last_name}, {first_name[0]}'
        self.full_name = f'{first_name} {last_name}'

gary = Person(first_name='Gareth', last_name='Johnson')

print('FIELDS:', len(fields(gary)))
# FIELDS: 1

print('DICT:', asdict(gary))
# DICT: {'full_name': 'Gareth Johnson'}

print('JSON:', orjson.dumps(gary).decode())
# JSON: {"full_name":"Gareth Johnson","_idx_name":"Johnson, G"}
ijl commented 4 years ago

The optimization for dataclasses without __slots__ is to serialize __dict__. This use case could be solved by skipping serialization of attributes where the key has a leading underscore. How does that impact your library?

beanaroo commented 4 years ago

Interesting. Thank you for the insight.

The above is merely a sample. In my case, I'm actually storing non-protected InitVar fields for usage in external methods too. I'll see if __slots__ are an option as the model is quite static.

Dataclasses have __dataclass_fields__. Could this perhaps improve the portability of orjson in general?

ijl commented 4 years ago

I think with the distinction that dataclass are objects rather than dict it makes sense to not serialize attributes that being with a leading underscore--those are "private" in Python. But do you need more than that for "storing non-protected InitVar fields"--are those not with an underscore and you can't modify it?

It does currently read __dataclass_fields__ to duck type a dataclass. But iterating through that requires extra libpython function calls per attribute and so is much slower than using __dict__.

beanaroo commented 4 years ago

Ah, that makes sense. The performance hit is unfortunate.

From the comparisons in the documentation, I anticipated orjson to be compatible with asdict() and opened this issue as I thought it might be a bug.

But if the resulting output is different by design, please feel free to close. :slightly_smiling_face:


My specific case is due to the target schema.

Where the InitVar is used as a key for the relevant object instead of being an attribute as well as a part of a computed property reference. i.e.:


@dataclass
class SecurityGroup:
    resource_name: InitVar[str]
    to_port: int
    from_port: int

web_sg = SecurityGroup('webserver_ssl', from_port=443, to_port=443)

web_srv = Instance('webserver', security_group=web_sg.id)
{
  "resources": {
    "security_group": {
      "webserver_ssl": {
        "from_port": 443,
        "to_port": 443
      },
      "database_pgsql": {
        "from_port": 5432,
        "to_port": 5432
      }
    },
    "instance": {
      "webserver": {
        "name": "Backend",
        "security_group": "security_group.webserver_ssl.id"
      },
      "database": {
        "name": "PostgreSQL",
        "security_group": "security_group.database_pgsql.id"
      }
    }
  }
}

For now, I am calling asdict() and passing the result to orjson, this is obviously not as performant as using orjson directly, but still has significant benefit.

wyfo commented 4 years ago

I'm commenting here because I'm facing the same issue as @beanaroo I've indeed a (young) library of json (de)serialization with dataclasses, and I use a hidden attribute to track fields set in __init__ (in order to not serialize their default value) ; this attribute being a set, it makes orjson raise an error.

I think with the distinction that dataclass are objects rather than dict it makes sense to not serialize attributes that being with a leading underscore--those are "private" in Python.

@ijl I agree with your sentence and I'm all in favor of this; if you want, I can do a PR.

ijl commented 4 years ago

@wyfo ok, it does that in 3.0.2.

philer commented 3 years ago

I just got hit by this "fix". I was very confused when one of my keys got swallowed with no further comment.

orjson.dumps() does not serialize dataclasses.dataclass attributes that begin with a leading underscore, e.g., _attr. This is because of the Python idiom that a leading underscores marks an attribute as "private."

(3.0.2 release notes)

The thing is, I have underscored properties in my dataclasses which I want to get serialized. I'll describe my use case below but here are some general arguments first:

  1. Serialization of dataclasses should match the dataclasses.asdict() representation. This is not explicitly stated by the README but the comparison for benchmarking purpose kind of implies it.
  2. Underscored "private" properties are merely a convention and even if you follow that convention you may still want to serialize private properties (see below).
  3. It is inconsistent with how underscored dict keys are treated (orjson.dumps(dict(_key=1)) == b'{"_key":1}').

My use case is message passing between different processes (including different languages). I'm using a _type property to identify the message type so it can be deserialized to the correct type on the other end. Here's the relevant part of my base class:

@dataclass(frozen=True)
class Message(ABC):
    _type: str = field(init=False, repr=False, compare=False)

    types = dict()

    def __post_init__(self):
        super().__setattr__('_type', self.__class__.__name__)

    def __init_subclass__(cls):
        Message.types[cls.__name__] = cls

    @staticmethod
    def from_dict(data):
        return Message.types[data.pop("_type")](**data)

Note, that Message.from_dict does the inverse of dataclasses.asdict assuming the _type.

In short, please serialize underscore keys again or at least make it configurable (without having to overwrite the default serializer).

Whatever your decision, thanks for this tool and the work you're doing!

edit: If you'd prefer a separate issue for this just let me know. ;)