jpmorganchase / py-avro-schema

Generate Apache Avro schemas for Python types including standard library data-classes and Pydantic data models.
https://py-avro-schema.readthedocs.io/
Apache License 2.0
37 stars · 6 forks

pydantic classes as default values not serializable #64

Closed dada-engineer closed 8 months ago

dada-engineer commented 8 months ago

Hi everyone,

I have a pydantic model with an attribute whose type is another pydantic model and whose value should be generated by default. Schema generation fails because py_avro_schema puts the pydantic model instance into the schema dict as the field default, and such instances are not JSON serializable by orjson out of the box.

example.py

import py_avro_schema as pas
from pydantic import BaseModel, Field

class Bar(BaseModel):
    baz: int = 0

class Foo(BaseModel):
    bar: Bar = Field(default_factory=Bar)

pas.generate(Foo)

This raises the following error:

Traceback (most recent call last):
  File "/Users/gellertd/workspace/procureai/foundation/constellation/example.py", line 13, in <module>
    pas.generate(Foo)
  File "/Users/gellertd/.pyenv/versions/constellation/lib/python3.11/site-packages/memoization/caching/plain_cache.py", line 42, in wrapper
    result = user_function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gellertd/.pyenv/versions/constellation/lib/python3.11/site-packages/py_avro_schema/__init__.py", line 69, in generate
    schema_json = orjson.dumps(schema_dict, option=json_options)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Type is not JSON serializable: Bar

Suggestion

On a PydanticSchema, for each record field check whether the py_type is a pydantic BaseModel subclass and, if so, call default.model_dump(mode="json") on it.

This would then produce this schema:

{
  "type": "record",
  "name": "Foo",
  "fields": [
    {
      "name": "bar",
      "type": {
        "type": "record",
        "name": "Bar",
        "fields": [{ "name": "baz", "type": "long", "default": 0 }],
        "namespace": "__main__",
        "doc": "Usage docs: https://docs.pydantic.dev/2.6/concepts/models/"
      },
      "default": { "baz": 0 }
    }
  ],
  "namespace": "__main__",
  "doc": "Usage docs: https://docs.pydantic.dev/2.6/concepts/models/"
}

which looks alright.

The downside is that pydantic would be needed at runtime, so the import must either be wrapped in error handling or deferred to class/method level.

Let me know what you think about this please. Thanks a lot.

dada-engineer commented 8 months ago

Found a solution without importing pydantic at runtime, as long as you are fine with a # type: ignore. Which I figured is okay, as there are already some in the codebase.

dada-engineer commented 8 months ago

An alternative would be to allow setting orjson.dumps's default callback. According to orjson's docs this makes JSON serialization less performant, though (no problem for our use case, but it might be for others).

dada-engineer commented 8 months ago

Personally I'd prefer to handle this within PydanticSchema class but I'd like to have your opinion on this.

Downside: this cannot handle a dataclass that has an attribute that is a pydantic class with a default. Not sure how common this use case is, or whether I should add handling for this to DataclassSchema as well. 😅

faph commented 8 months ago

ack

dada-engineer commented 8 months ago

@faph any news on this?

faph commented 8 months ago

Hi @dada-engineer

I am just trying to get my head around this and what the expected behavior should be. Reading the spec here: https://avro.apache.org/docs/1.11.1/specification/#schema-record makes me think that we should be following the Avro spec exactly on how field defaults should be serialized. There is only 1 correct serialization for a given type, so that would suggest we should NOT make the serialization configurable.

Instead, we might need to consider using the make_default methods consistently and recursively when generating defaults for a record type?

I have to say, I have not thought about using an entire Pydantic object as a default. I guess the Avro spec supports it, so py-avro-schema should support it too.

faph commented 8 months ago

So the problem here is that PydanticSchema itself does not implement make_default. It probably should, and it should do so recursively by using each field's schema's make_default.
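The recursion could look something like this toy sketch (class names echo py-avro-schema's, but the implementation here is illustrative only, not the library's actual code):

```python
class IntSchema:
    """Toy leaf schema: an Avro long field."""

    def make_default(self, py_default):
        return int(py_default)


class RecordSchema:
    """Toy record schema mapping field names to child schemas."""

    def __init__(self, fields):
        self.fields = fields  # dict of field name -> schema

    def make_default(self, py_default):
        # Recursively delegate each field's default to that field's schema,
        # so nested records serialize all the way down.
        return {
            name: schema.make_default(getattr(py_default, name))
            for name, schema in self.fields.items()
        }


class Bar:
    baz = 0


bar_schema = RecordSchema({"baz": IntSchema()})
print(bar_schema.make_default(Bar()))  # {'baz': 0}
```

Nesting another `RecordSchema` as a field's schema gives the recursive behavior described above.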

dada-engineer commented 8 months ago

Thanks a lot for your feedback. This makes sense yes.

I am not sure if I understand how you want this to be implemented though.

faph commented 8 months ago

Let me push a concept PR that would demonstrate this.

dada-engineer commented 8 months ago

I think I have some idea. I'll close the merge request regarding the orjson default and update the other one. Maybe you can tell me if there are other, more optimal fields to use 👍🏻

faph commented 8 months ago

It's not unusual for complex record fields like in your case to be "nullable" instead of defaulting to a complex/record object. It's up to you whether such a default should be present in the Avro schema itself.

For the Python class, it's not unusual for the class/function signature to default to None, with the field then set to a given object in the body of __init__(). Certainly for plain Python classes, which do not support mutable defaults, you would use this approach.
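For plain Python classes, that pattern looks like this (a generic sketch, not code from this project):

```python
from typing import Optional


class Bar:
    def __init__(self, baz: int = 0):
        self.baz = baz


class Foo:
    # A mutable default in the signature would be shared across instances,
    # so default to None and construct the object inside __init__().
    def __init__(self, bar: Optional[Bar] = None):
        self.bar = bar if bar is not None else Bar()


foo = Foo()
print(foo.bar.baz)  # 0
```

Each `Foo()` gets its own fresh `Bar` instance, avoiding the shared-mutable-default pitfall.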

dada-engineer commented 8 months ago

You want to tell me we should rethink our defaults? 🤔

faph commented 8 months ago

No, no, if that's the right design for you that's fine. Avro and Pydantic support it. Just saying that plain Python classes don't.

faph commented 8 months ago

So, this PR: #67 actually passes your test. But it would not be the correct solution.

Although arguably better than the current implementation. So we can think about whether we should merge this in the meantime like this...

dada-engineer commented 8 months ago

Updated the merge request. Feel free to have a look.

faph commented 8 months ago

That looks good, seems simpler than I expected, which is good!

faph commented 8 months ago

Released as 3.5.0.

Many thanks!

koenlek commented 1 month ago

For anyone interested, I managed to backport this to v2 (which I'm stuck on) with this diff on src/py_avro_schema/_schemas.py:

@@ -522,6 +522,12 @@
             "items": self.items_schema.data(names=names),
         }

+    def make_default(self, py_default: collections.abc.Sequence) -> JSONArray:
+        """Return an Avro schema compliant default value for a given Python Sequence
+        :param py_default: The Python sequence to generate a default value for.
+        """
+        return [self.items_schema.make_default(item) for item in py_default]
+

 class DictSchema(Schema):
     """An Avro map schema for a given Python mapping"""
@@ -841,6 +847,22 @@
         )
         return field_obj

+    def make_default(self, py_default: pydantic.BaseModel) -> JSONObj:
+        """Return an Avro schema compliant default value for a given Python value"""
+        return {key: _schema_obj(self._annotation(key)).make_default(value) for key, value in py_default}
+
+    def _annotation(self, field_name: str) -> Type:
+        """
+        Fetch the raw annotation for a given field name
+
+        Pydantic "unpacks" annotated and forward ref types in their FieldInfo API. We need access to the full, raw
+        annotated type hints instead.
+        """
+        for class_ in self.py_type.mro():
+            if class_.__annotations__.get(field_name):
+                return class_.__annotations__[field_name]
+        raise ValueError(f"{field_name} is not a field of {self.py_type}")  # Should never happen
+
     @staticmethod
     def _py_type(py_field: pydantic.fields.ModelField) -> Any:
         """Return a Python type annotation for a given Pydantic field"""