godatadriven / pydantic-avro

This library can convert a pydantic class to a avro schema or generate python code from a avro schema.
https://github.com/godatadriven/pydantic-avro
MIT License
60 stars 30 forks source link

avsc_to_pydantic issue with nullable nested schemas #112

Open myjniak opened 2 months ago

myjniak commented 2 months ago

When encountering nested schemas, (type "record"), avsc_to_pydantic gives pydantic class attributes and the class name for that attribute exactly the same name. According to this comment, pydantic might not support that: https://github.com/pydantic/pydantic/issues/7871#issuecomment-1770958931 you can't name a field to have the same name as a model

Here's a code to reproduce this:

from pydantic_avro.avro_to_pydantic import avsc_to_pydantic

schema = {
    "type": "record",
    "name": "my_schema",
    "namespace": "my_namespace",
    "fields": [
        {
            "name": "nested",
            "type": [
                "null",
                {
                    "type": "record",
                    "name": "nested",
                    "namespace": "my_namespace.nested",
                    "fields": [
                        {
                            "name": "my_field",
                            "type": ["null", "string"],
                            "default": None
                        }
                    ]
                }
            ],
            "default": None
        }
    ]
}

pydantic_class: str = avsc_to_pydantic(schema)
print(pydantic_class)

Then we get this in return:

from datetime import date, datetime, time
from decimal import Decimal
from enum import Enum
from typing import List, Optional, Dict, Union
from uuid import UUID

from pydantic import BaseModel, Field

class nested(BaseModel):
    my_field: Optional[str] = None

class my_schema(BaseModel):
    nested: Optional[nested] = None

which will result in an exception, when instantiated:

my_schema(nested={"my_field": "value"})
Traceback (most recent call last):
  File "C:\Users\myjni\PycharmProjects\brudnopis\pydanticavrobug.py", line 52, in <module>
    pydantic_instance = my_schema(nested={"my_field": "value"})
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\myjni\PycharmProjects\brudnopis\.venv\Lib\site-packages\pydantic\main.py", line 171, in __init__
    self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for my_schema
nested
  Input should be None [type=none_required, input_value={'my_field': 'value'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.6/v/none_required

The submodel class names should probably be transparent for the avsc_to_pydantic user, so my proposition to solve this problem would be to add a "Model" postfix to pydantic submodel class names:

class nestedModel(BaseModel):
    my_field: Optional[str] = None

class my_schema(BaseModel):
    nested: Optional[nestedModel] = None

If the class names are important for some reason though, then maybe field aliasing could help? Field(alias="something") Let me know what do you think :)

myjniak commented 1 month ago

I am open to make a Pull request for this as long as I have your opinion and guidelines :)

myjniak commented 1 month ago

After some more investigation I think the problem can be more generalized to repetition of field names and class names. Here's an example of class name overlap:

{
    "type": "record",
    "name": "possesions",
    "namespace": "my_namespace",
    "fields": [
        {
            "name": "car",
            "type": [
                "null",
                {
                    "type": "record",
                    "name": "car",
                    "fields": [
                        {
                            "name": "id",
                            "type": [
                                "null",
                                {
                                    "type": "record",
                                    "name": "id",
                                    "fields": [
                                        {
                                            "name": "vin",
                                            "type": ["null", "string"],
                                            "default": null
                                        },
                                        {
                                            "name": "vrn",
                                            "type": ["null", "string"],
                                            "default": null
                                        }
                                    ]
                                }
                            ],
                            "default": null
                        }
                    ]
                }
            ],
            "default": null
        },
        {
            "name": "house",
            "type": [
                "null",
                {
                    "type": "record",
                    "name": "house",
                    "fields": [
                        {
                            "name": "id",
                            "type": [
                                "null",
                                {
                                    "type": "record",
                                    "name": "id",
                                    "fields": [
                                        {
                                            "name": "land_and_mortgage_register_number",
                                            "type": ["null", "string"],
                                            "default": null
                                        }
                                    ]
                                }
                            ],
                            "default": null
                        }
                    ]
                }
            ],
            "default": null
        }
    ]
}

My idea to fix this would be to change class naming scheme to contain parent names as well. This should assure name uniqueness. In the above example we could generate this:

class possesions_car_id(BaseModel):
    vin: Optional[str] = None
    vrn: Optional[str] = None

class possesions_car(BaseModel):
    id: Optional[possesions_car_id] = None

class possesions_house_id(BaseModel):
    land_and_mortgage_register_number: Optional[str] = None

class possesions_house(BaseModel):
    id: Optional[possesions_house_id] = None

class possesions(BaseModel):
    car: Optional[possesions_car] = None
    house: Optional[possesions_house] = None

while currently the 2nd section named "id" overwrites the 1st one:

class id(BaseModel):
    land_and_mortgage_register_number: Optional[str] = None

class car(BaseModel):
    id: Optional[id] = None

class house(BaseModel):
    id: Optional[id] = None

class possesions(BaseModel):
    car: Optional[car] = None
    house: Optional[house] = None
dan1elt0m commented 4 weeks ago

Hey @myjniak, Thanks for opening this issue and the clear description! Models currently have to have a unique name, because they are referred to by their name. This is necessary as the models are exported to a single file. In addition, as you mentioned, Pydantic doesn't support having two different submodels with the same name in a single model.

So what to do about it? I agree, we shouldn't just ignore the bug.

I don't see how aliasing would solve the issue. So a fix indeed could be to rename non-unique model names. Problem is that we don't know upfront which names collide. In addition, current implementation cycles through fields from top level to bottom and fields don't store information about higher level fields. I got it locally to work with just comparing current model with existing model in the class registry and adding a hash to class name based on its children if it's different.

As I see it, To add parent info to the model name would require a preprocess iteration over the Avro schema. In this iterations the colliding classes are identified along with their parents. I welcome a PR to implement this. I can provide my WIP if you'd like.

Before this is finished, I think pydantic-avro should probably raise an error on name collision.