Open anneum opened 4 months ago
Hi @anneum

Yes, we should definitely generate proper class attribute names in snake_case. However, we should not use aliases unless the original schema has them. The reason is that we should keep the same schema: if we change the field name and add an alias, then the two schemas (original and generated from the model) will be different. Did you try to use the case functionality?
Following your example, you could do the following (if the field is generated properly):
```python
from dataclasses_avroschema.pydantic import AvroBaseModel
from dataclasses_avroschema import case


class SchemaAvroBenchmark(AvroBaseModel):
    scope_case: str


print(SchemaAvroBenchmark.avro_schema(case_type=case.SPINALCASE))
```

```json
{"type": "record", "name": "SchemaAvroBenchmark", "fields": [{"name": "scope-case", "type": "string"}]}
```
@anneum if you update to the latest version (0.59.1) it will now generate valid python variables.
That's great, thanks for that. I see a little problem when we specify the case. It also overwrites field names that do not match the case.
Original Schema:
```json
{
  "type": "record",
  "name": "SchemaAvroBenchmark",
  "fields": [
    {"name": "scope-case", "type": "string"},
    {"name": "scope_case2", "type": "string"}
  ]
}
```
After `model_generator.render(schema=schema, model_type=ModelType.AVRODANTIC.value)`:
```python
from dataclasses_avroschema.pydantic import AvroBaseModel
from dataclasses_avroschema import case


class SchemaAvroBenchmark(AvroBaseModel):
    scope_case: str
    scope_case2: str


print(SchemaAvroBenchmark.avro_schema(case_type=case.SPINALCASE))
```
After the conversion back into an Avro schema:
```json
{
  "type": "record",
  "name": "SchemaAvroBenchmark",
  "fields": [
    {"name": "scope-case", "type": "string"},
    {"name": "scope-case-2", "type": "string"}
  ]
}
```
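The root of the round-trip problem is that the blanket case conversion is applied to every field name, which is not invertible. A minimal stand-in for the SPINALCASE conversion (an assumption for illustration; the library's exact rules may differ) shows how `scope_case2` picks up an extra separator:

```python
import re


def spinalcase(name: str) -> str:
    # Simplified stand-in for the library's SPINALCASE conversion
    # (assumption: a separator is also inserted before digits,
    # matching the output observed above).
    return re.sub(r"(\d+)", r"-\1", name.replace("_", "-")).lower()


print(spinalcase("scope_case"))   # scope-case   (matches the original name)
print(spinalcase("scope_case2"))  # scope-case-2 (differs from scope_case2)
```

Since `scope-case-2` is not the original name, the generated schema no longer matches the source schema.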
Therefore, I suggest that we use the pydantic field with the `serialization_alias` parameter and, instead of converting it to an alias as part of the schema, use it to store the original name. The `avro_schema()` method should then use the `serialization_alias` (if available) as the field name instead of the attribute name when creating the schema.
```python
from pydantic import Field
from dataclasses_avroschema.pydantic import AvroBaseModel


class SchemaAvroBenchmark(AvroBaseModel):
    scope_case: str = Field(..., serialization_alias="scope-case")
    scope_case2: str


print(SchemaAvroBenchmark.avro_schema())
```
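The lookup itself is straightforward in pydantic v2: each entry in `model_fields` is a `FieldInfo` exposing `serialization_alias`. A sketch of the proposed name resolution (a hypothetical helper, not library code):

```python
from pydantic import BaseModel, Field


class SchemaAvroBenchmark(BaseModel):
    scope_case: str = Field(..., serialization_alias="scope-case")
    scope_case2: str


def avro_field_name(model_cls: type[BaseModel], attr: str) -> str:
    """Prefer the serialization_alias (the original Avro name) over
    the python attribute name, as proposed above."""
    info = model_cls.model_fields[attr]
    return info.serialization_alias or attr


print(avro_field_name(SchemaAvroBenchmark, "scope_case"))   # scope-case
print(avro_field_name(SchemaAvroBenchmark, "scope_case2"))  # scope_case2
```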
As an addition, I think the `serialization_alias` parameter is meant for just such cases: https://docs.pydantic.dev/latest/concepts/fields/#field-aliases
Ok, we can add the `alias` to the generated model. We need to take into account that if the field already has `aliases` defined then it will be a bit weird, since `serialization_alias` is a `string`, not a `list`. Example:
```json
{
  "type": "record",
  "name": "SchemaAvroBenchmark",
  "fields": [
    {
      "aliases": ["my-scope-case", "renamed-scope-case"],
      "name": "scope-case",
      "type": "string"
    }
  ]
}
```
Maybe we should use the `metadata` to add the aliases instead of `serialization_alias`? In this case the generated model should be:
```python
from pydantic import Field
from dataclasses_avroschema.pydantic import AvroBaseModel


class SchemaAvroBenchmark(AvroBaseModel):
    scope_case: str = Field(metadata={"aliases": ["scope-case", ...]})
```
Then when generating the schema from the model, we will have the extra alias "scope-case". Does it work for you? If it does, then it is quite easy to implement.
The metadata field sounds like a good option, but I would use it as a string instead of a list, because per field there can only be one alias whose original name is not usable as an attribute name. So if the field name in the schema is not valid (according to Python rules), we add the original field name as an alias in the metadata and change the field name for the variable label in the pydantic model to a valid name.
The reason I would have liked to use the `serialization_alias` is the ability built into pydantic to dump the object with `.model_dump(by_alias=True)`. This allows me to bring objects directly into the shape needed for the schema. This kills two birds with one stone.
```python
from pydantic import Field
from dataclasses_avroschema.pydantic import AvroBaseModel


class SchemaAvroBenchmark(AvroBaseModel):
    scope_case: str = Field(..., serialization_alias="scope-case")
    scope_case2: str


SchemaAvroBenchmark(scope_case="foo", scope_case2="bar").model_dump(by_alias=True)
# {'scope-case': 'foo', 'scope_case2': 'bar'}
```
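The `by_alias` behavior can be verified with plain pydantic, independent of the avro layer:

```python
from pydantic import BaseModel, Field


class SchemaAvroBenchmark(BaseModel):
    scope_case: str = Field(..., serialization_alias="scope-case")
    scope_case2: str


dumped = SchemaAvroBenchmark(scope_case="foo", scope_case2="bar").model_dump(by_alias=True)
print(dumped)  # {'scope-case': 'foo', 'scope_case2': 'bar'}
```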
Question: are you using `model_dump(by_alias=True)` to generate the payload, then encoding it and sending it to kafka? I am asking because the way to do it is using `serialize`.
```python
from pydantic import Field
from dataclasses_avroschema.pydantic import AvroBaseModel


class SchemaAvroBenchmark(AvroBaseModel):
    scope_case: str = Field(serialization_alias="scope-case")


# serialize to avro-json to send to kafka just to see the fields
# (the same will happen with avro-binary)
benchmark = SchemaAvroBenchmark.fake()
print(benchmark, "\n\n")
# >>> scope_case='FbeDbPMeawuTwxUbhSaY'

# This can be sent to kafka (it is bytes)
ser = benchmark.serialize(serialization_type="avro-json")
# It will produce an event with the field `scope_case` and not with the alias.
print(ser)
# >>> b'{"scope_case": "FbeDbPMeawuTwxUbhSaY"}'

# This is with the alias, but it is not bytes. Do not send to kafka.
print(benchmark.model_dump(by_alias=True))
# >>> {'scope-case': 'FbeDbPMeawuTwxUbhSaY'}
```
I will work on a PR to add the `alias`, working as: the original field name is stored in `serialization_alias`; if the schema field also defines `aliases`, then the python class must contain both `serialization_alias` and `metadata={"aliases": [...]}`.
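What the generated model might then look like, sketched with plain pydantic (using `json_schema_extra` as one possible place to stash the extra aliases; the actual PR may store them differently):

```python
from pydantic import BaseModel, Field


class SchemaAvroBenchmark(BaseModel):
    # Assumption: the original Avro name becomes serialization_alias and
    # the schema's extra aliases are stored as field-level metadata.
    scope_case: str = Field(
        serialization_alias="scope-case",
        json_schema_extra={"aliases": ["my-scope-case", "renamed-scope-case"]},
    )


info = SchemaAvroBenchmark.model_fields["scope_case"]
print(info.serialization_alias)           # scope-case
print(info.json_schema_extra["aliases"])  # ['my-scope-case', 'renamed-scope-case']
```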
**Is your feature request related to a problem? Please describe.**
When I convert an Avro schema whose fields do not match the rules for Python variables to a Pydantic (or Avrodantic) model, I get a non-valid model.
For example, if I convert the following schema, I get an invalid model:
Invalid value due to `-` in the variable name. In my opinion, the `serialization_alias` from `Field` should be used for this. When converting the valid model back to an Avro schema, the `serialization_alias` should be used as the field name.

To summarize: converting from a schema to a model and then back to a schema should result in the same schema.
I am already working on the implementation and getting the right result for such a simple schema (except for importing fields), but would appreciate support. In particular, I am having a lot of trouble with union cases.
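For the renaming step itself, a minimal sketch of turning an Avro field name into a valid Python identifier (a hypothetical helper, not the library's actual implementation):

```python
import keyword


def to_python_name(avro_name: str) -> tuple[str, bool]:
    """Return a valid python attribute name plus a flag saying whether it
    was changed (in which case the original name must be kept as an alias)."""
    candidate = avro_name.replace("-", "_")
    if not candidate.isidentifier():
        # fall back: replace anything else that is invalid and prefix,
        # which also covers names starting with a digit
        candidate = "field_" + "".join(c if c.isalnum() or c == "_" else "_" for c in candidate)
    if keyword.iskeyword(candidate):
        candidate += "_"
    return candidate, candidate != avro_name


print(to_python_name("scope-case"))   # ('scope_case', True)  -> needs an alias
print(to_python_name("scope_case2"))  # ('scope_case2', False) -> keep as is
```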