marcosschroh / dataclasses-avroschema

Generate avro schemas from python dataclasses, Pydantic models and Faust Records. Code generation from avro schemas. Serialize/Deserialize python instances with avro schemas.
https://marcosschroh.github.io/dataclasses-avroschema/
MIT License

Use Python class __name__ for nested record typenames. #100

Closed · kevinkjt2000 closed this issue 3 years ago

kevinkjt2000 commented 3 years ago

Is your feature request related to a problem? Please describe.
To ease the burden of converting existing schemas into dataclass format, it would be nice to have the generated schema be an exact match. Mismatched names on existing schemas can lead to the tremendous task of updating names in many places for a widely used schema in a large codebase.

Describe the solution you'd like
Looking at the documentation for dataclasses, I noticed that there is a metadata parameter on dataclasses.field that would probably be perfect for this. I'm picturing something like:

import dataclasses
from dataclasses_avroschema import AvroModel

@dataclasses.dataclass
class Address(AvroModel):
    stuff: str = ""

@dataclasses.dataclass
class Person(AvroModel):
    address: Address = dataclasses.field(default_factory=Address, metadata={"type_name": "Address"}) # I'd also be fine with record_type_name or something similar

Instead of

{'type': 'record',
 'name': 'Person',
 'fields': [{'name': 'address',
   'type': {'type': 'record',
    'name': 'address_record',
    'fields': [{'name': 'stuff', 'type': 'string', 'default': ''}],
    'doc': "Address(stuff: str = '')"}}],
 'doc': 'Person(address: __main__.Address = <factory>)'}

the generated schema would be

{'type': 'record',
 'name': 'Person',
 'fields': [{'name': 'address',
   'type': {'type': 'record',
    'name': 'Address',
    'fields': [{'name': 'stuff', 'type': 'string', 'default': ''}],
    'doc': "Address(stuff: str = '')"}}],
 'doc': 'Person(address: __main__.Address = <factory>)'}

Describe alternatives you've considered
Keeping what is already implemented, in the form of hand-coded Avro schema dictionaries 😢, or patching the overrides in:

person_schema = Person.avro_schema_to_python()
person_schema["fields"][0]["type"]["name"] = "Address"  # Of course, looping over the array to find the fields whose nested type has the _record name is what I would do instead

Additional context
Record type names are already generated by the library, from what I can tell:
https://marcosschroh.github.io/dataclasses-avroschema/schema_relationships/#avoid-name-colision-in-multiple-relationships
https://github.com/marcosschroh/dataclasses-avroschema/blob/9fb45689d5138c48b392b38283e8771c97ef501f/dataclasses_avroschema/fields.py#L643-L647

marcosschroh commented 3 years ago

Hi @kevinkjt2000

Your feature request makes sense. I would prefer not to use a default value for this feature (I am trying to move away from default values that define extra properties), so the end user writes less code.

What I would like is to have an extra property in the Meta class that defines the names for the nested values, keeping the current behaviour as default. Something like:

import dataclasses
from dataclasses_avroschema import AvroModel

@dataclasses.dataclass
class Address(AvroModel):
    stuff: str = ""

@dataclasses.dataclass
class Person(AvroModel):
    address: Address

    class Meta:
        nested_field_names = {
           "address": "Address",
            ...
        }

So, if nested_field_names is defined in class Meta, use it; otherwise, use the current behaviour.

what do you think?

kevinkjt2000 commented 3 years ago

Doesn't that look like more code, though? I'm already using dataclasses.field to populate the initializer (default_factory), and there is already a metadata parameter. I notice that this library already uses metadata from fields, because I can override the lowercase address_one field's name with this:

@dataclasses.dataclass
class Person(AvroModel):
    address_one: Address = dataclasses.field(default_factory=Address, metadata={"name": "address1", "type_name": "Address"})

Obviously, this example is a little contrived. I would not intentionally make names different, since that would be confusing to debug later, but I've seen worse things happen in production code, and I've encountered scenarios that use different names across API boundaries.

Another reason Meta.nested_field_names looks like more code is that the dictionary looks so similar to what is already declared in the dataclass. Actually, it's not just similar, it's exactly the same. To still use Meta and shorten what is typed in it, I could see adding an option in Meta for a custom name generator function (i.e. a lambda that takes the field name and CamelCases it or something), or perhaps a boolean toggle to simply use the class name as is (i.e. just use __name__ unmodified instead of doing .lower() and adding _record). Honestly, that last option is what I had expected as the default.
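
(The kind of name generator I have in mind is as simple as this plain function; it is purely illustrative and not tied to any existing Meta option:)

def camel_case_type_name(field_name: str) -> str:
    # "address_one" -> "AddressOne"; hypothetical helper, not a library API
    return "".join(part.title() for part in field_name.split("_"))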

I briefly considered what it would be like to approach this from the opposite direction, too. From what I've read about Avro online, Java classes are generated from the typenames, and by convention those are capitalized. Python has that same convention, where class names are capitalized. If I were to take an Avro schema and feed it through a program that generated a Python dataclass, I would expect the typenames to be the names of the generated classes. It would be odd to see a snake-cased class in Python:

class address_record:
    stuff: str = ""

marcosschroh commented 3 years ago

Hi @kevinkjt2000

Yes, we are using the metadata option in order to support Adding Custom Field-level Attributes. I did not know that it was also possible to override the nested name using the metadata.

Your proposal is good, but I do not want end users to start defining default values for each field in order to override names or add custom attributes. My idea is to use a configuration style, like Pydantic does, for example: https://pydantic-docs.helpmanual.io/usage/model_config/. It would be nice to involve more people in this decision.
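
(For reference, the configuration style I mean looks roughly like this in Pydantic v1; the example is illustrative only:)

from pydantic import BaseModel

class User(BaseModel):
    name: str

    class Config:
        title = "UserModel"  # model-level configuration, Pydantic v1 style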

At the beginning I was using __name__, but there was a bug with recursive nested fields, I think, so I changed it. We could try to fix that in order to use __name__ again.

If I were to take an Avro schema and feed it through a program that generated a Python dataclass, I would expect the typenames to be the name of the generated class

Yes, you are right if you want to go in the reverse direction. If you are using this library, of course, you do not need it, but it is a good point.

Long story short, I think it is good that you found a way to change the nested names, even if it is not the nicest way (maybe we should document it). What I would suggest is creating a new issue to debate using the class Meta for Adding Custom Field-level Attributes, redefining nested names, and more, versus keeping the metadata at the field level. What do you think?

kevinkjt2000 commented 3 years ago

I did not know that also was possible to override the nested name using the metadata.

It's not. I tried to carefully distinguish typename vs name, but it seems that I made this more confusing. My bad. That's the point of me opening the issue; I can't override the typenames as is.
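
(In schema terms, the distinction is between these two keys in the generated dict:)

# the field name, which metadata={"name": ...} can already override:
Person.avro_schema_to_python()["fields"][0]["name"]           # "address"
# the nested record's type name, which is what this issue is about:
Person.avro_schema_to_python()["fields"][0]["type"]["name"]   # "address_record"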

Your proposal is good

Which one? Now there are 3 of them (including the original one that was rejected?).

but I do not want end users to start defining default values for each field

The default_factory parameter is optional. One could do this:

import dataclasses

@dataclasses.dataclass
class Something:
    x: int = dataclasses.field(metadata={"foo": "bar"})

Trying to construct Something() would result in TypeError: __init__() missing 1 required positional argument: 'x'. In other words, you would not be forcing users of this library to initialize a default value. Given the syntax, I realize it looks that way, but since dataclasses generates the __init__ method, using dataclasses.field only to provide metadata does not default any values.
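
(Continuing the snippet above with a quick, illustrative check of that behaviour:)

print(dataclasses.fields(Something)[0].metadata)  # mappingproxy({'foo': 'bar'}): the metadata is kept
try:
    Something()  # TypeError: __init__() missing 1 required positional argument: 'x'
except TypeError as exc:
    print(exc)
Something(x=1)  # an explicit value is still required, just as with a plain annotation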

If you are using this library, of course you do not need it, but it is a good point.

If this library were able to onboard people easily by taking a schema as input and outputting the Python class, that would be really sweet. At the moment, I'm manually (and slowly) converting ~30 schemas.

At the beginning I was using __name__, but there was a bug with recursive nested fields, I think, so I changed it. We could try to fix that in order to use __name__ again.

I'm open to a breaking change of defaulting to the unmodified __name__ again. I don't have any recursive schemas that would suffer from this.

What I would suggest is creating a new issue to debate using the class Meta for Adding Custom Field-level Attributes, redefining nested names, and more, versus keeping the metadata at the field level. What do you think?

Hmmm, I can't think of any customizations that would be worth it if __name__ is used.

Here's another odd thing I just realized: the Address schema nested within Person is actually a different schema (differing only by name, of course), so this assertion fails:

assert Address.avro_schema_to_python() == Person.avro_schema_to_python()["fields"][0]["type"]

Yeah, the more I think about using __name__ again, the more I like it. I'll update the title.

marcosschroh commented 3 years ago

Hi @kevinkjt2000

You can take a look at this branch: https://github.com/marcosschroh/dataclasses-avroschema/tree/feat/alias-nested-items

I took a different approach: you can use the attribute alias_nested_items in the class Meta:

import dataclasses
from dataclasses_avroschema import AvroModel

@dataclasses.dataclass
class Address(AvroModel):
    "An Address"
    street: str
    street_number: int

@dataclasses.dataclass
class User(AvroModel):
    "An User with Address"
    name: str
    age: int
    address: Address  # nested type name defaults to address_record

    class Meta:
        alias_nested_items = {"address": "Address"}

will generate:

{
    "type": "record",
    "name": "User",
    "fields": [
        {
            "name": "name",
            "type": "string"
        },
        {
            "name": "age",
            "type": "long"
        },
        {
            "name": "address",
            "type": {
                "type": "record",
                "name": "Address",
                "fields": [
                    {
                        "name": "street",
                        "type": "string"
                    },
                    {
                        "name": "street_number",
                        "type": "long"
                    }
                ],
                "doc": "An Address"
            }
        } 
    ],
    "doc": "An User with Address"
}

Without using the property, the schema will be:

{
    "type": "record",
    "name": "User",
    "fields": [
        {
            "name": "name",
            "type": "string"
        },
        {
            "name": "age",
            "type": "long"
        },
        {
            "name": "address",
            "type": {
                "type": "record",
                "name": "address_record",
                "fields": [
                    {
                        "name": "street",
                        "type": "string"
                    },
                    {
                        "name": "street_number",
                        "type": "long"
                    }
                ],
                "doc": "An Address"
            }
        } 
    ],
    "doc": "An User with Address"
}

kevinkjt2000 commented 3 years ago

Reiterating this from the end of my last post: to me, it makes a lot of sense for this assertion to hold true: assert Address.avro_schema_to_python() == Person.avro_schema_to_python()["fields"][0]["type"]. In your example, that would be assert Address.avro_schema_to_python() == User.avro_schema_to_python()["fields"][2]["type"], which holds true, but why should one have to repeat x: X in the same dataclass? This approach doesn't seem DRY to me.

They are the same schema after all, right? Since the class __name__ is used for the schema type at the top-level of a record schema, should nested record schemas do the same by default? Am I wrong in suggesting that a schema should be equivalent to a nested version of itself by default?

In other words,

class Foo(AvroModel):
    pass

class Container(AvroModel):
    foo: Foo

for all Foo that are nested inside some Container, should the corresponding Foo subschema from Container be equivalent to Foo's schema?
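
(Spelled out as code, this is the invariant I'd expect to hold by default; today it fails, since the nested type name defaults to foo_record:)

assert Foo.avro_schema_to_python() == Container.avro_schema_to_python()["fields"][0]["type"]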

marcosschroh commented 3 years ago

Hi @kevinkjt2000

Thanks for your feedback. For now, I think this is good enough; the new feature covers your request. I will try to publish a new version ASAP.

kevinkjt2000 commented 3 years ago

Thanks for doing this! The feature would definitely cover the original ask, but defaulting to __name__ would eliminate my need to customize record type names in the first place. I renamed the issue because of how much I liked the idea of changing the default back to using the undecorated __name__. Meta.alias_nested_items seems reasonable to me if you want to stay more aligned with Pydantic principles and less dependent on dataclasses.

I'd begrudgingly type everything out twice in my case to use the feature, while wishing for __name__ as a more reasonable default. Existing libraries in other languages have exactly this behavior, especially the ones that generate code from models, since CamelCased class names are generally standard. Code -> Schema -> Code -> Schema -> Code is a transition I would like to be able to make without seeing _record_record tacked on in places (by default). I'm also concerned about someone accidentally forgetting to add the Meta class. Imagine finding out later that someone wrote code that iterates over the schema, expecting the record type name to be the class name (a completely reasonable assumption if all the other schemas are like that). To their surprise, they would need to search for {class_name.lower()}_record or add the missing duplicate entry to Meta.alias_nested_items, whereas before they would have searched for {class_name}, because that's how all the Avro schemas look at the moment. At least, that's how it is on this project and in the examples I've seen in my limited Googling of Avro schema libraries for other languages. (Avro is new to me, so I'm still learning what these standards are.)

Hopefully, some others will weigh in with their own opinions on whether they would want the record type names to match in the ways I've pointed out in previous comments.

marcosschroh commented 3 years ago

Closed by https://github.com/marcosschroh/dataclasses-avroschema/pull/110