**Closed** · kevinkjt2000 closed this issue 3 years ago
Hi @kevinkjt2000

Your feature request makes sense. I would prefer not to use default values for this feature (I am trying to move away from default values that define extra properties), so the end user writes less code. What I would like is an extra property in the `Meta` class that defines the names for the nested values, keeping the current behaviour as the default. Something like:
```python
import dataclasses
from dataclasses_avroschema import AvroModel

@dataclasses.dataclass
class Address(AvroModel):
    stuff: str = ""

@dataclasses.dataclass
class Person(AvroModel):
    address: Address

    class Meta:
        nested_field_names = {
            "address": "Address",
            # ...
        }
```
So, if `nested_field_names` is defined in `class Meta`, use it; otherwise use the current behaviour. What do you think?
Looks like more code to me? Consider that I'm already using `dataclasses.field` to populate the initializer (`default_factory`), and there is already a `metadata` parameter. I noticed that this library already reads `metadata` from fields, because I can override lowercase `address_one`'s name with this:
```python
@dataclasses.dataclass
class Person(AvroModel):
    address_one: Address = dataclasses.field(
        default_factory=Address,
        metadata={"name": "address1", "type_name": "Address"},
    )
```
Obviously, this example is a little contrived. I would not intentionally make names be different since that would be confusing to debug later, but I've seen worse things happen in production code and I've experienced some scenarios that have used different names across API boundaries.
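For what it's worth, the `metadata` mapping itself is plain standard-library `dataclasses` behaviour and can be inspected without dataclasses-avroschema at all (how the library consumes `type_name` is out of scope here):

```python
import dataclasses

@dataclasses.dataclass
class Address:
    stuff: str = ""

@dataclasses.dataclass
class Person:
    address_one: Address = dataclasses.field(
        default_factory=Address,
        metadata={"name": "address1", "type_name": "Address"},
    )

# dataclasses.fields() exposes the metadata mapping attached to each field.
(field,) = dataclasses.fields(Person)
assert field.metadata["name"] == "address1"
assert field.metadata["type_name"] == "Address"
```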
Another reason `Meta.nested_field_names` looks like more code is how similar that dictionary is to what is already declared in the dataclass. Actually, it's not just similar; it's exactly the same. To still use `Meta` and shorten what is typed in it, I could see adding an option in `Meta` for a custom name-generator function (i.e. a lambda that takes the field name and CamelCases it, or something), or perhaps a boolean toggle to simply use the class name as is (i.e. just use `__name__` unmodified instead of doing `.lower()` and appending `_record`). Honestly, that last option is what I had expected as the default.
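As a rough illustration of the name-generator idea (the `nested_name_generator` option shown in the comment is hypothetical, not an existing library setting), the generator itself could be as small as:

```python
# Hypothetical helper for the proposed Meta option; not part of the library.
def camel_case(field_name: str) -> str:
    """CamelCase a snake_case field name, e.g. to use it as a record type name."""
    return "".join(part.title() for part in field_name.split("_"))

# A hypothetical Meta option could then accept it:
# class Meta:
#     nested_name_generator = camel_case

print(camel_case("address_one"))  # -> AddressOne
```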
I briefly considered approaching this from the opposite direction too. From what I've read about Avro online, Java classes are generated from the typenames, and by convention those are capitalized. Python has the same convention: class names are capitalized. If I were to take an Avro schema and feed it through a program that generated a Python dataclass, I would expect the typenames to be the names of the generated classes. It would be odd to see a snake-cased class in Python:
```python
class address_record:
    stuff: str = ""
```
Hi @kevinkjt2000

Yes, we use the `metadata` option for adding custom field-level attributes; I did not know that it was also possible to override the nested name using `metadata`.

Your proposal is good, but I do not want end users to start defining default values for each field just to override names or add custom attributes. My idea is to use a configuration style like `pydantic` does, for example: https://pydantic-docs.helpmanual.io/usage/model_config/. It would be nice to involve more people in this decision.
At the beginning I was using `__name__`, but there was a bug with recursive nested fields, I think, so I changed it. We could try to fix that in order to use `__name__` again.
> If I were to take an Avro schema and feed it through a program that generated a Python dataclass, I would expect the typenames to be the name of the generated class

Yes, you are right if you want to perform the reverse conversion. If you are using this library you of course do not need it, but it is a good point.
Long story short, I think it is good that you found a way to change the nested names, even if it is not the nicest way (maybe we should document it). What I would suggest is creating a new issue and debating whether to use `class Meta` for adding custom field-level attributes, redefining nested names, and more, or to keep using the metadata at the field level. What do you think?
> I did not know that it was also possible to override the nested name using the metadata.
It's not. I tried to carefully distinguish typename vs name, but it seems that I made this more confusing. My bad. That's the point of me opening the issue; I can't override the typenames as is.
> Your proposal is good
Which one? Now there are 3 of them (including the original one that was rejected?).
> but I do not want the end users to start defining default values for each field

The `default_factory` parameter is optional. One could do this:
```python
import dataclasses

@dataclasses.dataclass
class Something:
    x: int = dataclasses.field(metadata={"foo": "bar"})
```
Trying to construct `Something()` would result in `TypeError: __init__() missing 1 required positional argument: 'x'`. In other words, you would not be forcing users of this library to initialize a default value. Given the syntax, I realize it looks that way, but since dataclasses generates the `__init__` method, using `dataclasses.field` only to provide metadata does not default any values.
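The claim above can be checked with the standard library alone (self-contained restatement of the snippet above):

```python
import dataclasses

@dataclasses.dataclass
class Something:
    x: int = dataclasses.field(metadata={"foo": "bar"})

# Constructing with no argument fails: metadata alone sets no default,
# so x stays a required positional argument of the generated __init__.
try:
    Something()
except TypeError as exc:
    print(exc)

# The metadata is still attached to the field.
obj = Something(x=1)
assert dataclasses.fields(obj)[0].metadata["foo"] == "bar"
```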
> If you are using this library, of course you do not need it, but it is a good point.
If this library were able to onboard people easily by taking schema as input and outputting the Python class, that would be really sweet. At the moment, I'm manually (and slowly) converting ~30 schemas.
> At the beginning I was using `__name__`, but there was a bug with recursive nested fields, I think, so I changed it. We could try to fix that in order to use `__name__` again.
I'm open to a breaking change of defaulting to the unmodified `__name__` again. I don't have any recursive schemas that would suffer from this.
> What I would suggest is creating a new issue and debating whether to use `class Meta` for adding custom field-level attributes, redefining nested names, and more, or to keep using the metadata at the field level. What do you think?
Hmmm, I can't think of any customizations that would be worth it if `__name__` is used.
Here's another odd thing I just realized: I can assert that the `Address` schema nested within `Person` is actually a different schema (only by name, of course):

```python
assert Address.avro_schema_to_python() == Person.avro_schema_to_python()["fields"][0]["type"]
```
Yeah, the more I think about using `__name__` again, the more I like it. I'll update the title.
Hi @kevinkjt2000

You can take a look at this branch: https://github.com/marcosschroh/dataclasses-avroschema/tree/feat/alias-nested-items

I took a different approach. You can use the attribute `alias_nested_items` in the `class Meta`:
```python
import dataclasses
from dataclasses_avroschema import AvroModel

@dataclasses.dataclass
class Address(AvroModel):
    "An Address"
    street: str
    street_number: int

@dataclasses.dataclass
class User(AvroModel):
    "An User with Address"
    name: str
    age: int
    address: Address  # default name would be address_record

    class Meta:
        alias_nested_items = {"address": "Address"}
```
will generate:
```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "long"},
    {
      "name": "address",
      "type": {
        "type": "record",
        "name": "Address",
        "fields": [
          {"name": "street", "type": "string"},
          {"name": "street_number", "type": "long"}
        ],
        "doc": "An Address"
      }
    }
  ],
  "doc": "An User with Address"
}
```
Without using the property, it will be:
```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "long"},
    {
      "name": "address",
      "type": {
        "type": "record",
        "name": "address_record",
        "fields": [
          {"name": "street", "type": "string"},
          {"name": "street_number", "type": "long"}
        ],
        "doc": "An Address"
      }
    }
  ],
  "doc": "An User with Address"
}
```
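For contrast, the default name in the second schema follows a lowercased-plus-suffix pattern. As a plain sketch of that convention (the helper name here is illustrative, not the library's internal function):

```python
# Illustrative sketch of the default nested-name convention shown above:
# the class name is lowercased and "_record" is appended.
def default_nested_name(class_name: str) -> str:
    return f"{class_name.lower()}_record"

assert default_nested_name("Address") == "address_record"
```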
Reiterating this from the end of my last post: to me, it makes a lot of sense for this assertion to hold true: `assert Address.avro_schema_to_python() == Person.avro_schema_to_python()["fields"][0]["type"]`. In your example that would be `assert Address.avro_schema_to_python() == User.avro_schema_to_python()["fields"][2]["type"]`, which holds true, but why should one have to repeat `x: X` in the same dataclass? This approach doesn't seem DRY to me.

They are the same schema after all, right? Since the class `__name__` is used for the schema type at the top level of a record schema, should nested record schemas do the same by default? Am I wrong in suggesting that a schema should be equivalent to a nested version of itself by default?
In other words,

```python
class Foo(AvroModel):
    pass

class Container(AvroModel):
    foo: Foo
```

for all `Foo` that are nested inside some `Container`, should the corresponding `Foo` subschema from `Container` be equivalent to `Foo`'s schema?
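Using the two schemas printed earlier in the thread, the proposed invariant can be written as a plain dict comparison (both dicts are copied from the `alias_nested_items` example above):

```python
# Standalone Address schema, as Address.avro_schema_to_python() would return it.
address_schema = {
    "type": "record",
    "name": "Address",
    "fields": [
        {"name": "street", "type": "string"},
        {"name": "street_number", "type": "long"},
    ],
    "doc": "An Address",
}

# User schema with Address nested, per the alias_nested_items example above.
user_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "long"},
        {
            "name": "address",
            "type": {
                "type": "record",
                "name": "Address",
                "fields": [
                    {"name": "street", "type": "string"},
                    {"name": "street_number", "type": "long"},
                ],
                "doc": "An Address",
            },
        },
    ],
    "doc": "An User with Address",
}

# The invariant under discussion: the standalone schema equals the nested one.
assert address_schema == user_schema["fields"][2]["type"]
```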
Hi @kevinkjt2000

Thanks for your feedback. For now I think this is good enough; the new feature covers your request. I will try to publish a new version ASAP.
Thanks for doing this! The feature would definitely cover the original ask, but defaulting to `__name__` would eliminate my need to customize record type names in the first place. I renamed the issue because of how much I liked the idea of changing the default back to the undecorated `__name__`. `Meta.alias_nested_items` seems reasonable to me if you want to be more aligned with Pydantic principles and less dependent on dataclasses.

I'd begrudgingly type out everything twice in my case to use the feature, and wish for `__name__` as the more reasonable default. Existing libraries in other languages have exactly this behavior, especially the ones that generate code from models, since CamelCased class names are generally standard. Code -> Schema -> Code -> Schema -> Code is a transition I would like to be able to make without seeing `_record_record` tacked on in places (by default). I'm also concerned about someone accidentally forgetting to type the `Meta` class. Imagine finding out later that someone wrote code that iterates over the schema hoping to find the name of the class as the record type name (a completely reasonable assumption if all the other schemas are like that). To their surprise, they would need to search for `{class_name.lower()}_record` or add the missing duplicate entry to `Meta.alias_nested_items`. Before, they would have searched for `{class_name}`, because that's how all the Avro schemas are at the moment; at least, that's how it is on this project and in the examples I've seen in my limited Googling of Avro schema libraries for other languages. (Avro is new to me, so establishing what these standards are is something I'm still learning.)
Hopefully, some others can weigh in with their own opinions on whether they would want the record type names to match in the ways I've pointed out in previous comments: top-level records are named with `__name__`, but nesting them suddenly changes this to `{__name__.lower()}_record` or `{__name__.lower()}_{record_name}_record` based on a conditional.
**Is your feature request related to a problem? Please describe.**
To ease the burden of converting existing schemas into dataclass format, it would be nice to have the generated schema be an exact match. Mismatching names on existing schemas can lead to the tremendous task of updating names in many places for a widely used schema in a large codebase.

**Describe the solution you'd like**
Looking at the documentation for dataclasses, I noticed that there is a `metadata` parameter on `dataclasses.field` that would probably be perfect for this. I'm picturing something like:

Instead of

the generated schema would be

**Describe alternatives you've considered**
Keeping what is already implemented in the form of hand-coded Avro schema dictionaries 😢 Or patching the overrides in:

**Additional context**
Record type names are already generated by the library, from what I can tell:
https://marcosschroh.github.io/dataclasses-avroschema/schema_relationships/#avoid-name-colision-in-multiple-relationships
https://github.com/marcosschroh/dataclasses-avroschema/blob/9fb45689d5138c48b392b38283e8771c97ef501f/dataclasses_avroschema/fields.py#L643-L647