marcosschroh / dataclasses-avroschema

Generate avro schemas from python classes. Code generation from avro schemas. Serialize/Deserialize python instances with avro schemas
https://marcosschroh.github.io/dataclasses-avroschema/
MIT License

Improve performance of serdes methods #536

Closed. cristianmatache closed this issue 3 weeks ago

cristianmatache commented 6 months ago

Is your feature request related to a problem? Please describe.

Serializing via the library is orders of magnitude slower than using fastavro directly. Over 80% of the total time is spent outside of the actual fastavro serdes calls. There are a few simple reasons:

- re-generating the schema(s) every time it serializes and deserializes is inefficient

https://github.com/marcosschroh/dataclasses-avroschema/blob/8150aaab43d8b9f12af64bd96da5f6af5517b882/dataclasses_avroschema/schema_generator.py#L122-L123 https://github.com/marcosschroh/dataclasses-avroschema/blob/8150aaab43d8b9f12af64bd96da5f6af5517b882/dataclasses_avroschema/schema_generator.py#L131-L144

- re-computing the dacite config every time it deserializes (which in turn re-generates the schema) is inefficient

https://github.com/marcosschroh/dataclasses-avroschema/blob/8150aaab43d8b9f12af64bd96da5f6af5517b882/dataclasses_avroschema/schema_generator.py#L150 https://github.com/marcosschroh/dataclasses-avroschema/blob/8150aaab43d8b9f12af64bd96da5f6af5517b882/dataclasses_avroschema/schema_generator.py#L157-L158 https://github.com/marcosschroh/dataclasses-avroschema/blob/8150aaab43d8b9f12af64bd96da5f6af5517b882/dataclasses_avroschema/schema_generator.py#L173-L181

- re-evaluating whether a class is a pydantic model repeatedly when serializing is inefficient

https://github.com/marcosschroh/dataclasses-avroschema/blob/8150aaab43d8b9f12af64bd96da5f6af5517b882/dataclasses_avroschema/schema_generator.py#L122-L126

https://github.com/marcosschroh/dataclasses-avroschema/blob/8150aaab43d8b9f12af64bd96da5f6af5517b882/dataclasses_avroschema/utils.py#L67-L77 https://github.com/marcosschroh/dataclasses-avroschema/blob/8150aaab43d8b9f12af64bd96da5f6af5517b882/dataclasses_avroschema/utils.py#L17-L20
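For the last point, one option is to memoize the per-class check so it runs once per class instead of once per call. The snippet below is only a sketch: `BaseModel` is a local stand-in for pydantic's base class, and `is_model_class` and `CHECK_CALLS` are hypothetical names used for illustration, not the library's actual helpers.

```python
from functools import lru_cache

# Hypothetical stand-in for pydantic.BaseModel so the sketch runs without pydantic.
class BaseModel:
    pass

# Counts how often the real check body executes (illustration only).
CHECK_CALLS = {"n": 0}

@lru_cache(maxsize=None)
def is_model_class(klass: type) -> bool:
    """Hypothetical helper: the result is cached per class by lru_cache."""
    CHECK_CALLS["n"] += 1
    return issubclass(klass, BaseModel)

class User(BaseModel):
    pass

class Plain:
    pass

# Simulate many serialize() calls hitting the check for the same classes.
for _ in range(1000):
    is_model_class(User)
    is_model_class(Plain)

print(CHECK_CALLS["n"])  # the check body ran once per distinct class
```

Classes are hashable, so they can be used directly as `lru_cache` keys. The cache keeps a reference to each class it has seen, which is usually acceptable for model classes defined at import time.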

Describe the solution you'd like

My suggestion would be:

Describe alternatives you've considered

I think the simplest implementation would be caching the schema and the config in some global mappings (these mappings would be populated the first time serialize or deserialize is called on a given class). However, it may be possible to cache the class's own resolved schema and/or the dacite config as new class attributes. This way, we don't interfere with the other usages of avro_schema_to_python (especially the ones that have a parent).
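Both caching options can be sketched roughly as follows. This is an illustration under assumptions, not the library's code: `build_schema` is a hypothetical stand-in for the expensive schema generation (it maps every field to "string" for brevity), and `cached_schema_global`, `CachedSchemaMixin`, `_SCHEMA_CACHE`, and `BUILD_CALLS` are made-up names.

```python
import dataclasses
from typing import Any, Dict

# Counts how often the "expensive" generation runs (illustration only).
BUILD_CALLS = {"n": 0}

def build_schema(cls: type) -> Dict[str, Any]:
    """Hypothetical stand-in for the costly schema generation step."""
    BUILD_CALLS["n"] += 1
    return {
        "type": "record",
        "name": cls.__name__,
        # Simplification: every field is treated as an Avro string.
        "fields": [{"name": f.name, "type": "string"} for f in dataclasses.fields(cls)],
    }

# Option 1: a global mapping, filled the first time a class is seen.
_SCHEMA_CACHE: Dict[type, Dict[str, Any]] = {}

def cached_schema_global(cls: type) -> Dict[str, Any]:
    if cls not in _SCHEMA_CACHE:
        _SCHEMA_CACHE[cls] = build_schema(cls)
    return _SCHEMA_CACHE[cls]

# Option 2: store the resolved schema as an attribute on the class itself.
class CachedSchemaMixin:
    @classmethod
    def cached_schema(cls) -> Dict[str, Any]:
        # Check cls.__dict__ rather than getattr/hasattr so that a subclass
        # does not pick up the schema cached on its base class.
        if "_cached_schema" not in cls.__dict__:
            cls._cached_schema = build_schema(cls)
        return cls.__dict__["_cached_schema"]

@dataclasses.dataclass
class User(CachedSchemaMixin):
    name: str

@dataclasses.dataclass
class Admin(User):
    team: str = "ops"

User.cached_schema()
User.cached_schema()            # second call reuses the stored schema
print(BUILD_CALLS["n"])         # schema was built once for User
print(Admin.cached_schema()["name"])
```

The same pattern would apply to the dacite config. The class-attribute variant keeps the cache next to the class it describes, while the global mapping avoids adding attributes to user-defined classes; which trade-off fits the library better is a judgment call for the maintainers.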

marcosschroh commented 6 months ago

Hi @cristianmatache

Definitely we need to improve it. Do you want to send a PR?

PS: Could you share the numbers and test cases when comparing with fastavro? Ideally, we should have almost the same speed because we use it as the backend.
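A comparison like this could be collected with a small timing harness such as the sketch below. It uses only the standard library; `compare` is a hypothetical helper, and the two `json` candidates are placeholders standing in for the real callables (a call to the model's serialize method and a direct fastavro write with a pre-parsed schema), so the printed numbers say nothing about either library.

```python
import json
import timeit
from typing import Callable, Dict

def compare(
    candidates: Dict[str, Callable[[], object]],
    number: int = 10_000,
    repeat: int = 5,
) -> Dict[str, float]:
    """Return the best observed per-call time in seconds for each candidate."""
    results: Dict[str, float] = {}
    for name, fn in candidates.items():
        # timeit.repeat accepts a zero-argument callable as the statement.
        best_total = min(timeit.repeat(fn, number=number, repeat=repeat))
        results[name] = best_total / number
    return results

# Placeholder workloads: swap these lambdas for the real serialization calls.
record = {"name": "john", "age": 30}
timings = compare(
    {
        "placeholder_a": lambda: json.dumps(record),
        "placeholder_b": lambda: json.dumps(record, sort_keys=True),
    },
    number=1_000,
)

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.2f} microseconds per call")
```

Taking the minimum over several repeats reduces noise from other processes, which matters when the per-call times being compared are in the microsecond range.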

cristianmatache commented 5 months ago

Hi @marcosschroh, doing the above would bring the performance closer to fastavro.

As for submitting a PR, getting approval to contribute from my workplace may take far longer than changing the few lines of code involved, so I cannot commit to a timeline yet.

marcosschroh commented 5 months ago

No problem. Next week I will try to make some time to fix it. If something changes in the meantime, let me know.

cristianmatache commented 5 months ago

Thank you!