marcosschroh / dataclasses-avroschema

Generate Avro schemas from Python classes. Code generation from Avro schemas. Serialize/deserialize Python instances with Avro schemas.
https://marcosschroh.github.io/dataclasses-avroschema/
MIT License

Serdes speed-ups #677

Closed cmatache-bamfunds closed 1 month ago

cmatache-bamfunds commented 2 months ago

Fixes #536

marcosschroh commented 2 months ago

Thanks @cmatache-bamfunds

Could you share the improvement of this PR compared to the current version? For example, how much faster (in %) the serdes methods become with this PR. A comparison with fastavro would be great as well.

cmatache-bamfunds commented 2 months ago

Taking this example:

from dataclasses import dataclass
from enum import Enum

from dataclasses_avroschema import AvroModel

class En(Enum):
    a = 'a'
    b = 'b'

@dataclass
class Sch1(AvroModel):
    a: int
    b: float
    c: str
    d: list[str]
    e: En

@dataclass
class Sch2(AvroModel):
    x: dict[str, Sch1]

obj = Sch2(
    x={
        'SBGyKxjfdireQoNwlTnI': Sch1(
            a=2454, b=33083.8861890485, c='swZdoEtJjXUFjnqstXco', d=['lLGQNjeQRxpVGZuLpjZP'], e=En.a,
        ),
        'KgpkxONyPwzqwmZFzcpP': Sch1(
            a=4868, b=-783982970655.952, c='eAscpogeWcubcBndqSAs', d=['tIrpMFcKTfnKwZutHUBj'], e=En.b,
        )
    }
)
ser = obj.serialize()
assert Sch2.deserialize(ser) == obj

Performance before:

%timeit obj.serialize()  # 1.08 ms ± 52.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit Sch2.deserialize(ser)  # 545 µs ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Performance after:

%timeit obj.serialize()  # 90.1 µs ± 2.38 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit Sch2.deserialize(ser)  # 64.2 µs ± 2.43 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

That is, a ~12X speed-up for serialization and a ~8X speed-up for deserialization.
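As a sanity check, the ratios follow directly from the %timeit means quoted above:

```python
# Speed-up ratios computed from the %timeit means quoted above (in seconds).
ser_before, ser_after = 1.08e-3, 90.1e-6
deser_before, deser_after = 545e-6, 64.2e-6

print(round(ser_before / ser_after, 1))      # serialization speed-up, ~12
print(round(deser_before / deser_after, 1))  # deserialization speed-up, ~8.5
```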

I am also exposing the deserialize_to_python class method so that serdes can be sped up even further by switching to pydantic v2: it can replace dacite for object reconstruction and asdict + standardize_custom_type for dict dumps. pydantic v2's core is written in Rust and is much faster at both computations. For example, one can do:

from enum import Enum
from typing import Callable, Any, TypeVar, Type, cast

import pydantic
from pydantic.dataclasses import dataclass as pydantic_dataclass
from dataclasses_avroschema import AvroModel, JsonDict

@pydantic_dataclass(config=pydantic.ConfigDict(use_enum_values=True))
class AvroModelPydanticDataclass(AvroModel):
    # For serialization
    def asdict(self, standardize_factory: Callable[..., Any] | None = None) -> JsonDict:
        return pydantic.RootModel[type(self)].model_dump(self)  # type: ignore[misc,no-any-return]

# For deserialization
AvroT = TypeVar('AvroT', bound=AvroModelPydanticDataclass)

def deserialize_avro(cls: Type[AvroT], serialized: bytes, writer_schema: Type[AvroModel] | None = None) -> AvroT:
    return cast(AvroT, cls(**cls.deserialize_to_python(serialized, writer_schema=writer_schema)))

class En(Enum):
    a = 'a'
    b = 'b'

@pydantic_dataclass
class Sch1(AvroModelPydanticDataclass):
    a: int
    b: float
    c: str
    d: list[str]
    e: En

@pydantic_dataclass
class Sch2(AvroModelPydanticDataclass):
    x: dict[str, Sch1]

obj = Sch2(
    x={
        'SBGyKxjfdireQoNwlTnI': Sch1(
            a=2454, b=33083.8861890485, c='swZdoEtJjXUFjnqstXco', d=['lLGQNjeQRxpVGZuLpjZP'], e=En.a,
        ),
        'KgpkxONyPwzqwmZFzcpP': Sch1(
            a=4868, b=-783982970655.952, c='eAscpogeWcubcBndqSAs', d=['tIrpMFcKTfnKwZutHUBj'], e=En.b,
        )
    }
)
ser = obj.serialize()
assert deserialize_avro(Sch2, ser) == obj

%timeit obj.serialize()  # 58.6 µs ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit deserialize_avro(Sch2, ser)  # 23.7 µs ± 1.86 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

This opens up the possibility for users to achieve a ~18X improvement in serialization and a ~23X improvement in deserialization. Other datatypes (e.g., datetime) should still function properly.
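The end-to-end gains quoted here can be cross-checked against the baseline %timeit means from the first benchmark:

```python
# Baseline means ("Performance before") vs. the pydantic v2 run, in seconds.
ser_base, ser_pyd = 1.08e-3, 58.6e-6
de_base, de_pyd = 545e-6, 23.7e-6

print(round(ser_base / ser_pyd, 1))  # serialization gain, ~18
print(round(de_base / de_pyd, 1))    # deserialization gain, ~23
```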

marcosschroh commented 1 month ago

Thanks @cmatache-bamfunds Could you fix the code? The tests are failing

cristianmatache commented 1 month ago

Thanks @cmatache-bamfunds Could you fix the code? The tests are failing

Hi @marcosschroh , this PR is not introducing any new test failures. This PR had a few false positives in mypy which caused the build to fail. The tests failing in my PR were already failing in master. See https://github.com/marcosschroh/dataclasses-avroschema/actions/runs/10007154620/job/27661267481 image

The reason the build does not fail on those tests is that the exit code of ./scripts/test only depends on the exit code of mypy: https://github.com/marcosschroh/dataclasses-avroschema/blob/38ceb46a7d4343dc03cca58f8e121ca9d7f251c5/scripts/test#L10-L12. That is, it does not matter whether the tests fail: as long as mypy passes, the build (as a whole) will pass.

codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 99.48%. Comparing base (38ceb46) to head (40e4398).

Additional details and impacted files

```diff
@@            Coverage Diff            @@
##           master     #677      +/-  ##
=========================================
  Coverage    99.47%   99.48%
=========================================
  Files           34       34
  Lines         1909     1927      +18
=========================================
+ Hits          1899     1917      +18
  Misses          10       10
```


marcosschroh commented 1 month ago

No @cristianmatache, it does not work that way. If the tests, linting, or mypy fail, the script finishes with a non-zero exit code, so the script fails. Just try a dummy example: add assert False in any test and you will see that the linting and mypy do not run.

Also, the 4 xfailed you see in the image has a purpose: those "failing" tests are marked as expected failures, which I already knew beforehand were failing.
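The fail-fast behavior described above can be sketched with a small subprocess demo, assuming the script runs under bash with errexit (set -e) enabled, so a failing test run aborts the script before any later step executes:

```python
import subprocess

# Under `set -e`, bash aborts at the first failing command, so the steps
# after a failing test run (stood in here by `false`) never execute.
proc = subprocess.run(
    ["bash", "-c", "set -e; false; echo 'mypy would run here'"],
    capture_output=True,
    text=True,
)
print(proc.returncode)    # non-zero: the script stopped at `false`
print(repr(proc.stdout))  # empty: the echo after the failure never ran
```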

marcosschroh commented 1 month ago

Just check the action https://github.com/marcosschroh/dataclasses-avroschema/actions/runs/10009999830/job/27670163667#step:5:164. You can see that it failed because of:

dataclasses_avroschema/utils.py:82: error: Argument 1 to "__call__" of "_lru_cache_wrapper" has incompatible type "type[Any]"; expected "Hashable" [arg-type]
Found 1 error in 1 file (checked 40 source files)

(a change introduced in the PR)
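This class of mypy error can be reproduced with a minimal, hypothetical sketch (schema_name is illustrative, not the library's actual function): functools.lru_cache's wrapper is annotated to accept Hashable arguments, and mypy does not consider type[Any] to satisfy Hashable, even though classes are hashable at runtime:

```python
from functools import lru_cache

# Hypothetical minimal reproduction of the mypy complaint: the lru_cache
# wrapper is typed to take Hashable, and mypy may flag passing a class
# (type[Any]) to it, although classes are perfectly hashable at runtime.
@lru_cache
def schema_name(cls: type) -> str:
    return cls.__name__

print(schema_name(int))  # works at runtime; mypy may flag the call site
```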

cristianmatache commented 1 month ago

You're right, I missed this https://github.com/marcosschroh/dataclasses-avroschema/blob/38ceb46a7d4343dc03cca58f8e121ca9d7f251c5/scripts/test#L1

I know what xfail means, but by "test failures" I thought you meant pytest failures, because the only issues introduced in the PR were mypy failures. Those are more like lint/static checks than actual tests, hence my confusion. Anyway, I added type: ignore comments for the mypy failures.