man-group / ArcticDB

ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.
http://arcticdb.io

Refactor DataFrameNormalizer to improve performance #1964

Open G-D-Petrov opened 3 weeks ago

G-D-Petrov commented 3 weeks ago

Reference Issues/PRs

Fixes #1963

What does this implement or fix?

This fix aims to reduce the number of calls to `make_block`, and thus improve the performance of the post-processing step, when multiple columns of the same type sit next to each other.

Note: there is no improvement when the columns are of different types.
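To illustrate why column order matters here, the following is a minimal sketch of the grouping idea only, not the code in this PR. It assumes that one block can be built per run of adjacent same-typed columns instead of one per column; the helper name `group_adjacent_by_dtype` and the column layouts are hypothetical.

```python
from itertools import groupby


def group_adjacent_by_dtype(dtypes):
    """Collapse consecutive columns sharing a dtype into (dtype, start, stop) runs.

    `stop` is exclusive, so each run covers columns[start:stop].
    """
    runs = []
    position = 0
    for dtype, members in groupby(dtypes):
        length = sum(1 for _ in members)
        runs.append((dtype, position, position + length))
        position += length
    return runs


# Hypothetical column layouts, given as dtype names in column order.
same_type_neighbours = ["float64"] * 4 + ["int64"] * 2
alternating_types = ["float64", "int64"] * 3

# Assumption for illustration: one block is built per run, not per column.
print(len(group_adjacent_by_dtype(same_type_neighbours)))  # 2 runs for 6 columns
print(len(group_adjacent_by_dtype(alternating_types)))     # 6 runs for 6 columns
```

With alternating types every run has length one, so the number of blocks equals the number of columns, which matches the note that mixed-type layouts see no gain.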

Any other comments?

Before the fix, the code from the repro took:

8050444 function calls (7700439 primitive calls) in 4.323 seconds
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.002    0.002    2.822    2.822 _store.py:1831(_post_process_dataframe)

After the fix, it took:

1679935 function calls (1503062 primitive calls) in 2.043 seconds
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.005    0.005    0.594    0.594 _store.py:1831(_post_process_dataframe)
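The figures above are in the format printed by Python's `cProfile` and `pstats` modules. As a rough sketch of how a comparable measurement could be collected, the snippet below profiles a stand-in function; `stand_in_workload` is a hypothetical placeholder and not the repro from #1963.

```python
import cProfile
import io
import pstats


def stand_in_workload():
    # Hypothetical placeholder; the real repro would read a wide DataFrame here.
    return sum(i * i for i in range(100_000))


profiler = cProfile.Profile()
profiler.enable()
stand_in_workload()
profiler.disable()

# Sort by cumulative time so the enclosing function appears near the top,
# and limit the report to the first 10 entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

Comparing the `cumtime` column for the same function before and after a change, as done above for `_post_process_dataframe`, isolates the effect on that step from the rest of the run.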

Checklist

Checklist for code changes...

- [ ] Have you updated the relevant docstrings, documentation and copyright notice?
- [ ] Is this contribution tested against [all ArcticDB's features](../docs/mkdocs/docs/technical/contributing.md)?
- [ ] Do all exceptions introduced raise appropriate [error messages](https://docs.arcticdb.io/error_messages/)?
- [ ] Are API changes highlighted in the PR description?
- [ ] Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?