If you have comments or can explain your changes, please do so below.
According to Marco's performance benchmarking for plotly, the bottleneck for a few functions seems to be our call to `ArrowGroupBy.__iter__`.
Since pyarrow does not natively support iterating over groups, we (pointing the finger at myself here) implemented a rather naive way of allowing it anyway - I remember the use case was enabling scikit-lego to fully support arrow as well.
This PR improves that performance using native arrow methods rather than simple shortcuts. The steps are as follows:

1. Create an array containing the string concatenation of the key values (after casting them to string). Null handling is required.
2. Add this column to the original table.
3. For each group, return the pair of:
   - key values, obtained as the first (and unique) values of the filtered table for the key names;
   - sliced table, obtained by filtering the table on the group and dropping the temporary string-concatenation column.