narwhals-dev / narwhals

Lightweight and extensible compatibility layer between dataframe libraries!
https://narwhals-dev.github.io/narwhals/
MIT License

perf: improve `ArrowGroupBy.__iter__` performances #1334

Closed FBruzzesi closed 2 weeks ago

FBruzzesi commented 2 weeks ago

According to Marco's performance benchmarking for plotly, the bottleneck for a few functions seems to be our call to `ArrowGroupBy.__iter__`.

Since pyarrow does not natively support iterating over groups, we (actually, pointing the finger at myself) implemented a (let's say naive) way of still allowing it. I remember the use case was for scikit-lego to fully support arrow as well.

This PR tries to improve that performance using native arrow methods, without taking simple shortcuts. The steps are as follows: