Closed pstorozenko closed 1 year ago
I don't think groupby operations based on pyarrow have been implemented yet. I'd expect that they would probably be passed on to the Arrow Table implementation.
Thanks but this is a duplicate of https://github.com/pandas-dev/pandas/issues/52070.
pandas does not dispatch Arrow dtypes to Arrow's groupby implementation yet, but that will probably come in the next few releases.
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this issue exists on the latest version of pandas.
[ ] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
There's a huge regression (70x) in groupby + sum when using pyarrow as the backend. The file used, wiki100.zip, is a small subset of the wiki clickstream dataset for March 2022. The zip contains a single parquet file. The regression is the same when working with a larger subset, where pyarrow takes minutes to run compared to seconds for numpy-backed Series.
P.S. Resetting the index didn't change anything.
P.S.2 Regression stays the same if I run
instead.
Installed Versions
Prior Performance
No response