pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

PERF: Huge regression in `groupby` + `sum` in `dtype_backend == 'pyarrow'`. #53737

Closed pstorozenko closed 1 year ago

pstorozenko commented 1 year ago

Reproducible Example

There's a huge regression (70x) in groupby + sum when using pyarrow as the backend. The file used, wiki100.zip, is a small subset of the wiki clickstream dataset for March 2022; the zip contains a single parquet file. The regression is the same on a larger subset, where the pyarrow-backed frame takes minutes to run compared to seconds for the numpy-backed one.

import pandas as pd

wiki100_np = pd.read_parquet("wiki100.parquet")
wiki100_pa = pd.read_parquet("wiki100.parquet", dtype_backend='pyarrow')
wiki100_pa.info()
<class 'pandas.core.frame.DataFrame'>
Index: 8773 entries, 19883848 to 2047474
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype          
---  ------  --------------  -----          
 0   prev    8773 non-null   string[pyarrow]
 1   curr    8773 non-null   string[pyarrow]
 2   type    8773 non-null   string[pyarrow]
 3   n       8773 non-null   int64[pyarrow] 
dtypes: int64[pyarrow](1), string[pyarrow](3)
memory usage: 576.7 KB
%%timeit
(
    wiki100_np
    .groupby("curr")
    ['n'].sum()
)
# 6.41 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
(
    wiki100_pa
    .groupby("curr")
    ['n'].sum()
)
# 449 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

P.S. Resetting index didn't change anything.

P.S.2 Regression stays the same if I run

(
    wiki100_pa.reset_index(drop=True)
    .groupby("curr")
    .agg({'n': 'sum'})
)

instead.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 965ceca9fd796940050d6fc817707bba1c4f9bff
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.0-43-generic
Version : #44~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon May 22 13:39:36 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.0.2
numpy : 1.24.3
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.3.3
pip : 23.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.14.0
pandas_datareader: None
bs4 : None
bottleneck : 1.3.7
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : 0.57.0
numexpr : 2.8.4
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 12.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Prior Performance

No response

samukweku commented 1 year ago

I don't think groupby operations based on pyarrow have been implemented yet. I'd expect them to eventually be passed on to the Arrow Table implementation.

mroeschke commented 1 year ago

Thanks but this is a duplicate of https://github.com/pandas-dev/pandas/issues/52070.

pandas does not dispatch arrow dtypes to Arrow's groupby implementation yet, but that will probably come in the next few releases.