pandas-dev / pandas


PERF: df.unstack() is 500 times slower since pandas>=2.1 #58391

Open sbonz opened 2 months ago

sbonz commented 2 months ago


Reproducible Example

import pandas as pd
import numpy as np
import time

df = pd.DataFrame(np.random.random(size=(10000, 100)))

st = time.time()
df.unstack()  # this operation takes ~500x longer on pandas>=2.1
print(f"time {time.time() - st}")

Installed Versions

INSTALLED VERSIONS
------------------
commit : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python : 3.11.9.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252

pandas : 2.2.1
numpy : 1.26.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : None
bottleneck : 1.3.7
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : 2.8.7
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Prior Performance

The same code runs about 500x faster on pandas<=2.0.3. The issue happens on Windows and Linux, with Python 3.10 and 3.12, and with both the numpy and pyarrow backends. The slowdown appears to be in the initial loop of the stack_v3 function.
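One way to check where the time goes is to profile the call directly. A minimal sketch, assuming the same 10000x100 random frame as the reproducible example above, that lists the most expensive calls by cumulative time:

import cProfile
import pstats

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random(size=(10000, 100)))

# profile a single unstack() call
with cProfile.Profile() as prof:
    df.unstack()

# show the 15 most expensive calls by cumulative time
pstats.Stats(prof).sort_stats("cumulative").print_stats(15)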

jbrockmendel commented 2 months ago

Cc @rhshadrach

asishm commented 2 months ago

On main it's about 5x faster than on 2.2.2, but still extremely slow compared to 2.0.3:

2.0.3 -> 17 ms
2.2.2 -> 5.4 s
main  -> 1.08 s
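For anyone trying to reproduce these numbers, a small benchmark sketch, assuming the same 10000x100 frame as the original report (this is not the exact script used for the timings above, and absolute times will vary by machine):

import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random(size=(10000, 100)))

# run the operation a few times and keep the best measurement
times = timeit.repeat(df.unstack, number=1, repeat=5)
print(f"pandas {pd.__version__}: best of 5 = {min(times):.3f}s")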

sam-baumann commented 2 months ago

take

sam-baumann commented 2 months ago

Looked into this. In the sample code from the original issue, the df used for testing is just random values rather than the result of a stack(). The following code actually runs 2-3x faster on main than on 2.0.3 on my machine.

Seems like the performance issue only comes up when the df is not in the form expected by unstack(). @sbonz did you see this on real data?

import pandas as pd
import numpy as np
import time

data = np.random.randint(0, 100, size=(100000, 1000))
df = pd.DataFrame(data=data).stack()

st = time.time()
df.unstack()
print(f"time {time.time() - st}")
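To make the contrast concrete, a small sketch, assuming the same 10000x100 frame as the original report, that times unstack() on both shapes of input - the plain RangeIndex frame from the repro and an already-stacked Series:

import time

import numpy as np
import pandas as pd

flat = pd.DataFrame(np.random.random(size=(10000, 100)))  # original repro: plain RangeIndex frame
stacked = flat.stack()                                     # MultiIndexed Series, i.e. the output of stack()

for name, obj in [("flat frame", flat), ("stacked series", stacked)]:
    st = time.time()
    obj.unstack()
    print(f"{name}: {time.time() - st:.3f}s")
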
sbonz commented 2 months ago

@sam-baumann yes, I noticed the slowdown because some tests (with real data) in our pipeline started timing out.

sbonz commented 1 month ago

@sam-baumann I was wondering if you have had a chance to look at this?

sam-baumann commented 1 month ago

Hi @sbonz - I did look a bit further into this, but I think I may have to remove myself from this issue, since I don't think I'm familiar enough with this part of the codebase to be of much more help here. Sorry!

mroeschke commented 1 month ago

@rhshadrach would it make sense to carve out a fastpath in stack_v3 for a homogeneously typed DataFrame with unique columns to just do frame._values.ravel()?

rhshadrach commented 1 month ago

Yea - I think adding a fastpath makes sense. I'm going to make an attempt shortly.

jbrockmendel commented 1 month ago

"just do frame._values.ravel()?"

Just a note on this: consider arr.reshape(-1) since ravel can make a copy in some cases.
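For reference, the equivalence behind the proposed fastpath can be illustrated outside of pandas internals. A minimal sketch, assuming a frame with a single dtype and unique columns - this is not the actual stack_v3 code; it uses the public to_numpy() where internals would use frame._values, and reshape(-1) per the note above:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random(size=(1000, 50)))  # single dtype, unique columns

# what stack() produces for this kind of frame
expected = df.stack()

# candidate shortcut: flatten the values row-major and rebuild the MultiIndex directly
fast = pd.Series(
    df.to_numpy().reshape(-1),
    index=pd.MultiIndex.from_product([df.index, df.columns]),
)

print(expected.equals(fast))  # True for this simple case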