Open sbonz opened 2 months ago
Cc @rhshadrach
On main it's about 5x faster than on 2.2.2, but still extremely slow compared to 2.0.3:
2.0.3 -> 17 ms, 2.2.2 -> 5.4 s, main -> 1.08 s
take
Looked into this. In the sample code from the original issue, the df being used for testing is just random values, rather than the result of a stack(). The following code actually runs 2-3x faster on main than 2.0.3 on my machine.
Seems like the performance issue only comes up when the df is not in the form expected by unstack(). @sbonz did you see this on real data?
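import time

import numpy as np
import pandas as pd

# Build the input with stack(), so it has the MultiIndex form unstack() expects.
data = np.random.randint(0, 100, size=(100000, 1000))
df = pd.DataFrame(data=data).stack()

# Time only the unstack() call.
st = time.time()
df.unstack()
print(f"time {time.time() - st}")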
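For readers following along, here is a minimal sketch of the two call patterns being compared. It relies only on documented behaviour: calling unstack() on a DataFrame whose index is not a MultiIndex returns a Series keyed by (column, row), while unstack() on the result of stack() restores the original frame. The tiny frame and the variable names are made up for illustration.

```python
import pandas as pd

# Small frame with a flat (non-MultiIndex) index, like the one in the original report.
df = pd.DataFrame([[1, 2], [3, 4]], index=["a", "b"], columns=["x", "y"])

# Pattern 1: unstack() on a flat index returns a Series indexed by (column, row).
flat = df.unstack()

# Pattern 2: stack() gives a MultiIndex Series, and unstack() restores the frame.
stacked = df.stack()
restored = stacked.unstack()

print(flat)
print(restored)
```

The original report exercises pattern 1, the snippet above exercises pattern 2, which would explain why the two timings differ so much.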
@sam-baumann yes, I noticed the slowdown because some tests (with real data) in our pipeline started timing out.
@sam-baumann I was wondering if you have had any chance to look at this?
Hi @sbonz - I did look a bit further into this - I think I may have to remove myself from this issue because I don't think I'm familiar enough with this part of the codebase to be of much more help here. Sorry!
@rhshadrach would it make sense to carve out a fastpath in stack_v3 for a homogeneously typed DataFrame with unique columns to just do frame._values.ravel()?
Yea - I think adding a fastpath makes sense. I'm going to make an attempt shortly.
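To make the idea concrete, here is a rough sketch of what such a fastpath could look like using only public API. This is not the pandas implementation: stack_fastpath is a hypothetical helper, it assumes a flat index, unique single-level columns, and a single dtype, and it does not handle the NaN dropping that the legacy stack() does by default.

```python
import numpy as np
import pandas as pd


def stack_fastpath(frame: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: stack a single-dtype frame by flattening its values."""
    if frame.index.nlevels != 1 or frame.columns.nlevels != 1:
        raise ValueError("sketch only handles flat index and flat columns")
    if not frame.columns.is_unique:
        raise ValueError("sketch only handles unique columns")
    if frame.dtypes.nunique() > 1:
        raise ValueError("sketch only handles a single dtype")
    # Row-major flattening matches the (row, column) order that stack() produces.
    values = frame.to_numpy().reshape(-1)
    index = pd.MultiIndex.from_product([frame.index, frame.columns])
    return pd.Series(values, index=index)


df = pd.DataFrame(
    np.arange(6).reshape(3, 2), index=["a", "b", "c"], columns=["x", "y"]
)
fast = stack_fastpath(df)
expected = df.stack()
print(fast.tolist() == expected.tolist())
```

The point is that no per-column loop is needed in this case: the result is just the 2D values flattened, with an index that is the product of rows and columns.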
Just a note on "just do frame._values.ravel()": consider arr.reshape(-1), since ravel can make a copy in some cases.
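A small NumPy-only check of that difference, as a sketch. The arrays are made up for illustration. For a C-contiguous array both calls return views, but for a strided 1D slice ravel() has to copy to produce contiguous output, while reshape(-1) can still return a view.

```python
import numpy as np

# C-contiguous 2D array: both ravel() and reshape(-1) can return views.
contiguous = np.arange(6).reshape(2, 3)
ravel_is_view = np.shares_memory(contiguous, contiguous.ravel())
reshape_is_view = np.shares_memory(contiguous, contiguous.reshape(-1))

# Strided (non-contiguous) slice: ravel() copies, reshape(-1) keeps a view.
strided = np.arange(10)[::2]
strided_ravel_is_view = np.shares_memory(strided, strided.ravel())
strided_reshape_is_view = np.shares_memory(strided, strided.reshape(-1))

print(ravel_is_view, reshape_is_view)
print(strided_ravel_is_view, strided_reshape_is_view)
```

Neither call guarantees a view in every case (a transposed 2D array is copied by both), so reshape(-1) only avoids copies in more cases, not all of them.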
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this issue exists on the latest version of pandas.
[ ] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import numpy as np
import time
df = pd.DataFrame(np.random.random(size=(10000, 100)))
st = time.time()
df.unstack()  # this operation takes 500x more in pandas>=2.1
print(f"time {time.time() -st}")
Installed Versions
Prior Performance
The same code as above is 500x faster for pandas<=2.0.3. The issue happens on Windows and Linux, with Python 3.10 and 3.12, and with both the numpy and pyarrow backends. The slowdown seems to be in the initial loop of the stack_v3 function.