[BUG-REPORT] Incredibly slow performance with shifted columns

vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀

https://vaex.io

MIT License


[BUG-REPORT] Incredibly slow performance with shifted columns #2338

Open ceberhardt opened 1 year ago

ceberhardt commented 1 year ago

Description

Hi! I'm not sure if this is a use case vaex addresses, but the following example is incredibly slow:

import pandas as pd
import vaex
import numpy as np

# Build a pandas DataFrame with 64 float columns of 2**20 rows each.
N = 2**26
df = pd.DataFrame()
array = np.linspace(1, N, N).reshape(-1, 2**6)

for i in range(2**6):
    df[f'col_{i}'] = array[:, i]

# Copy every column, then shift each copy with its own shift() call.
vaex_df = vaex.from_pandas(df)
for col in vaex_df.get_column_names():
    vaex_df[f"{col}_shift_1"] = vaex_df[col]
    vaex_df.shift(periods=1, column=f"{col}_shift_1", inplace=True)

Fast so far...

But accessing one row:

%timeit vaex_df[10]

10.8 s ± 81.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

... not anymore.

Appreciate any help on the issue!

Software information

Vaex version (import vaex; vaex.__version__):

{'vaex-core': '4.16.1',
'vaex-viz': '0.5.4',
'vaex-hdf5': '0.14.1',
'vaex-server': '0.8.1',
'vaex-jupyter': '0.8.1',
'vaex-ml': '0.18.1'}

Vaex was installed via: pip
OS: macOS Ventura 13.2 / Apple Silicon

Ben-Epstein commented 1 year ago

@ceberhardt Hello! I'm not sure exactly why this is happening (I'd expect vaex to optimize this under the hood), but you are calling shift in a suboptimal way for your use case.

If you update your code slightly to call shift only once, it's dramatically faster:

import pandas as pd
import vaex
import numpy as np

# Same test data as in the original report.
N = 2**26
df = pd.DataFrame()
array = np.linspace(1, N, N).reshape(-1, 2**6)

for i in range(2**6):
    df[f'col_{i}'] = array[:, i]

# Create the copies first and collect their names.
vaex_df = vaex.from_pandas(df)
shift_cols = []
for col in vaex_df.get_column_names():
    vaex_df[f"{col}_shift_1"] = vaex_df[col]
    shift_cols.append(f"{col}_shift_1")

# Shift all copies in a single shift() call.
vaex_df = vaex_df.shift(periods=1, column=shift_cols)

Also, when you want to test the speed of vaex, I don't recommend indexing like that, as it's not doing what you think it is. Instead, I'd suggest something like x = df[:10].to_records() or similar, which materializes the first 10 rows.
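For timing outside IPython, where the %timeit magic is not available, a small helper built on the standard-library timeit module can stand in. This is only a sketch: time_call is a hypothetical name, not part of vaex, and the workload below is a placeholder so the snippet runs without vaex installed.

```python
import timeit


def time_call(fn, repeat=5, number=1):
    """Return the best time in seconds per call of fn().

    `time_call` is a hypothetical helper name used for illustration only.
    """
    # timeit.repeat returns one total time per repetition; the minimum is
    # usually the least noisy estimate of the achievable speed.
    timings = timeit.repeat(fn, repeat=repeat, number=number)
    return min(timings) / number


# Placeholder workload; swap in the expression you actually want to measure.
best = time_call(lambda: sum(range(100_000)))
print(f"best of 5 runs: {best:.6f} s")
```

With the vaex_df from this thread, the equivalent measurement would be time_call(lambda: vaex_df[:10].to_records()), which times the materialization of the first 10 rows.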

Nonetheless, here are my results

maartenbreddels commented 1 year ago

Hah, you're becoming a vaex-pert @Ben-Epstein !

ceberhardt commented 1 year ago

@Ben-Epstein Wow, that helped a lot! Thank you very much!

vaex_df[10] is still slowish (200x), but vaex_df[:10] is as fast as I'd hoped! Thank you vaex creators!

Ben-Epstein commented 1 year ago

@maartenbreddels I'm now curious whether vaex could optimize this for the user. If shift is lazy, I don't see why these two approaches should perform differently.

maartenbreddels commented 1 year ago

Yeah, this isn't nice. Let's keep this open as a reminder.