Open ceberhardt opened 1 year ago
@ceberhardt hello! I'm not sure exactly why this is happening (since I'd expect this to have been optimized by vaex under the hood), but you are calling shift in a suboptimal way for your use case. If you update your code slightly to call shift only once, it's dramatically faster:
import pandas as pd
import vaex
import numpy as np

N = 2**26
df = pd.DataFrame()
array = np.linspace(1, N, N).reshape(-1, 2**6)
for i in range(2**6):
    df[f'col_{i}'] = array[:, i]

vaex_df = vaex.from_pandas(df)
shift_cols = []
for col in vaex_df.get_column_names():
    vaex_df[f"{col}_shift_1"] = vaex_df[col]
    shift_cols.append(f"{col}_shift_1")
vaex_df = vaex_df.shift(periods=1, column=shift_cols)
Also, when you want to test the speed of vaex, I don't recommend indexing like that, as it's not doing what you think it is. Instead, I'd suggest something like x = df[:10].to_records() or similar. That will materialize the first 10 rows, for example.
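For anyone timing this outside IPython, here is a minimal sketch of a helper built on the standard-library timeit module. The name best_time is hypothetical (it is not part of vaex), and the list slice is only a stand-in workload so the snippet runs without vaex installed.

```python
import timeit


def best_time(fn, repeat=5, number=1):
    """Return the fastest per-call wall-clock time (in seconds) over `repeat` runs of fn()."""
    return min(timeit.repeat(fn, repeat=repeat, number=number)) / number


# Stand-in workload (an assumption for illustration, not a vaex call).
data = list(range(100_000))
elapsed = best_time(lambda: data[:10])
print(f"best of 5: {elapsed:.6f} s")

# With the vaex_df from this thread, the two access patterns could be compared as:
#   best_time(lambda: vaex_df[:10].to_records())  # materializes the first 10 rows
#   best_time(lambda: vaex_df[10])                # single-row access
```

Taking the minimum over several runs reduces noise from other processes, which is roughly what %timeit reports as its best case.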
Nonetheless, here are my results
Hah, you're becoming a vaex-pert @Ben-Epstein !
@Ben-Epstein Wow, that helped a lot! Thank you very much!
vaex_df[10] is still slowish (200x), but vaex_df[:10] is as fast as I'd hoped! Thank you vaex creators!
@maartenbreddels I'm now curious whether vaex could optimize this for the user. If shift is lazy, I don't see why these two approaches should perform differently.
Yeah, this isn't nice. Let's keep this open as a reminder.
Description
Hi! I'm not sure if this is a use case addressed by vaex, but the following example is incredibly slow:
Fast so far...
But accessing one row:
%timeit vaex_df[10]
10.8 s ± 81.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
... not anymore.
Appreciate any help on the issue!
Software information
Vaex version (import vaex; vaex.__version__):