vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] #2309

Open MPanek6 opened 1 year ago

MPanek6 commented 1 year ago

Description

I'm currently working with a large dataset (1.6 million rows) read in from an HDF5 file via `vx.open()`, to which I then apply column filters as well as a column selection. However, when performing aggregations I get a `MemoryError: bad allocation` error.

Filtering is done via: `df = df[df[filter_col].isin(filter_values)]`
Column selection is done via: `df = df[list_of_columns]`
Aggregation is done via: `df = df.groupby(by=group_by_cols, agg={col: vx.agg.sum(col) for col in cols_to_aggregate})`

All of this works perfectly fine with a smaller dataset (500k rows). Additionally, applying the column selection first, then the aggregations, and finally the filters also works, but this hurts performance considerably, since the aggregations are then performed on the whole DataFrame as opposed to a filtered-down version.

Essentially, the issue occurs when performing aggregations on a large DataFrame that has been filtered beforehand.

Software information