Description
I'm currently working with a large dataset (~1.6 million rows) read in from an HDF5 file via vx.open(), to which I'm applying column filters as well as a column selection. However, when performing aggregations I get a MemoryError: bad allocation error.
Filtering done via:
df = df[df[filter_col].isin(filter_values)]
Column selection done via:
df = df[list_of_columns]
Aggregation done via:
df = df.groupby(by=group_by_cols, agg={col: vx.agg.sum(col) for col in cols_to_aggregate})
All of this works perfectly fine when working with a smaller dataset (500k rows).
Additionally, first applying the column selection, then the aggregations, and finally the filters also works fine; however, this hurts performance considerably, as the aggregations are performed on the whole DataFrame as opposed to a filtered-down version.
Essentially, the issue seems to occur when attempting to perform aggregations on a large DataFrame that has been filtered beforehand.
Software information
Vaex version (import vaex; vaex.__version__): 4.16.0