vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io

[BUG-REPORT] Large Groupby Agg runs out of memory #2400

Open meta-ks opened 8 months ago

meta-ks commented 8 months ago

Description

First, thank you for this wonderful library. It handles many pandas operations well under memory constraints (except perhaps cumsum(), which I'm eagerly waiting for). I have an Arrow file of ~8 GB that I load into a vaex DataFrame of shape (27_416_244, 32). Available system RAM: ~8 GB. I run a groupby aggregation like this:

# summary_df is a multi-index pandas DataFrame with 76k rows, 20 cols
index_names = list(summary_df.index.names)
strfmt = '%Y-%m-%d'
vdf['_Period'] = vdf['Date'].dt.strftime(strfmt)

gd_column_ops_map = {
    'PnL % Capital': 'sum', 'PnL': 'sum', '% High': 'mean',
    '% Close': 'mean', '% Low': 'mean', 'Charges': 'sum', 'Sell Val': 'sum', 'Buy Val': 'sum',
    'Qty': 'sum', 'Cash Flow': 'sum'
}
grpby_cols = index_names + ['_Period']

# [Kernel CRASHES on the next line, after the groupby happens, perhaps in agg]
grp_trades_vdf = vdf.groupby(grpby_cols, progress=True).agg(gd_column_ops_map)
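
For reference, here is a self-contained sketch that reproduces the same groupby/agg pattern on synthetic data (the column names, row count, and group-key cardinality below are illustrative stand-ins, not the real trades data):

import numpy as np
import vaex

# Synthetic stand-in for the real ~27M-row trades DataFrame (sizes are illustrative).
n = 1_000_000
rng = np.random.default_rng(0)
vdf = vaex.from_arrays(
    Symbol=rng.choice([f"SYM{i}" for i in range(1_000)], n),
    Date=(np.datetime64("2020-01-01")
          + rng.integers(0, 365, n).astype("timedelta64[D]")).astype("datetime64[ns]"),
    PnL=rng.standard_normal(n),
    Qty=rng.integers(1, 100, n),
)

# Derive the string period column, as in the snippet above.
vdf["_Period"] = vdf["Date"].dt.strftime("%Y-%m-%d")

# Same shape of operation: multiple group keys, several aggregations.
grp = vdf.groupby(["Symbol", "_Period"], progress=True).agg({"PnL": "sum", "Qty": "sum"})
print(grp)

Scaling n and the number of distinct group keys up toward the real data is where I would expect the memory blow-up to appear.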

Software information