vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

Broken pipe error when Groupby on Timestamp for a data with ~200 million rows #2352

Open AarzooDhiman opened 1 year ago

AarzooDhiman commented 1 year ago

The following statement works fine on a sample of rows (say 100,000), but when I run it on the whole dataset (~200 million rows) I get a broken pipe error due to excessive CPU and memory usage.

df2 = df.groupby(vaex.BinnerTime.per_week(df.TIMESTAMP)).agg({'index': 'count'})

The exact error is Errno 32: Broken pipe, raised by multiple pool worker processes, for example Process ForkPoolWorker-23.

Additionally, I am seeing the error KeyError: "Unknown variables or column: 'lambda_function(__TIMESTAMP)'". This also does not occur with the sample data. Could the TIMESTAMP column itself be causing the problem?

I can work around this by splitting the data, but is there another fix that would let me process the whole dataset at once?
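For reference, the splitting workaround mentioned above can be sketched roughly as follows. This is a minimal illustration in plain NumPy rather than vaex, so it does not reproduce the original groupby call; the function name `weekly_counts_chunked` and the `chunk_size` parameter are hypothetical. It assumes the timestamps are available as a `datetime64` array without missing values, and it uses NumPy's week bins (aligned to Thursdays, because the epoch 1970-01-01 was a Thursday), which may not match the bin edges vaex produces.

```python
import numpy as np


def weekly_counts_chunked(timestamps, chunk_size=1_000_000):
    """Count rows per week by processing `timestamps` one chunk at a time.

    `timestamps` is assumed to be a NumPy datetime64 array with no NaT values.
    Returns a dict mapping each week start (datetime64[W]) to its row count.
    """
    totals = {}
    for start in range(0, len(timestamps), chunk_size):
        chunk = timestamps[start:start + chunk_size]
        # Truncate every timestamp in this chunk to its week bin.
        weeks = chunk.astype("datetime64[W]")
        # Partial counts for this chunk only.
        uniq, counts = np.unique(weeks, return_counts=True)
        # Merge the partial counts into the running totals.
        for week, count in zip(uniq, counts):
            totals[week] = totals.get(week, 0) + int(count)
    # Return the bins in chronological order.
    return dict(sorted(totals.items()))


if __name__ == "__main__":
    ts = np.array(
        ["2023-01-02T10:00", "2023-01-03T11:00", "2023-01-10T09:00"],
        dtype="datetime64[s]",
    )
    for week, count in weekly_counts_chunked(ts, chunk_size=2).items():
        print(week, count)
```

Because counts are additive, the merged per-chunk result equals what a single pass over all rows would give, while only one chunk's intermediate arrays are held in memory at a time.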