The following statement works fine for a sample of rows (say, 100,000), but when I run it on the whole dataset (~200 million rows), I get a broken pipe error, due to excessive CPU and memory usage.
df2 = df.groupby(vaex.BinnerTime.per_week(df.TIMESTAMP)).agg({'index': 'count'})
The exact error is Errno 32: Broken pipe error, raised from multiple pool workers, for example Process ForkPoolWorker-23:
Additionally, I am seeing the error KeyError: "Unknown variables or column: 'lambda_function(__TIMESTAMP)'". It works fine with the sample data. Is it possible that the TIMESTAMP column is causing the issue?
I can work around this by splitting the data, but is there any other fix that would let me process the whole dataset at once?
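For reference, the splitting workaround I have in mind looks roughly like the sketch below. It is plain Python rather than vaex code, only meant to show the idea of counting rows per week one chunk at a time and then merging the partial counts. The names week_start and weekly_counts_in_chunks are hypothetical, and starting each week on Monday is an assumption that may not match how vaex bins weeks.

```python
from collections import Counter
from datetime import datetime, timedelta


def week_start(ts):
    """Return midnight on the Monday of the week containing ts (assumed week boundary)."""
    day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return day - timedelta(days=day.weekday())


def weekly_counts_in_chunks(timestamps, chunk_size):
    """Count rows per week chunk by chunk, merging the partial counts as we go."""
    total = Counter()
    for start in range(0, len(timestamps), chunk_size):
        chunk = timestamps[start:start + chunk_size]
        # Only one chunk is processed at a time; the running total stays small
        # because it holds one entry per week, not one per row.
        total.update(week_start(ts) for ts in chunk)
    return dict(sorted(total.items()))


if __name__ == "__main__":
    sample = [
        datetime(2021, 1, 4, 10, 0),
        datetime(2021, 1, 5),
        datetime(2021, 1, 11),
        datetime(2021, 1, 17, 23, 59),
    ]
    for week, count in weekly_counts_in_chunks(sample, chunk_size=3).items():
        print(week.date(), count)
```

Because the per-week counts are simply added together, the result should not depend on the chunk size, which is why I would expect a single-pass version to be possible as well.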