vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Aggregation using "list" eats all memory and crashes kernel #2246

Closed: LiamNiisan closed this issue 1 year ago

LiamNiisan commented 1 year ago

Description
After using group by on a larger-than-memory text file (.hd5, ~40GB), I tried to do an aggregation using "list". The progress bar gets to around 15%, then the memory fills up and the kernel usually crashes. I tried this many times, changing the query a bit (agg inside group by, "list" or vaex.agg.list, and different params), and also restarted the computer; the results are consistent.

The code looks like this:

import vaex

df = vaex.open("text-file.hd5")

with vaex.progress.tree("rich", title="My Vaex Computations"):
    df = df.groupby(["id"]).agg({"text_col": "list"})

Software information

JovanVeljanoski commented 1 year ago

Hi, thanks for the report.

Can you provide some more info about the dataset? Something like the number of rows and the columns it contains.

The groupby of vaex is out of core, but the resulting aggregated dataframe is in memory. How much available RAM do you have, and can you estimate the size of the output?
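As a rough way to get that estimate, here is a minimal sketch, assuming the column names `id` and `text_col` from the report and using character counts as a proxy for bytes:

```python
# Minimal sketch for estimating the size of the groupby output.
# Column names "id" and "text_col" are taken from this report; the
# estimate is rough (character counts, not exact byte sizes).
import vaex

df = vaex.open("text-file.hd5")

n_groups = df["id"].nunique()                 # rows in the aggregated result
total_chars = df["text_col"].str.len().sum()  # total characters of text to keep

print(f"~{n_groups:,} groups, roughly {total_chars / 1e9:.1f} GB of text to hold in memory")
```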

We might need to recreate an equivalent dataset to debug this.

LiamNiisan commented 1 year ago

Number of rows: 300,000,000. After aggregation it should go down to around 20,000,000 rows. For the columns, there is just a text column and an id column that I wanted to group by.

I have something like 16GB of RAM available, and since the dataset is around 40GB, I can't hold it in memory.

Is there a way to do aggregations out of memory?

JovanVeljanoski commented 1 year ago

Ah ok, I think I understand what is happening. There is no bug here as far as I can tell.

In your case it is kind of like you are doing a "reshaping" of the data rather than a standard aggregation. Your full dataset is ~40GB, and since you only have 2 columns (id + text data), most of the data size lies in the text column.

So when you do aggregation, you collapse the id column, but the total size of the text data remains the same.

Usually, when doing aggregations we reduce the data (e.g. a mean produces a single number out of many samples). With the list aggregator we still get one sample per group, but in the textual case that sample can be huge, as it is here.
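To illustrate the contrast, a sketch using the column names from this report (the aggregator strings are the ones already shown above):

```python
import vaex

df = vaex.open("text-file.hd5")

# A reducing aggregator: one small value per group, so the ~20M-row result fits in RAM.
counts = df.groupby("id", agg={"text_col": "count"})

# The list aggregator: one list of all text values per group, so the result
# still carries roughly the full ~40GB of text and must fit in memory.
texts = df.groupby("id", agg={"text_col": "list"})
```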

So while the groupby is done out of core, and the aggregation could in principle also be done out of core, we need to store the result somewhere, and the result is currently kept in memory. In this case the output of the groupby is too large for your memory capacity, which is why it fails. But this is expected behavior, unfortunately.

We have had various discussions in the past on how we can improve this (to also allow for fully custom aggregators, etc.). There are some ideas (more like thought experiments at this point), but to take this on seriously we would require dedicated time and funding, since it is a rather non-trivial problem as far as I can tell (if you are interested in that, contact us via vaex.io).

I hope the above gives a bit of background on what is happening.

As a possible workaround, I wonder about exporting the data in a smart-ish way, say one file per group (that should be easy: just filter by group and export; see the sketch below). Then use some Linux terminal commands to put the text column into a single row, with a prefix that is the group name (index, whatever). Perhaps some magic with sed or awk is possible there. Then it is easy to join the separate csv files into one. And if needed, you can continue working with vaex after that, since vaex can read csv files in an out-of-core manner now.
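A minimal sketch of the per-group export step, assuming the column names `id` and `text_col` from this thread (illustrative only; with millions of groups this produces a very large number of files):

```python
import vaex

df = vaex.open("text-file.hd5")

# One CSV per group: filter on the group value and export that slice.
# The per-group files can then be flattened/joined with sed/awk as described above.
for group_id in df["id"].unique():
    df[df["id"] == group_id].export_csv(f"group_{group_id}.csv")
```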

Perhaps there is a better way than what I suggest above.

LiamNiisan commented 1 year ago

Yeah that makes sense, thank you for the quick response!