vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Joining 2 DataFrames Causing Kernel to die out. [Dataset and Code to replicate Attached] #2045

Open maharshi4697 opened 2 years ago

maharshi4697 commented 2 years ago

These are the 2 datasets that I primarily want to use Vaex for.

sample_scenarios.csv sample_teams.csv


import vaex
from datetime import datetime

# paths to the attached sample files (assumed)
teams_csv = 'sample_teams.csv'
scenarios_csv = 'sample_scenarios.csv'

vaex_teams = vaex.from_csv(teams_csv, convert=True, chunk_size=5_000_000)
vaex_scenarios = vaex.from_csv(scenarios_csv, convert=True, chunk_size=5_000_000)

print(len(vaex_teams))
print(len(vaex_scenarios))

print(f'{datetime.now()} Joining')

vaex_joined = vaex_teams.join(vaex_scenarios, on='player__id', allow_duplication=True)

print(len(vaex_joined))

print(f'{datetime.now()} Grouping')

vaex_grouped_by = vaex_joined.groupby(vaex_joined[['team_id', 'scenario_id']], agg=vaex.agg.sum(vaex_joined['total_points']))

print(len(vaex_grouped_by))

print(f'{datetime.now()} Converting')

df = vaex_grouped_by.to_pandas_df(["team_id", "scenario_id", "total_points_sum"])

print(len(df))

print(f'{datetime.now()} Ranking')

df['points_sum_rank'] = df.groupby('scenario_id')['total_points_sum'].rank(ascending=False)

print(f'{datetime.now()} Ending')

It is able to read the data, but the kernel dies when trying to join. I am currently using a 2019 MacBook Pro with 16 GB of RAM, but I need to run this in a low-cost AWS environment.

According to what is advertised, Vaex should easily be able to join this data. Am I doing something wrong?

maharshi4697 commented 2 years ago

Update:

When I reduced the chunk size, it ran.

[screenshot attached: Screenshot 2022-05-07 at 1.59.19 PM]

But as you can see, 15 minutes is too long. Plus, it took 80 GB of memory at one point.
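(For reference, a minimal sketch of that change, with a hypothetical smaller chunk size; the exact value used is not shown here:)

vaex_teams = vaex.from_csv(teams_csv, convert=True, chunk_size=1_000_000)
vaex_scenarios = vaex.from_csv(scenarios_csv, convert=True, chunk_size=1_000_000)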

JovanVeljanoski commented 2 years ago

See https://github.com/vaexio/vaex/issues/2018#issuecomment-1116527477

JovanVeljanoski commented 2 years ago

I looked into your problem a bit.

Even with this small sample of data you provide, given the high level of duplication in the join keys, the join will produce a dataframe of roughly 300M rows! (One column of which has to be held in memory for the current implementation.) The fact that this works at all is great if you ask me!
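(A rough way to see where that number comes from, as a sketch using the dataframes from the snippet above: count how many rows each player__id has on each side and sum the products, since with allow_duplication=True every matched key contributes left-count × right-count output rows, while unmatched left keys keep their original rows in a left join.)

# value_counts() returns a pandas Series indexed by player__id
left_counts = vaex_teams['player__id'].value_counts()
right_counts = vaex_scenarios['player__id'].value_counts()

# matched keys: product of occurrences on both sides; unmatched left keys keep their rows
matched_rows = (left_counts * right_counts).dropna().sum()
unmatched_rows = left_counts[~left_counts.index.isin(right_counts.index)].sum()
print(int(matched_rows + unmatched_rows))  # estimated size of the joined dataframe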

One way to handle this would be to do the join once, export the resulting dataframe to disk as HDF5, and then continue working with that.

maharshi4697 commented 2 years ago

JovanVeljanoski

Apologies for missing your comment, and thank you for getting back to me so fast! This is not a subset but the entire dataset. What is confusing to me at this point is that even though it is a 300-million-row dataframe, Vaex is advertised to handle close to a billion rows with ease. So what is the exact problem here?

JovanVeljanoski commented 2 years ago

In principle, Vaex can handle as much data as you can hold on disk, so billions of rows indeed. This assumes that the data is in a memory-mappable format, and on disk.

When you do a join, you are essentially creating a new dataset, and in order to "link" the relation from the "left" dataset to the "right" one, we need to create a single column (the joining key), which has to fit in memory.

So for example, if you can afford to do the join once, export the joined dataframe to disk in HDF5/Arrow/Parquet format, and read that back, you can do all subsequent operations rather fast.
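(A minimal sketch of that workflow, assuming the column names from the snippet above; the .hdf5 file names and the output path joined.hdf5 are assumptions, the converted files being what vaex.from_csv(..., convert=True) writes next to the CSVs:)

import vaex

# open the memory-mapped files produced earlier by vaex.from_csv(..., convert=True)
vaex_teams = vaex.open('sample_teams.csv.hdf5')
vaex_scenarios = vaex.open('sample_scenarios.csv.hdf5')

# pay the join cost once...
vaex_joined = vaex_teams.join(vaex_scenarios, on='player__id', allow_duplication=True)

# ...and materialize the result to disk in a memory-mappable format
vaex_joined.export_hdf5('joined.hdf5')

# subsequent sessions work off the exported file and stay out-of-core
df = vaex.open('joined.hdf5')
grouped = df.groupby(['team_id', 'scenario_id'], agg=vaex.agg.sum('total_points'))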

maartenbreddels commented 2 years ago

I think this can be improved; the current workaround for now might be:

import os
os.environ['VAEX_NUM_THREADS'] = '1'  # limit vaex to a single worker thread, set before importing vaex
import vaex

So that we use as little memory as possible. I think, though, that the memory usage is still excessive, so I am keeping this open.