vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License
8.28k stars 590 forks source link

How to increase join performance #2239

Closed hermidalc closed 1 year ago

hermidalc commented 1 year ago

When performing a join on two large dataframes (each with only a 2 or 3 columns, but 10s to 100s of millions of rows, and allow_duplication=True), how do I improve Vaex performance? It will sometimes take well over an hour and much of that time Vaex is running single-threaded. Is there a way to improve performance? (like presorting the join column in each separate dataframe, or something else?)

hermidalc commented 1 year ago

In fact for a particular join I'm doing of a df with 3 cols x 250 million rows with a df with 2 cols x 250 millions rows, it's taking forever still running mostly single-threaded for 2 hours. The first df was filtered from a df with 1 billion rows, wondering if that makes a difference. The second df isn't filtered.