Closed hermidalc closed 1 year ago
In fact, one particular join, of a df with 3 cols x 250 million rows against a df with 2 cols x 250 million rows, has been running for 2 hours, mostly single-threaded, and still hasn't finished. The first df was filtered from a df with 1 billion rows, and I'm wondering whether that makes a difference. The second df isn't filtered.
When performing a join on two large dataframes (each with only 2 or 3 columns, but tens to hundreds of millions of rows, and with allow_duplication=True), how do I improve Vaex performance? The join sometimes takes well over an hour, and for much of that time Vaex is running single-threaded. Would something like presorting the join column in each dataframe help, or is there another approach?