Improving dataframe join performance

mannharleen commented 4 years ago

A couple of things:

Add a benchmark test for dataframe join.
Join currently supports only a nested loop algorithm. Implement other algorithms, e.g. hash join or maybe even merge join?

Looking for inputs here.

JKOK005 commented 3 years ago

This seems like a valid problem.

I ran a brief benchmark, joining 2 dataframes containing 43K rows. The join columns contain unique values, so each row in df_A can match at most one row in df_B.

An inner join with go-gota took 37.68s.

In contrast, the same logic executed with Pandas in Python took barely 1s.

From the looks of the present implementation, go-gota is indeed implementing a nested loop join, which can be inefficient for large datasets.
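
To illustrate the gap, here is a minimal standalone sketch contrasting the two approaches. It does not use gota's actual types or API: `Row`, `Pair`, `nestedLoopJoin` and `hashJoin` are hypothetical names made up for this example, and rows are reduced to an integer key plus a string payload. If both sides hold about 43K rows, the nested loop performs roughly 1.8 billion key comparisons, while the hash version does one map insert per right row and one lookup per left row.

```go
package main

// Row is a hypothetical stand-in for one dataframe record:
// a join key plus a payload. It is not a gota type.
type Row struct {
	Key   int
	Value string
}

// Pair holds one matched row from each side of an inner join.
type Pair struct {
	Left, Right Row
}

// nestedLoopJoin compares every left row with every right row,
// so its cost grows as len(left) * len(right).
func nestedLoopJoin(left, right []Row) []Pair {
	var out []Pair
	for _, l := range left {
		for _, r := range right {
			if l.Key == r.Key {
				out = append(out, Pair{l, r})
			}
		}
	}
	return out
}

// hashJoin builds a map from key to right rows once, then probes it
// for each left row. Expected cost is about len(left) + len(right)
// plus the number of matches.
func hashJoin(left, right []Row) []Pair {
	index := make(map[int][]Row, len(right))
	for _, r := range right {
		index[r.Key] = append(index[r.Key], r)
	}
	var out []Pair
	for _, l := range left {
		for _, r := range index[l.Key] {
			out = append(out, Pair{l, r})
		}
	}
	return out
}
```

Both functions return matches in the same order (left order first, then right order within a key), so one can be checked against the other in a test.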

Is there a roadmap to address this issue? If not, could I try submitting a PR with hash join and merge join implementations? I believe those would speed up joins.
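
For the merge join side, a rough sketch of the idea under simplifying assumptions: integer keys, inner join only, and both inputs sorted inside the function. `KeyedRow`, `Match` and `mergeJoin` are hypothetical names for illustration, not part of gota. After sorting, two cursors advance through the inputs in step, and runs of equal keys are paired up.

```go
package main

import "sort"

// KeyedRow is a hypothetical record with a join key and a payload.
type KeyedRow struct {
	Key   int
	Value string
}

// Match holds one matched row from each side of an inner join.
type Match struct {
	Left, Right KeyedRow
}

// mergeJoin sorts copies of both inputs by key, then walks them
// with two cursors, emitting every pair of rows with equal keys.
func mergeJoin(left, right []KeyedRow) []Match {
	// Sort copies so the caller's slices are left untouched.
	l := append([]KeyedRow(nil), left...)
	r := append([]KeyedRow(nil), right...)
	sort.SliceStable(l, func(i, j int) bool { return l[i].Key < l[j].Key })
	sort.SliceStable(r, func(i, j int) bool { return r[i].Key < r[j].Key })

	var out []Match
	i, j := 0, 0
	for i < len(l) && j < len(r) {
		switch {
		case l[i].Key < r[j].Key:
			i++
		case l[i].Key > r[j].Key:
			j++
		default:
			// Find the end of the run of equal keys on each side.
			key := l[i].Key
			iEnd, jEnd := i, j
			for iEnd < len(l) && l[iEnd].Key == key {
				iEnd++
			}
			for jEnd < len(r) && r[jEnd].Key == key {
				jEnd++
			}
			// Emit the cross product of the two runs.
			for a := i; a < iEnd; a++ {
				for b := j; b < jEnd; b++ {
					out = append(out, Match{l[a], r[b]})
				}
			}
			i, j = iEnd, jEnd
		}
	}
	return out
}
```

The sort dominates the cost here, so this mainly pays off when the inputs are already ordered by the join key; otherwise a hash join is usually the simpler first step.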

Thanks

chrmang commented 3 years ago

Hi,

feel free to open a PR. Thank you for contributing.

Chris

go-gota / gota

Improving dataframe join performance #110