mannharleen opened 4 years ago
This seems like a valid problem.
I ran a brief benchmark joining two dataframes of 43K rows each. The join columns contain unique values, so each row in df_A matches at most one row in df_B.
An inner join in go-gota took 37.68s.
In contrast, the same join using Pandas
in Python took barely 1s.
From the looks of the present implementation, go-gota
does use a nested loop join, which is O(n*m) and can be very inefficient for large datasets.
Can I check whether there is a roadmap to address this issue? If not, would it be possible for me to try to submit a PR implementing hash join and merge join? I believe those would significantly speed up joins.
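For the unique-key case benchmarked above, a hash join builds a map over one side and probes it with the other, which is expected O(n + m) rather than the O(n*m) of a nested loop. A minimal sketch of the idea in Go (the `Row` type and `hashJoin` function are hypothetical illustrations, not go-gota's actual API):

```go
package main

import "fmt"

// Row is a simplified stand-in for a dataframe row; go-gota's real
// types differ. This only illustrates the hash join technique.
type Row struct {
	Key   int
	Value string
}

// hashJoin inner-joins a and b on Key in expected O(len(a)+len(b)) time,
// versus O(len(a)*len(b)) for a nested loop join.
func hashJoin(a, b []Row) [][2]Row {
	// Build phase: index one side by join key.
	// Keys may repeat in general, so we keep a slice per key.
	index := make(map[int][]Row, len(a))
	for _, ra := range a {
		index[ra.Key] = append(index[ra.Key], ra)
	}
	// Probe phase: look up each row of the other side in the index.
	var out [][2]Row
	for _, rb := range b {
		for _, ra := range index[rb.Key] {
			out = append(out, [2]Row{ra, rb})
		}
	}
	return out
}

func main() {
	a := []Row{{1, "a1"}, {2, "a2"}, {3, "a3"}}
	b := []Row{{2, "b2"}, {3, "b3"}, {4, "b4"}}
	for _, p := range hashJoin(a, b) {
		fmt.Printf("%s-%s\n", p[0].Value, p[1].Value)
	}
}
```

A merge join would instead sort both sides by key and scan them in lockstep, which wins when the inputs are already sorted or when the hash table would not fit in memory.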
Thanks
Hi,
feel free to open a PR. Thank you for contributing.
Chris
A couple of things:
Looking for inputs here.