I have been using this repository as a SOTA baseline for our paper Retrieve, Merge, Predict (https://arxiv.org/abs/2402.06282). I ran into problems with the discovery step: the set intersection took so long that it was impractical to run on our larger data lakes.
I was able to address the issue with the very simple change in the PR, i.e. changing the way the set is built:
# from
seta = set(table_a[col_a])
# to
seta = set(table_a[col_a].unique())
I thought I'd share the change, as it is a very small modification that allowed us to run the code in a reasonable time on larger data lakes.
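For illustration only, here is a minimal sketch of the effect on synthetic data. It is not the repository's discovery code: the table contents, the sizes, and the helper names `overlap_naive` and `overlap_unique` are hypothetical. It assumes a join column with many repeated values, which is where deduplicating inside pandas before calling `set()` should matter most. Timings will vary with the machine and the data.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical data: 200,000 rows per table but only 1,000 distinct values each,
# mimicking a candidate join column with heavy repetition.
rng = np.random.default_rng(0)
table_a = pd.DataFrame({"key": rng.integers(0, 1_000, size=200_000)})
table_b = pd.DataFrame({"key": rng.integers(500, 1_500, size=200_000)})


def overlap_naive(a: pd.Series, b: pd.Series) -> set:
    # set() iterates over every row, duplicates included.
    return set(a) & set(b)


def overlap_unique(a: pd.Series, b: pd.Series) -> set:
    # Series.unique() deduplicates first, so set() only sees distinct values.
    return set(a.unique()) & set(b.unique())


start = time.perf_counter()
naive = overlap_naive(table_a["key"], table_b["key"])
naive_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = overlap_unique(table_a["key"], table_b["key"])
fast_seconds = time.perf_counter() - start

# Both approaches should yield the same intersection.
print(f"same result: {naive == fast}")
print(f"set(col): {naive_seconds:.4f}s, set(col.unique()): {fast_seconds:.4f}s")
```

The intersection is the same either way; only the cost of building the sets changes, and the gap should grow with the number of duplicated rows.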
I made some substantial changes to the discovery script in the main branch of my fork to ensure it was compatible with my own codebase, but I don't think they would be relevant to this repository, so they're not in this PR.
Thanks for preparing an understandable repository!