megagonlabs / starmie

Resources for PVLDB 2023 submission

Small changes to optimize the `check_table_pair` function #4

Open rcap107 opened 1 month ago

rcap107 commented 1 month ago

Hello!

I have been using this repository as a SOTA baseline for our paper Retrieve, Merge, Predict (https://arxiv.org/abs/2402.06282). I ran into problems with the discovery step: the set intersection step took so long that it was impractical to run on our larger data lakes.

I was able to address the issue with the very simple change in the PR, i.e. changing the way the set is built:

# from
seta = set(table_a[col_a])
# to
seta = set(table_a[col_a].unique())

I thought I'd share the change as it is a very small modification that allowed us to run the code in reasonable time on larger data lakes.
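
For illustration, here is a minimal sketch of the same idea on toy data. The helper name `column_value_set` and the example tables are hypothetical and not taken from the repository. It assumes columns without missing values, and my guess is that the speedup comes from pandas deduplicating the column before Python builds the set, which matters most when a column has many repeated values.

```python
import pandas as pd


def column_value_set(table: pd.DataFrame, col: str) -> set:
    """Hypothetical helper: distinct values of one column as a Python set.

    Deduplicates with pandas first, so the set is built from the
    distinct values only rather than from every row.
    """
    return set(table[col].unique())


# Toy tables with heavily repeated values (made-up data).
table_a = pd.DataFrame({"city": ["Paris", "Rome", "Paris", "Oslo"] * 1000})
table_b = pd.DataFrame({"town": ["Rome", "Oslo", "Lima"] * 1000})

seta = column_value_set(table_a, "city")
setb = column_value_set(table_b, "town")

# Same result as the original construction on this data.
assert seta == set(table_a["city"])

# Set intersection, as in the discovery step.
overlap = seta & setb
```

On columns like these the resulting sets are identical to the ones built with `set(table[col])`, so the intersection itself does not change.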

I also made substantial changes to the discovery script in the main branch of my fork to make it compatible with my own codebase. I don't think they are relevant to this repository, so they are not included in this PR.

Thanks for preparing an understandable repository!