Hey @Tratori, thank you very much for checking out the code and for the fix. I expect there are more places in the code that need similar fixes; unfortunately, I never had the time to optimize it properly.
Hello There,
We currently use EmbDI for schema matching and ran into performance issues when matching large datasets with big intersections (2 datasets of ~30k rows and ~150 columns each, with an intersection of ~60k values). During edgelist generation, every df value has to be checked for membership in the intersection. Currently, the intersection is stored as a list, so each check scans the list, which leads to a roughly quadratic runtime (~9 million checks, each iterating over the whole 60k-element list).
Changing the intersection to a set and avoiding exception generation by using a type check dramatically reduces the time spent in that step (> 6 hours -> < 2 minutes).
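For anyone reading along, here is a minimal sketch of the two patterns described above. It is not the actual EmbDI code: the function names, the `.strip()` normalization, and the sample data are all hypothetical and only illustrate list membership plus exception handling versus set membership plus a type check.

```python
def hits_slow(values, intersection_list):
    """Hypothetical 'before' pattern: list lookup and exception-based filtering."""
    hits = []
    for v in values:
        try:
            # Assumption: non-string cells (NaN, numbers) fail here and are skipped.
            token = v.strip()
        except AttributeError:
            continue
        # `in` on a list scans it element by element: O(len(list)) per check.
        if token in intersection_list:
            hits.append(token)
    return hits


def hits_fast(values, intersection):
    """Hypothetical 'after' pattern: set lookup and an explicit type check."""
    lookup = set(intersection)  # built once; membership is O(1) on average
    hits = []
    for v in values:
        # A type check avoids raising and catching an exception per bad cell.
        if not isinstance(v, str):
            continue
        token = v.strip()
        if token in lookup:
            hits.append(token)
    return hits


# Made-up sample data standing in for df values and the intersection.
values = ["a", " b ", float("nan"), "z", 3]
intersection = ["a", "b", "c"]

# Both variants select the same values; only the cost per check differs.
assert hits_slow(values, intersection) == hits_fast(values, intersection)
```

With ~9 million checks against ~60k entries, the list version does on the order of 10^11 comparisons in the worst case, while the set version does roughly one hash lookup per check.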
Greetings from Potsdam