rcap107 / embdi

EmbDI is a table embeddings algorithm that solves data integration problems by converting tabular data into graphs, then applying word2vec to the graph to obtain embeddings.
Apache License 2.0

Performance improvements in edgelist generation #2

Closed Tratori closed 7 months ago

Tratori commented 7 months ago

Hello there,

We currently use EmbDI for schema matching and ran into performance issues when matching large datasets with large intersections (two tables of ~30k rows and ~150 columns each, with ~60k intersecting values). During edgelist generation, every df value has to be checked for membership in the intersection. The intersection is currently stored as a list, so each check is a linear scan, which leads to roughly quadratic runtime (~9 million lookups, each scanning up to the whole 60k-element list).

Changing the intersection to a set and replacing exception handling with a type check dramatically reduces the time spent in that step (from more than 6 hours to under 2 minutes).
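
For illustration, the two changes can be sketched roughly as follows. This is a minimal, standalone example and not the actual EmbDI code: the function names `tag_cells_slow` and `tag_cells_fast`, the NaN-skipping step, and the sample data are all assumptions made up for the sketch.

```python
import math


def tag_cells_slow(values, intersection):
    """Hypothetical baseline: list membership plus try/except per cell."""
    tagged = []
    for v in values:
        try:
            # math.isnan raises TypeError for non-numeric values such as str,
            # so every string cell pays for raising and catching an exception.
            if math.isnan(v):
                continue
        except TypeError:
            pass
        # `in` on a list is a linear scan over the whole intersection.
        tagged.append((v, v in intersection))
    return tagged


def tag_cells_fast(values, intersection):
    """Hypothetical fix: set membership plus an explicit type check."""
    lookup = set(intersection)  # built once; average O(1) per lookup
    tagged = []
    for v in values:
        # The isinstance check avoids raising an exception for strings.
        if isinstance(v, float) and math.isnan(v):
            continue
        tagged.append((v, v in lookup))
    return tagged


# Made-up sample data, only to show that both variants agree.
sample_values = ["a", "b", float("nan"), 3, "c"]
sample_intersection = ["a", 3]
slow_result = tag_cells_slow(sample_values, sample_intersection)
fast_result = tag_cells_fast(sample_values, sample_intersection)
```

Both variants return the same tags; only the cost per cell differs. With a list, each lookup grows with the size of the intersection, while the set lookup stays roughly constant, which is where most of the speedup would be expected to come from.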

Greetings from Potsdam

rcap107 commented 7 months ago

Hey @Tratori, thank you very much for checking out the code and for the fix. I expect there are more places in the code that need similar fixes; unfortunately, I never had the time to optimize it properly.