Performance improvements in edgelist generation

rcap107 / embdi

EmbDI is a table embeddings algorithm that solves data integration problems by converting tabular data into graphs, then applying word2vec to the graph to obtain embeddings.

Apache License 2.0


Hello there,

We currently use EmbDI for schema matching and ran into performance issues when matching large datasets with big intersections (~30k rows × ~150 columns × 2, and ~60k intersecting values). During edgelist generation, each df value has to be checked for membership in the intersection. Currently, the intersection is stored as a list, so every check is a linear scan and the overall runtime is roughly quadratic (~9 million lookups, each iterating over the whole 60k-element list in the worst case).

Changing the intersection to a set, and using a type check instead of raising and catching exceptions, dramatically reduces the time spent in that step (from more than 6 hours to under 2 minutes).
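To illustrate the idea, here is a minimal sketch of the two changes. It is not the actual EmbDI code: the names `is_missing` and `cells_in_intersection` are hypothetical, and it assumes that the values to skip are `None` or float NaN, which may differ from what the edgelist code really handles.

```python
import math


def is_missing(value):
    # Hypothetical helper: an explicit type check replaces a try/except
    # around an operation that fails on missing values.
    return value is None or (isinstance(value, float) and math.isnan(value))


def cells_in_intersection(rows, intersection):
    # Build the set once: membership tests are O(1) on average,
    # instead of O(len(intersection)) for a list.
    lookup = set(intersection)
    hits = []
    for row in rows:
        for value in row:
            if is_missing(value):
                continue  # skipped without raising an exception
            if value in lookup:
                hits.append(value)
    return hits


rows = [["a", float("nan"), "b"], ["c", None, "a"]]
shared = cells_in_intersection(rows, ["a", "c"])
```

With a list, each of the ~9 million lookups may scan up to 60k elements; with a set, the cost per lookup no longer depends on the size of the intersection.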

Greetings from Potsdam


Performance improvements in edgelist generation #2