rcap107 / embdi

EmbDI is a table embeddings algorithm that solves data integration problems by converting tabular data into graphs, then applying word2vec to the graph to obtain embeddings.
Apache License 2.0

Fixing imports #9

Open rcap107 opened 6 months ago

rcap107 commented 6 months ago

Imports in EmbDI are a mess, mostly because of the data preprocessing package. I should rewrite that package so that its imports are no longer an issue.

The problematic imports are similarity and datasketch.

datasketch is available on conda, but not from the main repository. similarity, on the other hand, is an obscure pip package that provides the Levenshtein distance function. datasketch is used for its MinHash encoder, which is also implemented by [dirty_cat](https://dirty-cat.github.io/stable/generated/dirty_cat.MinHashEncoder.html#dirty_cat.MinHashEncoder), so it may be worth reimplementing that part on top of dirty_cat.
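Since similarity seems to be needed only for the Levenshtein distance, one option would be to drop the dependency and keep a small local implementation. A minimal sketch, assuming plain string inputs; the function name `levenshtein` is hypothetical and this is not the current EmbDI code:

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between a and b.

    Counts the minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b.
    """
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

This runs in O(len(a) * len(b)) time with no third-party imports. If speed matters on large tables, a compiled library would be faster, at the cost of bringing back an external dependency.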

Missing packages with sources: