ing-bank / EntityMatchingModel

Entity Matching Model solves the problem of matching company names between two possibly very large datasets.
https://entitymatchingmodel.readthedocs.io/en/latest/
MIT License

[Question] Additional Distance Metrics #15

Closed (Mpicca closed this 5 months ago)

Mpicca commented 6 months ago

Will more distance metrics be added, such as Levenshtein, Iterative Substring, Double Metaphone, etc.? I prefer to use this library because of its Spark compatibility and its speed, but I wish it had more distance metrics, like the name_matching library from DeNederlandscheBank.

mbaak commented 6 months ago

Levenshtein, Jaro and multiple ratio distances of the rapidfuzz library are already used by default as input features by the supervised model. See: https://github.com/ing-bank/EntityMatchingModel/blob/main/emm/features/pandas_feature_extractor.py#L101

In principle I'm okay with adding other distances as well, but the added value needs to be shown first. That matters especially because these features are strongly correlated with each other, and, as it turns out, in practice the string-distance features we use now (on top of the cosine similarity values from the indexers) do not add a lot of extra discrimination power but are expensive to calculate. (fuzz.ratio() is okay, but the others are not adding much.)
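
One rough way to look at the correlation point before proposing a new metric is to compute the candidate feature next to an existing one on a sample of name pairs and compare them. The sketch below is only an illustration under assumptions: it uses standard-library stand-ins (a plain dynamic-programming Levenshtein similarity and difflib's `SequenceMatcher.ratio()` in place of the rapidfuzz functions), the helper names are hypothetical, and the name pairs are invented. A real check would use labelled pairs and the model's actual features.

```python
from difflib import SequenceMatcher
from math import sqrt


def levenshtein_similarity(a: str, b: str) -> float:
    """Normalized Levenshtein similarity in [0, 1], via dynamic programming."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def sequence_ratio(a: str, b: str) -> float:
    """difflib similarity ratio in [0, 1]; a stand-in, not fuzz.ratio() itself."""
    return SequenceMatcher(None, a, b).ratio()


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented name pairs, for illustration only.
pairs = [
    ("ING Bank N.V.", "ING Bank NV"),
    ("De Nederlandsche Bank", "Nederlandsche Bank"),
    ("Apple Inc", "Zurich Insurance Group"),
    ("Acme Holding BV", "Acme Holdings B.V."),
    ("Global Trading Ltd", "Local Bakery"),
]

lev = [levenshtein_similarity(a, b) for a, b in pairs]
rat = [sequence_ratio(a, b) for a, b in pairs]
r = pearson(lev, rat)
print(f"Pearson correlation between the two features: {r:.2f}")
```

A correlation close to 1 would suggest the second feature carries little information beyond the first. Correlation alone does not settle the question of added value, though; comparing model performance with and without the feature would be the more direct test.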

Btw, the class PandasFeatureExtractor is part of the supervised model pipeline, which is also used in the SparkEntityMatching pipeline (where it gets called as a pandas UDF).