Mpicca closed this 5 months ago
The supervised model already uses Levenshtein, Jaro, and several ratio distances from the rapidfuzz library as input features by default. See: https://github.com/ing-bank/EntityMatchingModel/blob/main/emm/features/pandas_feature_extractor.py#L101
In principle I'm okay with adding other distances as well, but the added value needs to be shown first. That matters especially because these features are strongly correlated with one another. In practice, the string-distance features we use now (on top of the cosine similarity values from the indexers) add little extra discrimination power, yet they are expensive to calculate. (fuzz.ratio() is okay, but the others are not adding much.)
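One way to check the added value of a candidate distance is to measure how strongly it correlates with a feature that is already in use. The snippet below is only a minimal sketch of that idea, not code from this repository. It uses the standard library only: a hand-written normalized Levenshtein similarity and `difflib.SequenceMatcher` stand in for the rapidfuzz features, and the name pairs, `levenshtein_similarity`, `sequence_ratio`, and `pearson` are all made up for illustration.

```python
from difflib import SequenceMatcher
from math import sqrt


def levenshtein_similarity(a: str, b: str) -> float:
    """Normalized Levenshtein similarity in [0, 1]; 1.0 means identical."""
    if not a and not b:
        return 1.0
    # Classic dynamic programming over two rows of the edit-distance table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def sequence_ratio(a: str, b: str) -> float:
    """Stand-in for a ratio-style feature (assumption: not rapidfuzz itself)."""
    return SequenceMatcher(None, a, b).ratio()


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical candidate pairs; in practice these would come from the indexers.
PAIRS = [
    ("ING Bank N.V.", "ING Bank NV"),
    ("De Nederlandsche Bank", "Nederlandsche Bank"),
    ("Apple Inc", "Apple Incorporated"),
    ("Acme Corp", "Zenith Holdings"),
    ("Globex", "Initech"),
]

lev_scores = [levenshtein_similarity(a, b) for a, b in PAIRS]
seq_scores = [sequence_ratio(a, b) for a, b in PAIRS]
corr = pearson(lev_scores, seq_scores)
print(f"Pearson correlation between the two features: {corr:.3f}")
```

A correlation close to 1 on a realistic sample of candidate pairs would suggest the new distance carries little information beyond what the existing feature already provides. A proper evaluation would still compare model performance with and without the feature.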
Btw, the class PandasFeatureExtractor is part of the supervised model pipeline, which is also used in the SparkEntityMatching pipeline (where it gets called as a pandas UDF).
Will more distance metrics be added, such as Levenshtein, Iterative Substring, Double Metaphone, etc.? I prefer to use this library because of its Spark compatibility and its speed, but I wish it had more distance metrics, like the name_matching library from DeNederlandscheBank.