[Question] Additional Distance Metrics

Levenshtein, Jaro, and several ratio distances from the rapidfuzz library are already used by default as input features of the supervised model. See: https://github.com/ing-bank/EntityMatchingModel/blob/main/emm/features/pandas_feature_extractor.py#L101

In principle I'm okay with adding other distances as well, but the added value needs to be shown first, especially because these features are strongly correlated with each other. As it turns out, in practice the string-distance features we use now (on top of the cosine similarity values from the indexers) do not add much extra discriminating power, yet they are expensive to calculate. (fuzz.ratio() is okay, but the others are not adding much.)

Btw, the class PandasFeatureExtractor is part of the supervised model pipeline, which is also used in the SparkEntityMatching pipeline (where it gets called as a pandas udf).
