intuit / fuzzy-matcher

A Java library to determine probability of objects being similar.
Apache License 2.0

ScoringFunctions cleanup #71

Closed romanblum closed 1 year ago

romanblum commented 1 year ago

There are two main issues with the scoring functions as they exist in the original project:

  1. Several scoring functions are inaccessible to calling clients. There is no way to set a custom scoring function, so there is no reason to define them.
  2. The default scoring functions are broken. They claim to be "weighted" but only apply weights to matched elements; any unmatched element is assumed to have weight 1. This can produce inaccurate results in a variety of cases, the most obvious being a match on two elements:

e1: NAME, weight: .01
e2: ADDRESS, weight: .1

If we assume two documents:

d1 with ADDRESS = "123 Main st"
d2 with ADDRESS = "123 Main st", NAME = "John Doe"

The old getWeightedAverageScore would give (.1 + .5) / (.1 + 2 - 1) = .5454
The new getWeightedAverageScore would give (.1 + .005) / (.11) = .95454

In the old calculation, the minimally weighted NAME element disproportionately pulls down the score, because its missing match is counted at weight 1 instead of .01.
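
To make the arithmetic above easier to check, here is a minimal standalone sketch of both calculations. It is not the library's code: the `Element` class, the method names `oldWeightedAverage` and `newWeightedAverage`, and the 0.5 score assigned to an unmatched element are assumptions, the last one inferred from the numbers in the example.

```java
import java.util.Arrays;
import java.util.List;

class WeightedScoreSketch {

    // Hypothetical holder: element type, configured weight, and match score (null if unmatched).
    static final class Element {
        final String type;
        final double weight;
        final Double score;

        Element(String type, double weight, Double score) {
            this.type = type;
            this.weight = weight;
            this.score = score;
        }
    }

    // Assumption: an unmatched element contributes a neutral score of 0.5,
    // as implied by the worked example.
    static final double UNMATCHED_SCORE = 0.5;

    // Old behaviour as described: weights apply only to matched elements;
    // every unmatched element is treated as if its weight were 1.
    static double oldWeightedAverage(List<Element> elements) {
        double numerator = 0.0;
        double denominator = 0.0;
        for (Element e : elements) {
            if (e.score != null) {
                numerator += e.score * e.weight;
                denominator += e.weight;
            } else {
                numerator += UNMATCHED_SCORE;
                denominator += 1.0;
            }
        }
        return numerator / denominator;
    }

    // New behaviour as described: every element, matched or not, uses its own weight.
    static double newWeightedAverage(List<Element> elements) {
        double numerator = 0.0;
        double denominator = 0.0;
        for (Element e : elements) {
            double score = (e.score != null) ? e.score : UNMATCHED_SCORE;
            numerator += score * e.weight;
            denominator += e.weight;
        }
        return numerator / denominator;
    }

    public static void main(String[] args) {
        // ADDRESS matches exactly (score 1.0); NAME has no counterpart in d1.
        List<Element> elements = Arrays.asList(
                new Element("ADDRESS", 0.1, 1.0),
                new Element("NAME", 0.01, null));

        System.out.println("old = " + oldWeightedAverage(elements)); // about 0.5454
        System.out.println("new = " + newWeightedAverage(elements)); // about 0.9545
    }
}
```

With the two elements from the example, the old formula divides by 1.1 while the new one divides by .11, which is the whole difference between the two results.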