intuit / fuzzy-matcher

A Java library to determine probability of objects being similar.
Apache License 2.0
226 stars, 69 forks

#35 fix DateMatch with NeighborhoodRange greater than 0.91 failing #39

Closed aavaas closed 3 years ago

aavaas commented 3 years ago

Using a scale factor of 1.1 for dates causes neighborhood_range values of 0.91 and above to produce final percentage values greater than 1, which is invalid (0.91 * 1.1 = 1.001). It is also counterintuitive: a neighborhood_range closer to 1.0 should mean extremely close to equality, but 0.90 is currently translated to 0.99 by the scaling factor, so 0.9 is effectively the maximum valid value for the date type.
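To make the arithmetic concrete, here is a minimal standalone sketch of the overflow. It is not the library's code: the class and method names are hypothetical, and only the 1.1 factor and the DATE_SCALE_FACTOR name come from this PR.

```java
// Hypothetical demo class; not part of fuzzy-matcher.
class DateScaleFactorDemo {
    // Scale factor described in this PR.
    static final double DATE_SCALE_FACTOR = 1.1;

    // Applies the scale factor to a user-supplied neighborhood range.
    static double scaled(double neighborhoodRange) {
        return neighborhoodRange * DATE_SCALE_FACTOR;
    }

    // A percentage is only meaningful inside [0, 1].
    static boolean isValidPercentage(double value) {
        return value >= 0.0 && value <= 1.0;
    }

    public static void main(String[] args) {
        for (double range : new double[] {0.90, 0.91, 0.95}) {
            double result = scaled(range);
            System.out.printf("%.2f -> %.3f valid=%b%n",
                    range, result, isValidPercentage(result));
        }
    }
}
```

With this sketch, 0.90 stays just inside the valid interval, while 0.91 and 0.95 both scale to values above 1.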

Solution: a) Remove DATE_SCALE_FACTOR, as it is not intuitive and its effect can be achieved by the end user with a higher neighborhood_range.

b) The default neighborhood_range of 0.9 for dates corresponds to around 1800 days (5 years), which is probably too much for day-to-day matching, so I changed the default to 0.99. This also maintains the current behavior for dates, since 0.9 was scaled by 1.1 to 0.99, so existing tests pass without changes.
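The roughly five-year figure above can be reproduced with a short sketch. It assumes the tolerance window is the fraction (1 - neighborhood_range) of the date's epoch-millisecond value; that formula is an assumption made for illustration, not taken from the library source, and all names below are hypothetical.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;

// Hypothetical demo class; not part of fuzzy-matcher.
class DateWindowDemo {
    static final double MILLIS_PER_DAY = 86_400_000.0;

    // Epoch milliseconds at the start of the given UTC date.
    static long epochMillis(int year, int month, int day) {
        return LocalDate.of(year, month, day)
                .atStartOfDay(ZoneOffset.UTC)
                .toInstant()
                .toEpochMilli();
    }

    // Assumption: the window is (1 - neighborhoodRange) of the epoch-millis value.
    static double windowDays(long epochMillis, double neighborhoodRange) {
        return epochMillis * (1.0 - neighborhoodRange) / MILLIS_PER_DAY;
    }

    public static void main(String[] args) {
        long date = epochMillis(2020, 1, 1);
        System.out.printf("0.90 -> %.1f days%n", windowDays(date, 0.90));
        System.out.printf("0.99 -> %.1f days%n", windowDays(date, 0.99));
    }
}
```

Under that assumption, a date in early 2020 gets a window of about five years at 0.9 and about half a year at 0.99.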

manishobhatia commented 3 years ago

Hi @aavaas, thanks for taking the time for this pull request.

I understand that DATE_SCALE_FACTOR is not intuitive, but the reason for it is that the end user should not have to tune different values for different data types. With this solution, the NUMBER type will have one scale and DATE will have its own. As we keep adding more data types, the end user will have to be intimately familiar with these differences to use the library effectively.

Is there another way to solve this where we can maintain a similar scale for NUMBER, DATE and other such data types?

aavaas commented 3 years ago

Hi @manishobhatia, I've incorporated the ranged percentage used by the Age datatype into the Date datatype. I've retroactively set the range to 15777e7D, so that the current default neighborhood range of 0.9 maintains the same result for dates around 2020. Luckily, that value results in an effective range of 5 years for dates, which I believe should be enough.
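As a rough check of the numbers, the sketch below converts the 15777e7D range to days and derives a window from it. It assumes the range is in milliseconds and that the window is range * (1 - neighborhood_range); both are assumptions for illustration, and the class and method names are hypothetical.

```java
// Hypothetical demo class; not part of fuzzy-matcher.
class RangedDateWindowDemo {
    // Range value quoted in this comment; assumed to be in milliseconds.
    static final double DATE_RANGE_MILLIS = 15777e7D;
    static final double MILLIS_PER_DAY = 86_400_000.0;

    // The full fixed range expressed in days.
    static double rangeDays() {
        return DATE_RANGE_MILLIS / MILLIS_PER_DAY;
    }

    // Assumption: the window is a fixed range times (1 - neighborhoodRange),
    // independent of the date value itself.
    static double windowDays(double neighborhoodRange) {
        return DATE_RANGE_MILLIS * (1.0 - neighborhoodRange) / MILLIS_PER_DAY;
    }

    public static void main(String[] args) {
        System.out.printf("range -> %.1f days%n", rangeDays());
        System.out.printf("0.90 -> %.1f days%n", windowDays(0.90));
    }
}
```

Because the range is a constant rather than a fraction of the date value, the window under this sketch no longer grows as dates move further from the epoch, which keeps DATE on the same kind of scale as the other ranged types.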

Let me know if this looks good. Thx!