intuit / fuzzy-matcher

A Java library to determine probability of objects being similar.
Apache License 2.0
226 stars 69 forks

Support for configuring abbreviated names #37

Closed chrisco484 closed 3 years ago

chrisco484 commented 3 years ago

We have a situation where common abbreviations of names do not match the proper name, and we are not sure whether that is supported.

For example:

'Barney' - an abbreviation of 'Barnaby' will not match with 'Barnaby'.

If this is not currently supported, I was wondering whether it is worth considering. An app could supply a map of abbreviations to known names. Then, if all other matching fails and any 'NAME' type components of the match document contain abbreviations from the map, the proper name could be substituted and the match reattempted.

manishobhatia commented 3 years ago

Hi Christopher,

This library uses Apache Soundex internally to find matches in names.

If you have names that do not conform to Soundex matching, we could override it with a different algorithm. For example, the EMAIL type breaks words down into tri-grams and looks for similar grams. This can easily be overridden for the NAME type.

Unfortunately, supplying a dictionary of known matching names is not currently supported. It might also run into performance issues, because each word would have to go through this check, which could cause quadratic performance characteristics.
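To make the tri-gram idea concrete, here is a minimal sketch in plain Java. It is not the library's code: the class `TriGramSketch`, its method names, and the choice of Jaccard overlap as the score are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not part of fuzzy-matcher.
final class TriGramSketch {

    // Breaks a value into its overlapping 3-character grams (lowercased).
    static Set<String> triGrams(String value) {
        String s = value.toLowerCase(Locale.ROOT);
        Set<String> grams = new HashSet<>();
        if (s.length() < 3) {
            if (!s.isEmpty()) {
                grams.add(s); // short values are kept as a single gram
            }
            return grams;
        }
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }

    // Assumed scoring: shared grams divided by all distinct grams (Jaccard).
    static double similarity(String a, String b) {
        Set<String> shared = new HashSet<>(triGrams(a));
        shared.retainAll(triGrams(b));
        Set<String> all = new HashSet<>(triGrams(a));
        all.addAll(triGrams(b));
        return all.isEmpty() ? 0.0 : (double) shared.size() / all.size();
    }

    public static void main(String[] args) {
        System.out.println(triGrams("Barney"));
        System.out.println(triGrams("Barnaby"));
        System.out.println(similarity("Barney", "Barnaby"));
    }
}
```

With this scoring, 'Barney' and 'Barnaby' share the grams "bar" and "arn" out of seven distinct grams, so they overlap partially instead of failing outright. Whether that is enough to count as a match depends on whatever threshold is applied.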

Hope this helps

Thanks, Manish


chrisco484 commented 3 years ago

> If you have names that do not conform to soundex matching algorithms

Yes, many nicknames are often abbreviations or shortenings of the full name and therefore don't trigger a soundex match.
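A small sketch shows why this happens for the 'Barney' / 'Barnaby' pair. The class below is a simplified American Soundex written for illustration (the name `SoundexSketch` is hypothetical). It is not the Apache implementation or the library's code, and it assumes names are compared by the equality of their codes.

```java
import java.util.Locale;

// Hypothetical, simplified American Soundex for illustration only.
final class SoundexSketch {

    // Soundex digit for each letter A..Z; '0' marks letters that are not coded.
    private static final String CODES = "01230120022455012623010202";

    static String encode(String name) {
        String s = name.toUpperCase(Locale.ROOT).replaceAll("[^A-Z]", "");
        if (s.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder().append(s.charAt(0));
        char last = CODES.charAt(s.charAt(0) - 'A');
        for (int i = 1; i < s.length() && out.length() < 4; i++) {
            char c = s.charAt(i);
            if (c == 'H' || c == 'W') {
                continue; // H and W do not separate letters with the same code
            }
            char code = CODES.charAt(c - 'A');
            if (code != '0' && code != last) {
                out.append(code);
            }
            last = code; // a vowel resets 'last', so a repeated code is kept
        }
        while (out.length() < 4) {
            out.append('0'); // pad to one letter plus three digits
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("Barney"));  // B650
        System.out.println(encode("Barnaby")); // B651
    }
}
```

The extra 'b' in 'Barnaby' contributes a digit that 'Barney' lacks, so the two codes differ in the last position and an equality check on the codes fails.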

> Unfortunately giving a dictionary of known names that match is not supported currently. And it might run into performance issues where each word will have to do this and cause a quadratic performance characteristic

Ah yes, I can understand how that would affect performance.

In our immediate scenario the list of candidate names is typically <10, but I can see how performance would be an issue for lists with thousands of candidate names.

For our case I guess we can do this within our app rather easily. We would keep a bidirectional mapping of real names to nicknames. On a failed initial match, we check whether any name components have variants in the map and, if so, substitute the variant and retry the matching.
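A rough sketch of that app-side approach might look like the following. Everything here is hypothetical: the class `NicknameRetry`, its methods, and the `BiPredicate` parameter, which stands in for whatever call the app makes into the matching library. It also assumes the matcher ignores case, since variants are stored lowercased.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.BiPredicate;

// Hypothetical app-side helper; not part of fuzzy-matcher.
final class NicknameRetry {

    // Variants are stored in both directions, e.g. barney <-> barnaby.
    private final Map<String, Set<String>> variants = new HashMap<>();

    void addPair(String properName, String nickname) {
        link(properName, nickname);
        link(nickname, properName);
    }

    private void link(String from, String to) {
        variants.computeIfAbsent(key(from), k -> new HashSet<>()).add(key(to));
    }

    private static String key(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    // Every rewrite of fullName with exactly one word swapped for a known variant.
    List<String> rewrites(String fullName) {
        String[] words = fullName.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            Set<String> known = variants.getOrDefault(key(words[i]), Collections.emptySet());
            for (String variant : known) {
                String[] copy = words.clone();
                copy[i] = variant;
                result.add(String.join(" ", copy));
            }
        }
        return result;
    }

    // Tries the matcher as-is first; retries with substitutions only on failure.
    boolean matches(String candidate, String target, BiPredicate<String, String> matcher) {
        if (matcher.test(candidate, target)) {
            return true;
        }
        for (String rewritten : rewrites(candidate)) {
            if (matcher.test(rewritten, target)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        NicknameRetry retry = new NicknameRetry();
        retry.addPair("Barnaby", "Barney");
        // equalsIgnoreCase is only a stand-in for the real fuzzy match call.
        System.out.println(retry.matches("Barney Jones", "Barnaby Jones", String::equalsIgnoreCase));
    }
}
```

Because the retry only runs after a failed first attempt and only for words present in the map, the extra cost stays small for short candidate lists like ours.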

BTW thanks for creating such an excellent library - name matching without fuzziness is as painful as teeth pulling! :)

manishobhatia commented 3 years ago

Closing the issue. Feel free to open a new one if you think the issue was not resolved, or if you would like to see some enhancement to the library.