intuit / fuzzy-matcher

A Java library to determine probability of objects being similar.
Apache License 2.0

Nicknames / different name spellings #45

Closed antongray closed 3 years ago

antongray commented 3 years ago

In using your software to match people's names, I'm wondering how best to handle nicknames (e.g., Jim and James, Bill and William) and different spellings of the same name that are phonetically identical (e.g., Chris / Kris). I'd actually expected Kris / Chris to match with Soundex, but apparently they don't. Ideally Bill Smith would be a match for William Smith, and vice versa. I could handle this by updating the PreProcessing or Tokenizer function, but I don't want to reinvent the wheel if you already have a better way of handling this, or plan to implement something soon. Thanks, Anton.

manishobhatia commented 3 years ago

Hi Anton,

The Soundex match does work for most phonetically similar names. You should see names like Smith / Smythe or John / Jon match fine. Unfortunately, I don't see Soundex supporting the examples you gave, and I am not sure whether there is a better library that does. If you do find one, there is an option to override the Tokenizer function and make use of it.
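
To make the Chris / Kris case concrete, here is a minimal sketch of classic American Soundex in plain Java. It is a hand-rolled illustration, not the code fuzzy-matcher itself runs, and the class and method names (`SoundexDemo`, `soundex`, `code`) are made up for this example. It shows that Soundex always keeps the first letter, so two names that sound alike but start with different letters get different codes.

```java
import java.util.Locale;

class SoundexDemo {
    // Maps a letter to its Soundex digit; '0' marks vowels and other uncoded letters.
    static char code(char c) {
        switch (c) {
            case 'B': case 'F': case 'P': case 'V': return '1';
            case 'C': case 'G': case 'J': case 'K':
            case 'Q': case 'S': case 'X': case 'Z': return '2';
            case 'D': case 'T': return '3';
            case 'L': return '4';
            case 'M': case 'N': return '5';
            case 'R': return '6';
            default: return '0';
        }
    }

    // Classic four-character Soundex: first letter plus up to three digits, zero-padded.
    static String soundex(String name) {
        String s = name.toUpperCase(Locale.ROOT).replaceAll("[^A-Z]", "");
        if (s.isEmpty()) return "";
        StringBuilder out = new StringBuilder().append(s.charAt(0));
        char prev = code(s.charAt(0));
        for (int i = 1; i < s.length() && out.length() < 4; i++) {
            char c = s.charAt(i);
            if (c == 'H' || c == 'W') continue;        // H and W do not separate equal codes
            char d = code(c);
            if (d != '0' && d != prev) out.append(d);  // skip vowels and repeated codes
            prev = d;                                  // a vowel resets the previous code
        }
        while (out.length() < 4) out.append('0');
        return out.toString();
    }

    public static void main(String[] args) {
        String[] names = {"Smith", "Smythe", "John", "Jon", "Chris", "Kris", "Bill", "William"};
        for (String n : names) {
            System.out.println(n + " -> " + soundex(n));
        }
    }
}
```

Smith and Smythe both come out as S530 and John and Jon as J500, while Chris is C620 and Kris is K620: the digits agree, but the leading letter differs. Bill (B400) and William (W450) are not phonetically close at all, so no phonetic encoder would be expected to pair them.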

Another option is to make use of a PreProcessing dictionary for names. There is a mapping file (name-dictionary, https://github.com/intuit/fuzzy-matcher/blob/master/src/main/resources/address-dictionary.txt) within the library, which is mainly concerned with removing prefixes, postfixes and salutations from a name. This can be repurposed to provide the mapping for such nicknames.

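Here is a test example (https://github.com/intuit/fuzzy-matcher/blob/97a2fc37c78d31879451e9953c484cd3e0a8dce2/src/test/java/com/intuit/fuzzymatcher/component/MatchServiceTest.java#L464) that provides this mapping externally.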
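
As a library-independent illustration of the dictionary idea, the sketch below builds a `Function<String, String>` that replaces known nicknames with a canonical form before any matching happens. The class name `NicknameNormalizer`, the method `normalizer` and the three sample mappings are assumptions made for this example; how such a function is wired into fuzzy-matcher's pre-processing depends on the library version you use.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class NicknameNormalizer {
    // Returns a function that lower-cases a name and swaps each token
    // found in the dictionary for its canonical form.
    static Function<String, String> normalizer(Map<String, String> nicknames) {
        return name -> Arrays.stream(name.trim().toLowerCase(Locale.ROOT).split("\\s+"))
                .map(token -> nicknames.getOrDefault(token, token))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Hypothetical sample dictionary; a real one would be loaded from a file.
        Map<String, String> nicknames = new HashMap<>();
        nicknames.put("bill", "william");
        nicknames.put("jim", "james");
        nicknames.put("kris", "chris");

        Function<String, String> normalize = normalizer(nicknames);
        System.out.println(normalize.apply("Bill Smith"));    // william smith
        System.out.println(normalize.apply("William Smith")); // william smith
    }
}
```

A dictionary like this cuts both ways: a short form can belong to more than one full name, so a one-to-one mapping may create false matches as well as fix missed ones.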

Beyond this, in my experience, if you have other attributes of a person, like addresses, phone numbers or emails, this discrepancy in the first name should not significantly impact the overall similarity score.
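
A toy calculation shows why. It assumes an overall score that is a plain weighted average of per-attribute scores, which is a simplification for illustration and not necessarily how fuzzy-matcher computes its result; the class name, weights and sample scores are all made up.

```java
class OverallScoreDemo {
    // Weighted average of per-attribute similarity scores in [0, 1].
    static double weightedScore(double[] scores, double[] weights) {
        if (scores.length != weights.length || scores.length == 0) {
            throw new IllegalArgumentException("scores and weights must be non-empty and aligned");
        }
        double total = 0.0;
        double weightSum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            total += scores[i] * weights[i];
            weightSum += weights[i];
        }
        return total / weightSum;
    }

    public static void main(String[] args) {
        // Order: name, address, phone, email. "Bill Smith" vs "William Smith"
        // matches on the surname only, so the name attribute scores 0.5 here.
        double[] scores = {0.5, 1.0, 1.0, 1.0};
        double[] weights = {1.0, 1.0, 1.0, 1.0};
        System.out.println(weightedScore(scores, weights)); // 0.875
    }
}
```

With three other attributes agreeing, a half-matched name still leaves the combined score at 0.875, which would clear most reasonable thresholds.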

Hope that helps

Thanks, Manish

antongray commented 3 years ago

Hi Manish,

Thanks for getting back to me so quickly. After I raised this issue and had more of a think about it, I basically realized all the things that you pointed out to me - that I could use the names dictionary, for example, but that this might raise more issues than it solves, and I should just rely on the overall data set matching rather than obsess on the first name.

Overall the library seems to work really well for me, so I think I'd just close my issue if I were you and I'll work with what I have.

Thanks, Anton.

antongray commented 3 years ago

Hi Manish,

I have a quick follow-up question for you. We mostly work in .NET, and I've been asked about running your library in a .NET solution. I've looked at IKVM for this, to convert the jar to a DLL, but this throws errors at me. After investigating, it looks like IKVM only works up to Java 1.7, and fuzzy-matcher is built with 1.8. Is this correct? Do you know if I could rebuild the source for 1.7? Do you have any other suggestions?

Thanks, Anton

manishobhatia commented 3 years ago

Hi Anton,

Having the library support 1.7 will be a little tricky. Most of the code uses the functional paradigm introduced in Java 1.8, as well as stream processing, which allows parallel processing for large datasets.

I do know of a few implementations where this library is being used with .NET. I believe they used a JNI bridge to interact with the interfaces from fuzzy-matcher.

If you have a JVM environment to run the jar, you only need to create proxies for the few classes that this library exposes publicly: MatchService, which is the entry point, and most of the classes in domain, which you will need to send the input and receive the results.

These are simple Java classes without any Java 1.8 features in them.
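
One way to keep the number of proxies small is to put a thin facade in front of the matcher whose signature uses only strings, arrays and doubles, so the bridge has very little to marshal. The sketch below is only an illustration of that shape: `MatchFacade` and `MatchRow` are hypothetical names, and the injected `scorer` is a stand-in for whatever call into fuzzy-matcher you would actually make.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// One matched pair, expressed with bridge-friendly types only.
final class MatchRow {
    final String leftId;
    final String rightId;
    final double score;

    MatchRow(String leftId, String rightId, double score) {
        this.leftId = leftId;
        this.rightId = rightId;
        this.score = score;
    }
}

// Hypothetical facade: the only class the .NET side would need a proxy for.
final class MatchFacade {
    // Stand-in for the real matching call; NOT part of fuzzy-matcher.
    private final BiFunction<String, String, Double> scorer;

    MatchFacade(BiFunction<String, String, Double> scorer) {
        this.scorer = scorer;
    }

    // Compares every pair of records and returns those at or above the threshold.
    List<MatchRow> match(String[] ids, String[] names, double threshold) {
        if (ids.length != names.length) {
            throw new IllegalArgumentException("ids and names must have the same length");
        }
        List<MatchRow> rows = new ArrayList<>();
        for (int i = 0; i < ids.length; i++) {
            for (int j = i + 1; j < ids.length; j++) {
                double score = scorer.apply(names[i], names[j]);
                if (score >= threshold) {
                    rows.add(new MatchRow(ids[i], ids[j], score));
                }
            }
        }
        return rows;
    }
}
```

The lambda passed to the constructor is the one place that uses Java 8 syntax, and it stays on the JVM side; the caller across the bridge only ever sees `match` and `MatchRow`.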

Thanks, Manish

antongray commented 3 years ago

Thanks!

manishobhatia commented 3 years ago

Closing the issue. Feel free to open a new one if there are still questions.