intuit / fuzzy-matcher

A Java library to determine probability of objects being similar.
Apache License 2.0

Some issues with the matching #34

Closed vishaln79 closed 3 years ago

vishaln79 commented 4 years ago

Hi Manish,

This is an extremely neat tool you have developed, kudos!! I have been playing around with it and have run into a couple of issues. I was hoping you would help me resolve them.

I have been trying to match a single record against a database of records, and have been scaling the database up from 10000 to 500000 records. This is how I have been configuring the database:

new Document.Builder(csv[0])
    .addElement(new Element.Builder().setType(NAME).setValue(getName(csv)).createElement())
    .addElement(new Element.Builder().setType(ADDRESS).setValue(getAddress(csv)).createElement())
    .addElement(new Element.Builder().setType(PHONE).setValue(csv[8]).setWeight(2)
        .setThreshold(0.5).createElement())
    .addElement(new Element.Builder().setType(EMAIL).setValue(csv[10]).createElement())
    .setThreshold(0.5)
    .createDocument();

But I am seeing some anomalies.

1) I always see only one record returned, whereas I expect all records with a score greater than the 0.5 threshold. I know that in each case there are multiple records that should pass the threshold. This is how I am printing the records:

result.entrySet().forEach(entry -> {
    entry.getValue().forEach(match -> {
        System.out.println("Person searched: " + match.getData() + "\nMatched With: " + match.getMatchedWith() + " Score: " + match.getScore().getResult());
    });
});

I don't have a unique identifier for each record. This is what my CSV looks like:

"first_name","last_name","company_name","address","city","county","state","zip","phone1","phone2","email","web"

Do you think that might be the issue?

2) I am seeing something like this:

Person searched: {[{'Susan Smith'}, {'47 Venture Boulevard Somerset NJ'}, {'ssmithers@gmail.com'}, {'7324787395'}]}
Matched With: {[{'Susanna Desiga'}, {'4 W Broad St San Juan Capistrano Orange CA 92675'}, {'susanna@aol.com'}, {'949-622-6261'}]} Score: 0.8668599263800767

Given that the only thing remotely in common is the first name, I am wondering why there is such a high matching score. Whereas something like:

Person searched: {[{'Susan Smith'}, {'47 Venture Boulevard Somerset NJ'}, {'ssmithers@gmail.com'}, {'7324787395'}]}
Matched With: {[{'Susanna Smithers'}, {'47 Ventura Blvd Somerset Somerset NJ 08873'}, {'ssmithers@cox.net'}, {'732-478-7394'}]} Score: 0.7142857142857143

which is actually a better match, is only getting a score of 0.7142.

Other examples of a "bad" match with a good score:

searched: {[{'Susan Smith'}, {'47 Venture Boulevard Somerset NJ'}, {'ssmithers@gmail.com'}, {'7324787395'}]}
Matched With: {[{'Susanna Fedak'}, {'4983 Mcallister St Cambridge Middlesex MA 02138'}, {'sfedak@fedak.org'}, {'617-357-4376'}]} Score: 0.7142857142857143

searched: {[{'Susan Smith'}, {'47 Venture Boulevard Somerset NJ'}, {'ssmithers@gmail.com'}, {'7324787395'}]}
Matched With: {[{'Susanna Molavi'}, {'17389 Market St #8 Pearl City Honolulu HI 96782'}, {'smolavi@molavi.org'}, {'808-723-3110'}]} Score: 0.7142857142857143

And only these "bad" matches were returned, despite the "good" match being present in the database:

{[{'Susanna Smithers'}, {'47 Ventura Blvd Somerset Somerset NJ 08873'}, {'ssmithers@cox.net'}, {'732-478-7394'}]} Score: 0.7142857142857143

Any suggestions on how to tune the match database to get better results than this? Would be greatly appreciated!

FYI, the data I am using here is all fake.

Thanks! Vishal.

manishobhatia commented 4 years ago

Hi Vishal, I think the root cause of both issues might be the lack of a unique identifier for each record. If you are not able to pull a unique id out of the db record, I would recommend creating one while generating the Document object.

There are some examples of generating one in the unit test. https://github.com/intuit/fuzzy-matcher/blob/d2ce6f6f53a2ea5b1d628bd2fb0aec5d1d22bc5a/src/test/java/com/intuit/fuzzymatcher/component/MatchServiceTest.java#L363

The library uses this key internally in many places, and I have a hunch that it is also causing the bad matches to surface.
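
A minimal sketch of one way to create such a key is shown below. In the configuration quoted earlier the key passed to Document.Builder is csv[0], which according to the CSV header is first_name, so every "Susanna" row would share the same key. The class name RowKeys, the helper withGeneratedKeys and the "row-" prefix are made up for this illustration; the generated key would then be passed to Document.Builder in place of csv[0].

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RowKeys {
    // Assigns a synthetic key to every CSV row based on its position in the file.
    // The "row-" prefix is an arbitrary choice for this sketch.
    static Map<String, String[]> withGeneratedKeys(List<String[]> rows) {
        Map<String, String[]> keyed = new LinkedHashMap<>();
        for (int i = 0; i < rows.size(); i++) {
            keyed.put("row-" + i, rows.get(i));
        }
        return keyed;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"Susanna", "Smithers"});
        rows.add(new String[] {"Susanna", "Fedak"});
        // Both rows share csv[0] ("Susanna") but receive distinct keys.
        withGeneratedKeys(rows).forEach((key, csv) ->
                System.out.println(key + " -> " + csv[0] + " " + csv[1]));
    }
}
```

A positional key is stable only as long as the file order does not change; a random UUID per row would be an alternative when that cannot be guaranteed.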

Hope this helps.

Thanks, Manish

vishaln79 commented 4 years ago

Thanks @manishobhatia. I actually tried this method out earlier; it worked, and I saw multiple results as well. But I am still seeing some interesting results, and it may be a question of configuration. For example:

Person searched: {[{'Susanna Smith'}, {'47 Ventura Boulevard 08873'}, {'ssmith@gmail.com'}, {'7324787393'}]}
Matched With: {[{'Susanna Smithers'}, {'47 Ventura Blvd Somerset Somerset NJ 08873'}, {'ssmithers@cox.net'}, {'732-478-7394'}]} Score: 0.5357142857142857

Person searched: {[{'Susanna Smith'}, {'47 Ventura Boulevard 08873'}, {'ssmith@gmail.com'}, {'7324787393'}]}
Matched With: {[{'Susan Smith'}, {'2 88th St Somerville Somerset NJ 08876'}, {'ssmith@curb.org'}, {'908-765-1239'}]} Score: 0.5857864376269051

Since the phone number has been given a greater weight, I am surprised the second match has a higher score. What do you think might be happening? Is it because the name match is better?

Also, do you have any suggestions for making the address search a bit smarter, so that the zip code can act as a proxy for the city and state?

vishaln79 commented 4 years ago

Also, would you be able to accommodate requests to add elements for approximate age (I guess we can still use the NUMBER element for this) and gender/sex? Thanks Manish!

manishobhatia commented 3 years ago

Vishal,

On the request to add additional elements: yes, absolutely. We are always looking to enhance this with new elements. Your usage of NUMBER for age is correct, but we can add support natively in the library. My thinking is that age values can differ slightly, and a 1-year difference should still give us a strong match.

Can you elaborate on gender/sex? What kind of values are you looking to match? We try to build elements in this library that have some fuzziness in them. For boolean matches like gender, I am trying to understand where you think a fuzzy match can be useful.

Coming back to the issue you are seeing, where a record that visually seems stronger gets a lower score:

Person searched: {[{'Susanna Smith'}, {'47 Ventura Boulevard 08873'}, {'ssmith@gmail.com'}, {'7324787393'}]}
Matched With: {[{'Susanna Smithers'}, {'47 Ventura Blvd Somerset Somerset NJ 08873'}, {'ssmithers@cox.net'}, {'732-478-7394'}]} Score: 0.5357142857142857

In this record, none of the elements have a strong match.

Person searched: {[{'Susanna Smith'}, {'47 Ventura Boulevard 08873'}, {'ssmith@gmail.com'}, {'7324787393'}]}
Matched With: {[{'Susan Smith'}, {'2 88th St Somerville Somerset NJ 08876'}, {'ssmith@curb.org'}, {'908-765-1239'}]} Score: 0.5857864376269051

This record on the other hand has 2 strong matches

With regard to the zip code being a proxy for city and state: we did consider that. The problem is that understanding which city or state a zip code points to would require us to look up external repositories (like the ones maintained by the US Postal Service), which would be difficult to maintain in a standalone Java library. That said, we do have hooks to pre-process the address before the library starts fuzzy matching. Each element accepts a pre-processing Java function, where we could perform some normalization of the address.

Hope this helps.

vishaln79 commented 3 years ago

Hi Manish,

Thanks for the detailed response. My responses are inline:

On Thu, Sep 3, 2020 at 1:29 PM Manish Bhatia notifications@github.com wrote:

Vishal,

On the request to add additional elements: yes, absolutely. We are always looking to enhance this with new elements. Your usage of NUMBER for age is correct, but we can add support natively in the library. My thinking is that age values can differ slightly, and a 1-year difference should still give us a strong match.

This would be great.

Can you elaborate on gender/sex? What kind of values are you looking to match? We try to build elements in this library that have some fuzziness in them. For boolean matches like gender, I am trying to understand where you think a fuzzy match can be useful.

To give you some background on what we are trying to do: we are evaluating the tool for contact tracing de-duplication. The way we have formulated the gender question is similar to this: What is your gender? 1) Female 2) Male 3) Other, please specify. In the case of some transmissions, such as HIV, the response to the gender question becomes more ambiguous and varied, and I was wondering if your tool could help deal with that.

Coming back to the issue you are seeing, where a record that visually seems stronger gets a lower score:

Person searched: {[{'Susanna Smith'}, {'47 Ventura Boulevard 08873'}, {'ssmith@gmail.com'}, {'7324787393'}]}
Matched With: {[{'Susanna Smithers'}, {'47 Ventura Blvd Somerset Somerset NJ 08873'}, {'ssmithers@cox.net'}, {'732-478-7394'}]} Score: 0.5357142857142857

In this record, none of the elements have a strong match.

  • The name has 1 word in common
  • The address has a few words missing
  • The email again has some similarity but does not match exactly
  • For the phone number, we look for all 10 digits to be the same; in this case the last digit (3 vs 4) is a mismatch

In our case again, a contact tracer or interviewee might get a digit or two wrong. Any suggestions on how to deal with that?

Person searched: {[{'Susanna Smith'}, {'47 Ventura Boulevard 08873'}, {'ssmith@gmail.com'}, {'7324787393'}]}
Matched With: {[{'Susan Smith'}, {'2 88th St Somerville Somerset NJ 08876'}, {'ssmith@curb.org'}, {'908-765-1239'}]} Score: 0.5857864376269051

This record on the other hand has 2 strong matches

  • The Name gives an exact match, the words "Susan" and "Susanna" are considered to be same using the soundex algorithm
  • The email is also considered exact match, since we disregard the domain, when we run matches
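
The two bullets above can be checked with a small sketch. It assumes a hand-written American Soundex and a simple rule that drops everything from the @ onward in an email; neither is the library's own code, and the names NameEmailSketch, soundex and localPart are hypothetical.

```java
import java.util.Locale;

final class NameEmailSketch {
    // Soundex digit for each letter A..Z; '0' marks letters that are not coded.
    private static final String CODES = "01230120022455012623010202";

    // Minimal American Soundex: first letter plus up to three digits, zero padded.
    static String soundex(String word) {
        String letters = word.toUpperCase(Locale.ROOT).replaceAll("[^A-Z]", "");
        if (letters.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder().append(letters.charAt(0));
        char last = CODES.charAt(letters.charAt(0) - 'A');
        for (int i = 1; i < letters.length() && out.length() < 4; i++) {
            char c = letters.charAt(i);
            if (c == 'H' || c == 'W') {
                continue; // H and W do not separate letters with the same code
            }
            char code = CODES.charAt(c - 'A');
            if (code != '0' && code != last) {
                out.append(code);
            }
            last = code; // vowels reset the "previous code" so repeats are coded again
        }
        while (out.length() < 4) {
            out.append('0');
        }
        return out.toString();
    }

    // Keeps only the part of the email before the '@', ignoring the domain.
    static String localPart(String email) {
        int at = email.indexOf('@');
        return (at < 0 ? email : email.substring(0, at)).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"Susan", "Susanna", "Smith", "Smithers"}) {
            System.out.println(name + " -> " + soundex(name));
        }
        System.out.println(localPart("ssmith@gmail.com").equals(localPart("ssmith@curb.org")));
    }
}
```

Under these assumptions "Susan" and "Susanna" share a code while "Smith" and "Smithers" do not, and the two ssmith addresses compare equal once the domain is dropped, which is consistent with the explanation in the thread.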

I think I was thrown because the phone number and address were much closer in case 1) (especially phone number with a higher weight), but your explanation now makes sense.

With regard to the zip code being a proxy for city and state: we did consider that. The problem is that understanding which city or state a zip code points to would require us to look up external repositories (like the ones maintained by the US Postal Service), which would be difficult to maintain in a standalone Java library. That said, we do have hooks to pre-process the address before the library starts fuzzy matching. Each element accepts a pre-processing Java function, where we could perform some normalization of the address.

Do you have some examples on how to do that?

Hope this helps.

Absolutely. I had an additional question. Have you considered using penalties for mismatches (from what I understood, you only use additive scoring)? This way, if a phone number is missing, there is a lower penalty than for incorrect phone numbers.

Thanks, Vishal.


manishobhatia commented 3 years ago

I'll add an issue to get age support going. For gender, let me think through the problem and see if we can allow some custom matching for boolean values.

For phone numbers, there is some support already. The phone element with just a 1-digit mismatch got a 0.5 score instead of 0 in the second example. The phone element goes through a conversion that strips all non-digits and prepends the US country code, so a number like 732-478-7394 is converted to 17324787394. This makes it an 11-digit number, from which we take two 10-digit tokens, 1732478739 and 7324787394, and look for a match on either of them. In your example the same logic was applied to the other number: 7324787393 was converted to 17324787393, giving the tokens 1732478739 and 7324787393. The first token found a match, giving the whole element a 0.5 match score.

To enhance this logic, a custom tokenizer function can be applied here, which matches on 8 or 9 digits instead of 10. Example:

// Tokenize the phone number into 9-digit n-grams instead of 10-digit ones
Function<Element<String>, Stream<Token<String>>> customTokenizer = (element) -> TokenizerFunction.getNGramTokens(9, element);
Element<String> elem = new Element.Builder<String>().setType(PHONE).setValue("17324787394").setTokenizerFunction(customTokenizer).createElement();
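
To see how the n-gram size changes the outcome for the two numbers discussed above, here is a standalone sketch that does not use the library. It assumes the element score is simply the fraction of one number's tokens that also appear in the other; the class PhoneNGrams and its methods are hypothetical names for this illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class PhoneNGrams {
    // Strips non-digits and prepends the US country code to 10-digit numbers,
    // mirroring the conversion described in the thread.
    static String normalize(String phone) {
        String digits = phone.replaceAll("\\D", "");
        return digits.length() == 10 ? "1" + digits : digits;
    }

    // All contiguous substrings of the given size, in order of appearance.
    static Set<String> nGrams(String value, int size) {
        Set<String> tokens = new LinkedHashSet<>();
        for (int i = 0; i + size <= value.length(); i++) {
            tokens.add(value.substring(i, i + size));
        }
        return tokens;
    }

    // Assumed scoring rule: share of the first number's tokens found in the second.
    static double score(String a, String b, int size) {
        Set<String> left = nGrams(normalize(a), size);
        Set<String> right = nGrams(normalize(b), size);
        if (left.isEmpty()) {
            return 0.0;
        }
        long shared = left.stream().filter(right::contains).count();
        return (double) shared / left.size();
    }

    public static void main(String[] args) {
        for (int size : new int[] {10, 9, 8}) {
            System.out.println(size + "-grams: " + score("7324787393", "732-478-7394", size));
        }
    }
}
```

With 10-digit tokens one of two tokens is shared; with 9-digit tokens it is two of three, and with 8-digit tokens three of four, so smaller tokens tolerate a single wrong digit better at the cost of more accidental matches.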

Along similar lines, for the address field you can write a custom pre-processing function. Instead of a simple lambda like the one above, you could write something more elaborate that calls an API backed by US Postal Service zip code data and feeds a normalized address to the library.
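
As a rough sketch of what such a pre-processing function might look like, the example below normalizes an address with two small in-memory tables. The tables, the class AddressNormalizer and the field NORMALIZE are all hypothetical; a real version would replace the zip map with a proper postal data source, and the resulting Function<String, String> would be handed to the element's pre-processing hook.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class AddressNormalizer {
    // Hypothetical lookup tables, hard-coded only for this illustration.
    private static final Map<String, String> ABBREVIATIONS = new HashMap<>();
    private static final Map<String, String> ZIP_TO_CITY_STATE = new HashMap<>();

    static {
        ABBREVIATIONS.put("blvd", "boulevard");
        ABBREVIATIONS.put("st", "street");
        ZIP_TO_CITY_STATE.put("08873", "somerset nj");
    }

    // Lower-cases, expands abbreviations, drops repeated words, and adds the
    // city and state implied by a known zip code when they are missing.
    static final Function<String, String> NORMALIZE = raw -> {
        Set<String> tokens = new LinkedHashSet<>();
        for (String word : raw.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            tokens.add(ABBREVIATIONS.getOrDefault(word, word));
        }
        // Iterate over a copy so the set can grow while zip codes are resolved.
        for (String token : new ArrayList<>(tokens)) {
            String cityState = ZIP_TO_CITY_STATE.get(token);
            if (cityState != null) {
                tokens.addAll(Arrays.asList(cityState.split(" ")));
            }
        }
        return String.join(" ", tokens);
    };

    public static void main(String[] args) {
        System.out.println(NORMALIZE.apply("47 Ventura Boulevard 08873"));
        System.out.println(NORMALIZE.apply("47 Ventura Blvd Somerset Somerset NJ 08873"));
    }
}
```

After normalization both addresses from the thread contain the same set of words, differing only in order, so a word-based comparison would treat them as equivalent.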

As for the scoring you mentioned, the library tries not to punish results for a lack of data. Elements that do not match get a 0 score, while a missing element gets a default score of 0.5. So, in a way, the average score at the document level is punished for incorrect matches, since a 0 score pulls it down. Let me know if you do not see this in your examples.
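
The effect can be illustrated with a deliberately simplified sketch that uses a plain weighted average. This is an assumption made for illustration only; the library's actual document-level aggregation may well differ. The class MissingVersusMismatch, the method documentScore and the sample element scores are hypothetical.

```java
final class MissingVersusMismatch {
    // Default score for a missing element, as stated in the thread.
    static final double MISSING_SCORE = 0.5;

    // Weighted average of element scores; a null entry means the element is missing.
    static double documentScore(Double[] scores, double[] weights) {
        double weighted = 0.0;
        double total = 0.0;
        for (int i = 0; i < scores.length; i++) {
            double score = scores[i] == null ? MISSING_SCORE : scores[i];
            weighted += score * weights[i];
            total += weights[i];
        }
        return total == 0.0 ? 0.0 : weighted / total;
    }

    public static void main(String[] args) {
        // Order: name, address, phone, email; phone carries weight 2.
        double[] weights = {1, 1, 2, 1};
        Double[] phoneMissing = {1.0, 0.5, null, 1.0};
        Double[] phoneWrong = {1.0, 0.5, 0.0, 1.0};
        System.out.println("phone missing: " + documentScore(phoneMissing, weights));
        System.out.println("phone wrong:   " + documentScore(phoneWrong, weights));
    }
}
```

In this simplified model the record with a missing phone number ends up with a higher document score than the one with a wrong phone number, which is the asymmetry described above.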

Thanks

manishobhatia commented 3 years ago

Closing this issue. We released support for the AGE ElementType in 1.0.4. Feel free to open a new issue if there are more questions.