intuit / fuzzy-matcher

A Java library to determine probability of objects being similar.
Apache License 2.0

Fuzzy matching issue : only fetching the exact match #52

Closed deepak-Jenkins closed 2 years ago

deepak-Jenkins commented 3 years ago

I am testing with the code below. Here FuzzyTitle is a user-defined class.

FuzzyTitle docSearch = new FuzzyTitle("1", "Match", "Max Studio", "Produced");
FuzzyTitle docAvailable1 = new FuzzyTitle("2", "Match", "Dream Work", "Produced");
FuzzyTitle docAvailable2 = new FuzzyTitle("3", "Match Rain", "Dream Work", "Released");

Document searchDocument = new Document.Builder(docSearch.getCounter())
        .addElement(new Element.Builder().setType(ElementType.TEXT)
                .setValue(docSearch.getTitle())
                .createElement())
        .createDocument();

Document searchDocAvailable1 = new Document.Builder(docAvailable1.getCounter())
        .addElement(new Element.Builder().setType(ElementType.TEXT)
                .setValue(docAvailable1.getTitle())
                .setWeight(.70D).createElement())
        .createDocument();

Document searchDocAvailable2 = new Document.Builder(docAvailable2.getCounter())
        .addElement(new Element.Builder().setType(ElementType.TEXT)
                .setValue(docAvailable2.getTitle())
                .setWeight(.70D).createElement())
        .createDocument();

List<Document> documentList = new ArrayList<>();
documentList.add(searchDocAvailable1);
documentList.add(searchDocAvailable2);

MatchService matchService = new MatchService();
Map<Document, List<Match>> map;
map = matchService.applyMatch(searchDocument, documentList);
System.out.println("Match : " + map);

Output : Match : {{[{'Match'}]}=[Match{data={[{'Match'}]}, matchedWith={[{'Match'}]}, score=1.0}]}

Query :

  1. Ideally, documents 1 and 3 should also match, since their titles are similar, but only the exact match is returned. Am I doing something wrong here?
  2. I want to distribute the weight as 70% to the 2nd parameter, 20% to the 3rd and 10% to the 4th. Should I set the weights in all documents or only in the available documents?
  3. In this case all parameters are of type ElementType.TEXT. Could giving every element the same TEXT type produce wrong results?

I have kept the example simple. I am clearly doing something wrong, so I would appreciate your help and clarification.

manishobhatia commented 3 years ago

Hi,

  1. The score for each element looks at unique words: the number of common words divided by the number of all unique words. In the case of Match and Match Rain we have 2 unique words with 1 in common, so the result is 0.5. By default, a document is considered a match only when its score is greater than 0.5. This can be fixed by lowering the document threshold to 0.49.
  2. The weight distribution can be applied to each element of a document, but it does need to be consistent across all the documents. It's easier to generate the Document objects in a separate utility, so that they stay consistent.
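To make the arithmetic in point 1 concrete, here is a minimal sketch in plain Java. It is not the library's code: the class `WordOverlapSketch` and its methods are hypothetical, and it assumes simple whitespace splitting and lower-casing, whereas the library may apply additional preprocessing before comparing words.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a plain word-overlap score, not fuzzy-matcher's implementation.
final class WordOverlapSketch {

    // Assumption: words are separated by whitespace and compared case-insensitively.
    static Set<String> words(String value) {
        String normalized = value.trim().toLowerCase(Locale.ROOT);
        return new HashSet<>(Arrays.asList(normalized.split("\\s+")));
    }

    // Common unique words divided by all unique words across both values.
    static double score(String left, String right) {
        Set<String> a = words(left);
        Set<String> b = words(right);
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        Set<String> all = new HashSet<>(a);
        all.addAll(b);
        return (double) common.size() / all.size();
    }

    // A match is kept only when the score is strictly greater than the threshold.
    static boolean isMatch(double score, double threshold) {
        return score > threshold;
    }

    public static void main(String[] args) {
        double s = score("Match", "Match Rain");
        System.out.println("score = " + s);
        System.out.println("threshold 0.5  -> " + isMatch(s, 0.5));
        System.out.println("threshold 0.49 -> " + isMatch(s, 0.49));
    }
}
```

With a score of exactly 0.5, the strict comparison fails at a threshold of 0.5 and passes at 0.49, which is why lowering the threshold slightly is enough here.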

Here is an example of your code, refactored to make it easier. I just converted FuzzyTitle to String[]; you can convert it back.

private Document generateDocs(String[] stringDoc){
        Document document = new Document.Builder(stringDoc[0])
                .addElement(new Element.Builder().setType(ElementType.TEXT)
                        .setValue(stringDoc[1])
                        .setWeight(.70D).createElement())
                .setThreshold(0.49)
                .createDocument();
        return document;
    }

String[] docSearch = {"1", "Match", "Max Studio", "Produced"};
String[] docAvailable1 = {"2", "Match", "Dream Work", "Produced"};
String[] docAvailable2 = {"3", "Match Rain", "Dream Work", "Released"};

Document searchDocument = generateDocs(docSearch);

Document searchDocAvailable1 = generateDocs(docAvailable1);

Document searchDocAvailable2 = generateDocs(docAvailable2);

List<Document> documentList = new ArrayList<>();
documentList.add(searchDocAvailable1);
documentList.add(searchDocAvailable2);

MatchService matchService = new MatchService();
Map<Document, List<Match>> map;
map = matchService.applyMatch(searchDocument, documentList);
System.out.println("Match : " + map);

  1. Now that you have a separate utility that is applied to all documents, you can add weights to each element and keep them consistent across all documents.

Here is an example

private Document generateDocs(String[] stringDoc){
        Document document = new Document.Builder(stringDoc[0])
                .addElement(new Element.Builder().setType(ElementType.TEXT)
                        .setValue(stringDoc[1])
                        .setWeight(.70D).createElement())
                .addElement(new Element.Builder().setType(ElementType.NAME)
                        .setValue(stringDoc[2])
                        .setWeight(.20D).createElement())
                .addElement(new Element.Builder().setType(ElementType.TEXT)
                        .setValue(stringDoc[3])
                        .setWeight(.10D).createElement())
                .setThreshold(0.49)
                .createDocument();
        return document;
    }

Now that multiple elements with different weights are in use, the scores and matches will differ, so you can adjust the threshold and see what works best.

Hope this helps

deepak-Jenkins commented 3 years ago

Dear Manish,

Thank you for your kind response.

Excellent! As you correctly pointed out, I was missing the threshold setting. I must say, this is a great tool for calculating scores while comparing.

Here are a few things I observed when testing with different data and cases. I would be thankful if you could help me with your expertise.

  1. Comparing 'Stop' with 'Stop(2002)' or 'Stops' should return some score. I tested with setType NAME and TEXT, and in both cases it returns a score of 0. What should I change to get a score?

  2. For comparing numbers, I tested '1234' and '12345' with setType NUMBER, and it gives zero matching. How can I improve the result in this case?

  3. For multiple parameters with weighted values, I used the suggested code.

    private Document generateFuzzyDocs(FuzzyTitle TitleDoc) {
        Document document = new Document.Builder(TitleDoc.getCounter())
                .addElement(new Element.Builder().setType(ElementType.NAME)
                        .setValue(TitleDoc.getTitle().toString())
                        .setWeight(.70D).createElement())
                .addElement(new Element.Builder().setType(ElementType.TEXT)
                        .setValue(TitleDoc.getDistributor())
                        .setWeight(.20D).createElement())
                .addElement(new Element.Builder().setType(ElementType.TEXT)
                        .setValue(TitleDoc.getContent())
                        .setWeight(.10D).createElement())
                .setThreshold(0.00)
                .createDocument();
        return document;
    }

The weights are .70D, .20D and .10D. In the first comparison below, the first two parameters match exactly and the third matches partially, yet the score is only 47%. Ideally it should be more than 90%. In the second case, the most heavily weighted parameter matches only partially, yet the score is 65%, which is higher than in the first case. Am I missing something here?

Match : {{[{'Spider'}, {'Columbia Pictures'}, {'Adam'}]}=[Match{data={[{'Spider'}, {'Columbia Pictures'}, {'Adam'}]}, matchedWith={[{'Spider'}, {'Columbia Pictures'}, {'AdamJones'}]}, score=0.47368421052631576}]}

Match : {{[{'Inventor'}, {'Columbia Pictures'}, {'Adam'}]}=[Match{data={[{'Inventor'}, {'Columbia Pictures'}, {'Adam'}]}, matchedWith={[{'The Inventor'}, {'Columbia Pictures'}, {'Adam'}]}, score=0.65}]}

  4. I am yet to test setType DATE. What process should I follow when comparing the given dates to get proper score values? '22/05/2002', '22/06/2002', '22/05/2020', '08/04/2014'

  5. For alphanumeric value comparison, which setType is preferred: NAME, ADDRESS or TEXT?

Thank you much in Advance.

manishobhatia commented 3 years ago

  1. The default mechanism to tokenize elements of type TEXT and NAME looks at whole words, which requires them to be separated by spaces. For the values you described, which have no well-defined separators but are sequences of characters, an N-Gram tokenizer is more appropriate. You can override the default in a TEXT element with the predefined trigram tokenizer: .setTokenizerFunction(TokenizerFunction.triGramTokenizer())

  2. Matching the numbers '1234' and '12345' using the NUMBER type expects numerical values that are close to each other, so you will find 1240 to be a better match for 1234 than 12345. It seems like you want the numerical characters to match rather than the actual value. In that case the TEXT type is a better fit, and you can override it with the N-Gram tokenizer as above, since these are not words but sequences of characters.

  3. I apologise for not calling this out in my earlier post. When working with weights, the default is 1.0 for all elements. To give a higher weight to a few elements, we need to provide a number greater than 1. So leave the least significant element at 1.0 and increase the weights of the other elements. Try 4.0 on the first, 2.0 on the second and 1.0 on the third; that should probably give the result you were looking for.

  4. Dates work similarly to numerical values: they use a nearest-neighbor match, where dates close to each other match better than ones further apart. You can play with the .setNeighborhoodRange parameter while creating an element. This takes values from 0 to 1.0, where a higher value will match dates closer to each other.

  5. For alphanumeric values TEXT is again preferred; the big difference is in the choice of tokenizer. If you do not want each word to be compared independently, then override the TokenizerFunction.
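The effect of trigram tokens described in points 1 and 2 can be seen in a minimal plain-Java sketch. This is not the library's `triGramTokenizer`: the class `TrigramSketch` is hypothetical, and it assumes simple lower-casing with no other preprocessing, so the library's actual scores may differ.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: character trigrams with an overlap score, not fuzzy-matcher's code.
final class TrigramSketch {

    // Every run of 3 consecutive characters becomes one token.
    // Assumption: values shorter than 3 characters are kept as a single token.
    static Set<String> trigrams(String value) {
        String s = value.toLowerCase(Locale.ROOT);
        Set<String> grams = new HashSet<>();
        if (s.length() < 3) {
            grams.add(s);
            return grams;
        }
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }

    // Common unique trigrams divided by all unique trigrams across both values.
    static double score(String left, String right) {
        Set<String> a = trigrams(left);
        Set<String> b = trigrams(right);
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        Set<String> all = new HashSet<>(a);
        all.addAll(b);
        return (double) common.size() / all.size();
    }

    public static void main(String[] args) {
        System.out.println("Stop vs Stops      = " + score("Stop", "Stops"));
        System.out.println("Stop vs Stop(2002) = " + score("Stop", "Stop(2002)"));
        System.out.println("1234 vs 12345      = " + score("1234", "12345"));
    }
}
```

Because 'Stop' and 'Stops' share the trigrams "sto" and "top", they get a non-zero score here, whereas a whole-word comparison treats them as two different words.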

deepak-Jenkins commented 3 years ago

Dear Manish,

I applied triGramTokenizer() and weightage concept as you suggested and the results are coming as expected :). For TEXT and NAME type comparison the process is absolutely clear. Thanks a lot for your awesome guidance.

I tried to apply .setNeighborhoodRange with number values as suggested, but the result comes back as either 1 or 0, not as a matching percentage as with TEXT and NAME. I am using the code below.

Document document = new Document.Builder(StringDoc[0])
        .addElement(new Element.Builder().setType(ElementType.NUMBER)
                .setValue(Integer.valueOf(StringDoc[1]))
                .setMatchType(MatchType.NEAREST_NEIGHBORS)
                .setNeighborhoodRange(.99D)
                .setThreshold(0.00)
                .createElement())
        .setThreshold(0.00)
        .createDocument();

I also tried the setup below, but had no luck.

Document document = new Document.Builder(StringDoc[0])
        .addElement(new Element.Builder().setType(ElementType.NUMBER)
                .setValue(Integer.valueOf(StringDoc[1]))
                .setMatchType(MatchType.EQUALITY)
                .setThreshold(0.00)
                .createElement())
        .setThreshold(0.00)
        .createDocument();

I have not tried the DATE type yet. Hopefully, once I am clear on the NUMBER type, the DATE type will be similar.

Thank you much in Advance.

deepak-Jenkins commented 3 years ago

Dear Manish,

In addition to my previous comment, I also tried adding valueTokenizer as below, but did not succeed. It still gives 1 or 0, with no matching percentage in between.

Document document = new Document.Builder(StringDoc[0])
        .addElement(new Element.Builder().setType(ElementType.NUMBER)
                .setValue(Integer.valueOf(StringDoc[1]))
                .setTokenizerFunction(TokenizerFunction.valueTokenizer())
                .setMatchType(MatchType.NEAREST_NEIGHBORS)
                .setNeighborhoodRange(.99D)
                .setThreshold(0.00)
                .createElement())
        .setThreshold(0.00)
        .createDocument();
return document;

I also tried to work with a Date field. I am setting the value with the code below, but it raises an exception at run time: com.intuit.fuzzymatcher.exception.MatchException: Unsupported data type.

Date d = new Date();
try {
    d = new SimpleDateFormat("dd/MM/yyyy").parse(StringDoc[1]);
} catch (Exception ex) {
}
Document document = new Document.Builder(StringDoc[0])
        .addElement(new Element.Builder().setType(ElementType.DATE)
                .setValue(d)
                .setMatchType(MatchType.NEAREST_NEIGHBORS)
                .setTokenizerFunction(TokenizerFunction.valueTokenizer())
                .setNeighborhoodRange(.99D)
                .setThreshold(0.00)
                .createElement())
        .setThreshold(0.00)
        .createDocument();

I would appreciate your help with NUMBER and DATE type comparison.

Thank you much.

manishobhatia commented 3 years ago

The score for NUMBER and DATE is expected to be either 1 or 0. Scoring in other element types depends on tokenization: each element is broken down into smaller tokens, and the percentage of matching tokens is calculated.

Since there is no way to break down these numerical and date values, you will always get a score of 1 if they are within the neighbourhood range, and 0 otherwise
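The all-or-nothing behaviour can be pictured with a small plain-Java sketch. The thread does not state how the library derives its window from the neighbourhood range, so the multiplicative window used here (from value * range to value / range) is purely an assumption for illustration, and the class `NeighborhoodSketch` is hypothetical.

```java
// Illustrative only: a binary in-range score, not fuzzy-matcher's implementation.
final class NeighborhoodSketch {

    // ASSUMPTION: the window spans value * range to value / range, with range in (0, 1].
    // The real library may compute its bounds differently.
    static double score(double value, double candidate, double range) {
        double first = value * range;
        double second = value / range;
        double lower = Math.min(first, second);
        double upper = Math.max(first, second);
        // No tokens to compare, so the result is either a full match or no match.
        return (candidate >= lower && candidate <= upper) ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        System.out.println("1234 vs 1240  = " + score(1234, 1240, 0.99));
        System.out.println("1234 vs 12345 = " + score(1234, 12345, 0.99));
    }
}
```

Under this assumed window, 1240 falls inside the neighbourhood of 1234 and scores 1, while 12345 falls outside and scores 0. There is no partial score in between.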

The Unsupported data type exception occurs when the value you are passing is not a valid date. I am not able to reproduce it with the code you have. My only suspicion is that the date String is not a valid Date, and since the exception is silenced by the empty catch block, that failure goes unnoticed.
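One way to rule out a silently swallowed parse failure is to parse strictly and fail loudly before building the element. The sketch below uses only the standard `SimpleDateFormat` API; the class `StrictDateSketch` and its `parseStrict` helper are hypothetical names, and the dd/MM/yyyy pattern is taken from the snippet above.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Illustrative helper: surfaces bad input instead of hiding it in an empty catch block.
final class StrictDateSketch {

    static Date parseStrict(String text) {
        SimpleDateFormat format = new SimpleDateFormat("dd/MM/yyyy");
        // Non-lenient mode rejects impossible dates such as 31/02/2002.
        format.setLenient(false);
        try {
            return format.parse(text);
        } catch (ParseException e) {
            // Rethrow with context so the failing value is visible to the caller.
            throw new IllegalArgumentException("Not a dd/MM/yyyy date: " + text, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseStrict("22/05/2002"));
        try {
            parseStrict("31/02/2002");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The returned Date can then be passed to .setValue(...) as in the original snippet, and any malformed input shows up immediately with the offending string in the message.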

deepak-Jenkins commented 3 years ago

Dear Manish,

Got it. Thank you for your help and guidance. I implemented the concepts and the results are mostly as expected :).

I observed one case as below.

Suppose I have a name, an address and a phone number to match, and I prepare 2 documents with a weight on each section. Now assume that in the 1st document the name is 'Rohan' and in the 2nd document the address is 'Sarabai Rohan Street'. Here 'Rohan' is common to both, which raises the matching score. Ideally, a name should be compared only with a name and an address only with an address.

Any suggestions for this scenario?

Thank you.

manishobhatia commented 3 years ago

Hi,

This behavior is definitely not expected. Each element only matches similar items within that element. I am unable to replicate the scenario you laid out. Can you share your code snippet? That should help me debug it.

(Sent by email in reply to deepak-Jenkins's comment of Aug 19, 2021, 2:56 AM.)

manishobhatia commented 2 years ago

Closing this issue for now. Feel free to reopen it if you feel it is still unresolved.