intuit / fuzzy-matcher

A Java library to determine probability of objects being similar.
Apache License 2.0

Matching two strings #49

Closed nayan-jyoti closed 3 years ago

nayan-jyoti commented 3 years ago

Hi,

I came across this library and tried a very elementary test: matching two strings to find the score. After going through the README, hitting several null pointer exceptions, and reading the issues already raised, I was finally able to get something working by doing this:

    List<String> input = Arrays.asList(firstStr, secondStr);

    // Wrap each string in a Document holding a single TEXT element.
    List<Document> documentList = new ArrayList<>();
    for (String str : input) {
        Document document = new Document.Builder(str)
                .addElement(new Element.Builder<String>().setValue(str).setType(ElementType.TEXT).createElement())
                .createDocument();
        documentList.add(document);
    }
    Map<String, List<Match<Document>>> result = matchService.applyMatchByDocId(documentList);

    // Print every match found for every document id.
    result.entrySet().forEach(entry -> {
        entry.getValue().forEach(match -> {
            System.out.println("Data: " + match.getData() + " Matched With: " + match.getMatchedWith() + " Score: "
                    + match.getScore().getResult());
        });
    });

However, I found that:

  1. The strings match each other twice, i.e. firstStr matches secondStr and then vice versa, while only a single match was required.
  2. I tried a string with numbers, such as "Nayan J Bayan 123", with element type TEXT, and the result came out as an empty map. What should the element type be if I want to check strings containing numbers and special characters? I also tried the ADDRESS type and the same empty map was returned.
  3. I am not at all sure my implementation is correct. What would the correct implementation have been?

Thanks

manishobhatia commented 3 years ago

Hi Nayan,

The usage of the library looks accurate.

  1. Regarding the two results being displayed: the intention is to show which documents match with which others. If you prefer, you could use applyMatchByGroups, which collects all the matching elements into a single group and displays it.
  2. Numbers in the input should not be a problem for matching. I think your match result might have fallen below the default threshold (0.5). Each element is split on spaces into tokens that are matched against the others; "Nayan J Bayan 123" has 4 tokens, so if another element has more than 2 similar tokens, it should match.

Here are modifications to your example with applyMatchByGroups:

        String[] input = new String[]{"Nayan J 123", "Nayan J Bayan 123"};

        Document document = null;
        List<Document> documentList = new ArrayList<>();
        for (String str : input) {
            document = new Document.Builder(str)
                    .addElement(new Element.Builder<String>().setValue(str).setType(ElementType.TEXT).createElement())
                    .createDocument();
            documentList.add(document);
        }
        Set<Set<Match<Document>>> result = matchService.applyMatchByGroups(documentList);

        result.forEach(entry -> {
            entry.forEach(match -> {
                System.out.println("Data: " + match.getData() + " Matched With: " + match.getMatchedWith() + " Score: "
                        + match.getScore().getResult());
            });
        });
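
The token arithmetic in point 2 can be sketched in plain Java without the library. This is only a simplified model, not fuzzy-matcher's actual scoring code: the class `TokenOverlapSketch` and its methods are hypothetical, and it assumes the score is the number of shared tokens divided by the token count of the longer string. Under that assumption, "Nayan J 123" and "Nayan J Bayan 123" share 3 of 4 tokens, which is above the 0.5 default.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; it does not call or reproduce the fuzzy-matcher API.
class TokenOverlapSketch {

    // Split a value on whitespace into a set of lower-cased tokens.
    static Set<String> tokens(String value) {
        return new HashSet<>(Arrays.asList(value.trim().toLowerCase(Locale.ROOT).split("\\s+")));
    }

    // Assumed scoring rule: shared tokens divided by the larger token count.
    static double overlapScore(String left, String right) {
        Set<String> a = tokens(left);
        Set<String> b = tokens(right);
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        return (double) shared.size() / Math.max(a.size(), b.size());
    }

    public static void main(String[] args) {
        double score = overlapScore("Nayan J 123", "Nayan J Bayan 123");
        // 3 shared tokens out of 4, compared against the default threshold of 0.5.
        System.out.println("score = " + score + ", above 0.5: " + (score > 0.5));
    }
}
```

With the same rule, "Nayan Bayan" and "Nayan Jyoti" share 1 of 2 tokens, which gives exactly 0.5 and so does not exceed the default.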

Hope this helps

Thanks

nayan-jyoti commented 3 years ago

Hi Manish,

Thanks for the prompt response. It was indeed helpful.

    document = new Document.Builder(str)
            .addElement(new Element.Builder<String>().setValue(str)
                    .setType(ElementType.TEXT).setThreshold(0.0).createElement())
            .createDocument();

I played with a few more string pairs after setting the threshold value to 0.0 as above:

  1. "Vrij Bhooshan" & "VRIJA BHOOSHAN"
  2. "Mohammad Ashfaque" & "MOHAMMED ASHFAQUE"
  3. "Nayan Bayan" & "Nayan Jyoti"
  4. "Nayan Jyoti Bayan Test" & "Nayan Jyoti B T"

However, none of them returned any result. Is this expected behaviour?

Edit:

  1. I tried to set the match type to Nearest Neighbour. However, on running it I got the following exception: com.intuit.fuzzymatcher.exception.MatchException: Data Type not supported. I tried the same with ElementType.NAME and got the same exception.
  2. I changed the element type to NAME and got a very good score of 1.0 for pairs 1 and 2. However, I had no luck with the remaining two.

Thanks

manishobhatia commented 3 years ago

Hi Nayan,

ElementType.NAME is a better choice for this kind of match. It uses Soundex to match names, which smooths over misspellings and closely spelled variants. Nearest Neighbors is a better choice for numeric and date elements, where values are near each other but not identical.
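
To illustrate why Soundex treats pairs such as "Mohammad" and "MOHAMMED" as equal, here is a minimal sketch of the standard American Soundex rules in plain Java. The class `SoundexSketch` is hypothetical and written only for this illustration; the implementation the library actually uses may differ in edge cases.

```java
import java.util.Locale;

// Hypothetical stand-alone implementation of American Soundex.
class SoundexSketch {

    // Digit for each letter A..Z; '0' marks letters that are not coded.
    private static final String CODES = "01230120022455012623010202";

    static String soundex(String name) {
        StringBuilder out = new StringBuilder();
        char lastCode = '0';
        for (char c : name.toUpperCase(Locale.ROOT).toCharArray()) {
            if (c < 'A' || c > 'Z') {
                continue; // ignore anything that is not a basic Latin letter
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // the first letter is kept as-is
                lastCode = code;
            } else if (c != 'H' && c != 'W') {
                // H and W are skipped without separating equal codes;
                // vowels reset lastCode so a repeated code is emitted again.
                if (code != '0' && code != lastCode) {
                    out.append(code);
                }
                lastCode = code;
            }
            if (out.length() == 4) {
                break;
            }
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (String name : new String[]{"Mohammad", "MOHAMMED", "Vrij", "VRIJA", "Bayan", "B"}) {
            System.out.println(name + " -> " + soundex(name));
        }
    }
}
```

Under these rules "Mohammad" and "MOHAMMED" both encode to M530 and "Vrij" and "VRIJA" both encode to V620, whereas "Bayan" (B500) and "B" (B000) differ, so abbreviated tokens are not treated as the same name.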

I see you have reduced the threshold value. If you set it on the Document instead of the Element, you will see them match:

        String[] input = new String[]{"Nayan Jyoti Bayan Test", "Nayan Jyoti B T"};

        Document document = null;
        List<Document> documentList = new ArrayList<>();
        for (String str : input) {
            document = new Document.Builder(str)
                    .addElement(new Element.Builder<String>().setValue(str).setType(ElementType.NAME).createElement())
                    .setThreshold(0.49)
                    .createDocument();
            documentList.add(document);
        }
        Set<Set<Match<Document>>> result = matchService.applyMatchByGroups(documentList);

        result.forEach(entry -> {
            entry.forEach(match -> {
                System.out.println("Data: " + match.getData() + " Matched With: " + match.getMatchedWith() + " Score: "
                        + match.getScore().getResult());
            });
        });
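
One possible reading of why 0.49 works here while the default of 0.5 does not is sketched below. It rests on two assumptions that the thread does not confirm: that only two of the four name tokens ("Nayan" and "Jyoti") count as equal, giving a score of 2/4, and that a score must be strictly greater than the threshold to be reported. The class `ThresholdSketch` and its methods are hypothetical and do not use the library.

```java
// Hypothetical model of a threshold check; not the fuzzy-matcher implementation.
class ThresholdSketch {

    // Assumed score: matching tokens divided by total tokens.
    static double score(int matchingTokens, int totalTokens) {
        return (double) matchingTokens / totalTokens;
    }

    // Assumed rule: a match is reported only when the score strictly exceeds the threshold.
    static boolean passes(double score, double threshold) {
        return score > threshold;
    }

    public static void main(String[] args) {
        double s = score(2, 4); // "Nayan Jyoti Bayan Test" vs "Nayan Jyoti B T"
        System.out.println("passes default 0.5: " + passes(s, 0.5));
        System.out.println("passes 0.49: " + passes(s, 0.49));
    }
}
```

If the assumptions hold, a score of exactly 0.5 is filtered out at the default and reported once the threshold drops to 0.49.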

nayan-jyoti commented 3 years ago

Thanks for the explanation. I am able to get scores after setting the threshold at the document level. Thank you again for your time.

Closing the issue as my queries are answered.