intuit / fuzzy-matcher

A Java library to determine probability of objects being similar.
Apache License 2.0
226 stars, 69 forks

Date Match with NeighborhoodRange greater than 0.91 fails to give valid results #35

Closed manishobhatia closed 3 years ago

manishobhatia commented 3 years ago

The DATE element type allows the user to override the NeighborhoodRange, but a value greater than 0.91 causes poor matches to show up.

The following test case in MatchServiceTest.java reproduces the failure:

@Test
public void itShouldApplyMatchWithDate() {
    // Two dates one day apart, plus one roughly six months earlier
    List<Object> dates = Arrays.asList(getDate("01/01/2020"), getDate("01/02/2020"), getDate("07/15/2019"));
    // Third argument: the NeighborhoodRange value under test
    List<Document> documentList = getTestDocuments(dates, DATE, 0.91);
    Map<Document, List<Match<Document>>> result = matchService.applyMatch(documentList);
    // Print every match with its score for inspection
    result.entrySet().forEach(entry -> {
        entry.getValue().forEach(match -> {
            System.out.println("Data: " + match.getData() + " Matched With: " + match.getMatchedWith() + " Score: " + match.getScore().getResult());
        });
    });

    // Only the two adjacent dates are expected to have matches
    Assert.assertEquals(2, result.size());
}
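
The test calls a getDate helper that is not shown in this issue. For anyone reproducing it outside the project's test class, here is a minimal stand-in. It assumes the strings are in MM/dd/yyyy format, which "07/15/2019" implies. The class name DateTestHelper is hypothetical, and the project's real helper may differ.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.concurrent.TimeUnit;

public class DateTestHelper {
    // Hypothetical stand-in for the getDate helper used in the test above.
    // Assumes MM/dd/yyyy input; the project's own helper may be written differently.
    static Date getDate(String value) {
        SimpleDateFormat format = new SimpleDateFormat("MM/dd/yyyy");
        format.setLenient(false); // reject impossible dates instead of rolling them over
        try {
            return format.parse(value);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + value, e);
        }
    }

    public static void main(String[] args) {
        Date first = getDate("01/01/2020");
        Date second = getDate("01/02/2020");
        // Difference between the two closest test dates, in whole days
        long days = TimeUnit.MILLISECONDS.toDays(second.getTime() - first.getTime());
        System.out.println("Days apart: " + days);
    }
}
```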

As we increase the value beyond 0.91, dates that are not in the neighborhood show up in the results.

The issue stems primarily from the incorrect usage of this line: https://github.com/intuit/fuzzy-matcher/blob/d2ce6f6f53a2ea5b1d628bd2fb0aec5d1d22bc5a/src/main/java/com/intuit/fuzzymatcher/component/TokenRepo.java#L89

It widens the TokenRange's lower and higher bounds, causing incorrect matches to show up.
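
To illustrate why broader bounds pull in unrelated dates, here is a rough standalone sketch of a range lookup over a sorted set. It is not the library's code. The class NeighborhoodSketch, the neighbors method, and the windowDays parameter are all hypothetical, and the fixed day window is only an assumed stand-in for however TokenRepo actually derives its bounds.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

public class NeighborhoodSketch {
    // Hypothetical helper: returns every stored date within windowDays of the target, inclusive.
    // The fixed day window is an assumption for illustration, not the library's formula.
    static NavigableSet<LocalDate> neighbors(NavigableSet<LocalDate> dates, LocalDate target, long windowDays) {
        LocalDate lower = target.minusDays(windowDays);
        LocalDate higher = target.plusDays(windowDays);
        return dates.subSet(lower, true, higher, true);
    }

    public static void main(String[] args) {
        // Same three dates as the test case in this issue
        NavigableSet<LocalDate> dates = new TreeSet<>(Arrays.asList(
                LocalDate.of(2020, 1, 1), LocalDate.of(2020, 1, 2), LocalDate.of(2019, 7, 15)));
        LocalDate target = LocalDate.of(2020, 1, 1);

        // Tight bounds: only the adjacent date is a neighbor
        System.out.println(neighbors(dates, target, 2));   // [2020-01-01, 2020-01-02]
        // Bounds that are too broad: the unrelated July date is included as well
        System.out.println(neighbors(dates, target, 365)); // [2019-07-15, 2020-01-01, 2020-01-02]
    }
}
```

With the tight window, only the two January dates are returned. Once the bounds are too wide, the lookup also returns the July date, which is the kind of poor match described above.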

aavaas commented 3 years ago

Hi! I would like to take on this bug! Please assign.

Thank you!