intuit / fuzzy-matcher

A Java library to determine probability of objects being similar.
Apache License 2.0

Threshold matching not changing for ElementType: DATE #32

Closed: alephzed closed this issue 4 years ago

alephzed commented 4 years ago

I am trying to understand how to change the matching threshold for a set of three dates: "07/15/2019", "01/01/2020", and "01/02/2020". The matching score always comes back as 1.0 for all three dates. I have tried changing the threshold at both the Element level and the Document level, and it makes no difference. How can I change the matching so that 07/15/2019 does not match 01/01/2020 with the same score as 01/01/2020 matches 01/02/2020? Here is my sample code, similar to the JUnit test I found in your project but modified for a Spring Boot application:


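import static com.intuit.fuzzymatcher.domain.ElementType.DATE;

import com.intuit.fuzzymatcher.component.MatchService;
import com.intuit.fuzzymatcher.domain.Document;
import com.intuit.fuzzymatcher.domain.Element;
import com.intuit.fuzzymatcher.domain.ElementType;
import com.intuit.fuzzymatcher.domain.Match;
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import org.springframework.boot.CommandLineRunner;
import org.springframework.stereotype.Component;

@Component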
public class FuzzylogicRunner implements CommandLineRunner {

    @Override
    public void run(String... args) {
        MatchService matchService = new MatchService();

        List<Object> dates = Arrays.asList(getDate("07/15/2019"), getDate("01/01/2020"), getDate("01/02/2020"));
        List<Document> documentList1 = getTestDocuments(dates, DATE, null);
        Map<Document, List<Match<Document>>> result1 = matchService.applyMatch(documentList1);

        result1.forEach((key, value) -> value.forEach(match -> {
            System.out.println("Data: " + match.getData() + " Matched With: " + match.getMatchedWith() + " Score: " + match.getScore().getResult());
        }));
    }

    private List<Document> getTestDocuments(List<Object> values, ElementType elementType, Double neighborhoodRange) {
        AtomicInteger ai = new AtomicInteger(0);
        return values.stream().map(num -> {
            Element.Builder elementBuilder = new Element.Builder().setType(elementType).setValue(num).setThreshold(0.1);
            if (neighborhoodRange != null) {
                elementBuilder.setNeighborhoodRange(neighborhoodRange);
            }
            return new Document.Builder(Integer.toString(ai.incrementAndGet()))
                    .addElement(elementBuilder.createElement()).setThreshold(0.1)
                    .createDocument();
        }).collect(Collectors.toList());
    }

    private Date getDate(String val) {
        DateFormat df = new SimpleDateFormat("MM/dd/yyyy");
        try {
            return df.parse(val);
        } catch (ParseException e) {
            e.printStackTrace();
        }
        return null;
    }
}

The output is always the same, even if I change the threshold from 0.1 to 0.9:

Data: {[{'Mon Jul 15 00:00:00 MDT 2019'}]} Matched With: {[{'Wed Jan 01 00:00:00 MST 2020'}]} Score: 1.0
Data: {[{'Mon Jul 15 00:00:00 MDT 2019'}]} Matched With: {[{'Thu Jan 02 00:00:00 MST 2020'}]} Score: 1.0
Data: {[{'Wed Jan 01 00:00:00 MST 2020'}]} Matched With: {[{'Mon Jul 15 00:00:00 MDT 2019'}]} Score: 1.0
Data: {[{'Wed Jan 01 00:00:00 MST 2020'}]} Matched With: {[{'Thu Jan 02 00:00:00 MST 2020'}]} Score: 1.0
Data: {[{'Thu Jan 02 00:00:00 MST 2020'}]} Matched With: {[{'Mon Jul 15 00:00:00 MDT 2019'}]} Score: 1.0
Data: {[{'Thu Jan 02 00:00:00 MST 2020'}]} Matched With: {[{'Wed Jan 01 00:00:00 MST 2020'}]} Score: 1.0
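
To make the expected distinction concrete, the gaps between the three dates can be computed with the standard library alone. This is only an illustrative sketch: the class name DateGaps and the helper daysBetween are hypothetical and are not part of fuzzy-matcher, and the sketch says nothing about how the library itself scores dates.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

// Hypothetical helper class, not part of fuzzy-matcher.
public class DateGaps {
    // Same MM/dd/yyyy layout as the sample dates in this issue.
    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("MM/dd/uuuu");

    // Absolute number of calendar days between two dates given as strings.
    static long daysBetween(String a, String b) {
        LocalDate first = LocalDate.parse(a, FMT);
        LocalDate second = LocalDate.parse(b, FMT);
        return Math.abs(ChronoUnit.DAYS.between(first, second));
    }

    public static void main(String[] args) {
        String[] dates = {"07/15/2019", "01/01/2020", "01/02/2020"};
        // Print the gap for every unordered pair of dates.
        for (int i = 0; i < dates.length; i++) {
            for (int j = i + 1; j < dates.length; j++) {
                System.out.println(dates[i] + " vs " + dates[j] + ": "
                        + daysBetween(dates[i], dates[j]) + " day(s)");
            }
        }
    }
}
```

The pairs are 170, 171, and 1 day(s) apart respectively, yet all three currently receive the same score of 1.0.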

manishobhatia commented 4 years ago

Hi Adam,

For dates and numbers, this is supported in the latest version through the neighborhoodRange attribute on the Element, which defaults to 0.9. In your test, you can increase it to 0.901 to get the results you are looking for:

List<Document> documentList1 = getTestDocuments(dates, DATE, 0.901);

I noticed a bug where you can't set the value beyond 0.91. I'll fix it in the upcoming release.

Thanks, Manish

alephzed commented 4 years ago

Thank you Manish, manipulating the neighborhoodRange parameter is what I needed.