larsga / Duke

Duke is a fast and flexible deduplication engine written in Java
Apache License 2.0

Duke Handling of Missing Values #241

Open atifijazkhan opened 7 years ago

atifijazkhan commented 7 years ago

Consider the following dataset:

1,john,doe
2,john,
3,john,watson

For matching purposes, I am assuming that both attributes are of equal importance, so high=0.999 and low=0.001 have been set for each of them, with the Exact Comparator used for matching.

Normally I would expect the following:

1: 1-match-1: produces a match score of ~1

2: 1-match-2: produces a match score somewhere between 0.5 and 1, but much lower than #1

3: 1-match-3: produces a match score of ~0.5 (as only 1 of the 2 attributes matches).

I get the following scores:

1: 1-match-1: Overall: 0.999998997998

2: 1-match-2: Overall: 0.999

3: 1-match-3: Overall: 0.4999999999999998
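
The three scores above are consistent with a naive Bayes combination of per-property probabilities, starting from 0.5 and skipping any property that has a missing value. The sketch below reproduces the arithmetic under that assumption. It is not Duke's actual code, and the class and method names (`BayesSketch`, `combine`, `score`) are hypothetical.

```java
final class BayesSketch {
    // Combine two probabilities with the naive Bayes rule (assumed formula).
    static double combine(double p1, double p2) {
        return (p1 * p2) / ((p1 * p2) + (1.0 - p1) * (1.0 - p2));
    }

    // Start at 0.5 and fold in one probability per compared property.
    // A property with a missing value is simply not passed in.
    static double score(double... probs) {
        double overall = 0.5;
        for (double p : probs) {
            overall = combine(overall, p);
        }
        return overall;
    }

    public static void main(String[] args) {
        double high = 0.999;
        double low = 0.001;
        System.out.println(score(high, high)); // record 1 vs 1: both properties match
        System.out.println(score(high));       // record 1 vs 2: last name missing, skipped
        System.out.println(score(high, low));  // record 1 vs 3: last name differs
    }
}
```

With a single matching property the result stays at high (about 0.999), which is why #2 lands so close to #1.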

Notice how close the scores for #1 and #2 are. I understand that Duke ignores missing values. However, if I wanted missing values to affect the score, what would be the best course of action?

I would like to achieve something like the following:

1: 1-match-1: Overall: 0.999998997998

2: 1-match-2: Overall: 0.75

3: 1-match-3: Overall: 0.4999999999999998
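
To make the target for #2 concrete: assuming the scores are combined with the naive Bayes rule, one can solve for the probability a missing value would have to contribute so that one exact match (0.999) plus one missing value gives 0.75. The sketch below is arithmetic only. It does not show how Duke could be made to apply such a value, and the names `MissingValueSketch`, `combine` and `solveMissing` are hypothetical.

```java
final class MissingValueSketch {
    // Naive Bayes combination of two probabilities (assumed formula).
    static double combine(double p1, double p2) {
        return (p1 * p2) / ((p1 * p2) + (1.0 - p1) * (1.0 - p2));
    }

    // Solve combine(matched, p) == target for p:
    //   p = target * (1 - matched) / (matched * (1 - target) + target * (1 - matched))
    static double solveMissing(double matched, double target) {
        double numerator = target * (1.0 - matched);
        return numerator / (matched * (1.0 - target) + numerator);
    }

    public static void main(String[] args) {
        double matched = 0.999; // score after the one matching property
        double target = 0.75;   // desired overall score for record 1 vs 2
        double p = solveMissing(matched, target);
        System.out.println("missing-value probability: " + p); // roughly 0.003
        System.out.println("check: " + combine(matched, p));   // close to 0.75
    }
}
```

Under these assumptions a missing value would need to count almost as heavily as a mismatch (about 0.003 versus low=0.001) to pull the score down to 0.75, because a single 0.999 match dominates the combination.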