kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage
dedupeMatches does not consider exact matches #78

Open jw2249a opened 6 months ago

jw2249a commented 6 months ago

The deduplication appears to take each match pattern and the index of the matched record, then keep the match with the highest zeta value, but it does not account for zeta values that are exactly equal. This leads to inconsistent behavior.
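To make the tie problem concrete, here is a minimal, language-neutral sketch in Python. It is not fastLink's code: the `(a, b, zeta)` tuples and both function names are hypothetical, and it only assumes that deduplication keeps the highest-zeta candidate per record, as described above.

```python
from collections import defaultdict

# Hypothetical toy model, not fastLink internals.
# Each pair is (index in dfA, index in dfB, zeta).

def keep_first_max(pairs):
    """For each dfB record, keep only the first pair with the highest zeta."""
    best = {}
    for a, b, zeta in pairs:
        # Strict ">" means a later pair with an equal zeta never replaces
        # the earlier one, so ties are resolved by input order.
        if b not in best or zeta > best[b][2]:
            best[b] = (a, b, zeta)
    return sorted(best.values())

def keep_all_max(pairs):
    """For each dfB record, keep every pair tied for the highest zeta."""
    by_b = defaultdict(list)
    for pair in pairs:
        by_b[pair[1]].append(pair)
    kept = []
    for group in by_b.values():
        top = max(zeta for _, _, zeta in group)
        kept.extend(p for p in group if p[2] == top)
    return sorted(kept)

# Two dfA records match dfB record 10 with exactly the same zeta.
pairs = [(1, 10, 1.0), (2, 10, 1.0), (3, 11, 0.8)]
first_max = keep_first_max(pairs)
all_max = keep_all_max(pairs)
```

In this toy model, `keep_first_max` silently drops one of the two tied pairs, and which one survives depends only on input order, while `keep_all_max` keeps both.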

Setup: The issue can be reproduced by appending the first row of dfA (where firstname is "daniel") to both dfA and dfB. The appended record then has an exact match in both dfA and dfB.

Issue 1: With the setup above, the dedupe algorithm returns all of the matched values. However, if you change the firstname in the first row to NA, the record is removed.

Issue 2: If you change the lastname "secuya" to "secuyas" while leaving the firstname as NA, the record is still removed by the dedupe function. But if you set the firstname back to "daniel", it is not deduped.

tedenamorado commented 6 months ago

Thanks for letting us know! I will try to reproduce what you describe and report back.

jw2249a commented 5 months ago

I found the issue with the deduplication. The order of the data frames matters because the duplicate row IDs are removed before they are checked for again in dfB.
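A small Python sketch of why the order can matter, under the assumption that deduplication runs as two passes, one per data frame, each keeping the highest-zeta pair per row ID. This is one reading of the comment above, not fastLink's implementation; `dedupe_side`, `two_pass`, and the sample pairs are hypothetical.

```python
# Hypothetical toy model, not fastLink internals.
# Each pair is (row id in dfA, row id in dfB, zeta).

def dedupe_side(pairs, side):
    """Keep, for each row id on one side (0 = dfA, 1 = dfB), the pair with the highest zeta."""
    best = {}
    for pair in pairs:
        key = pair[side]
        if key not in best or pair[2] > best[key][2]:
            best[key] = pair
    return sorted(best.values())

def two_pass(pairs, first_side):
    """Dedupe on one side first, then dedupe the survivors on the other side."""
    survivors = dedupe_side(pairs, first_side)
    return dedupe_side(survivors, 1 - first_side)

pairs = [(1, 1, 0.9), (1, 2, 0.8), (2, 2, 0.7)]
a_then_b = two_pass(pairs, first_side=0)  # dfA ids first, then dfB ids
b_then_a = two_pass(pairs, first_side=1)  # dfB ids first, then dfA ids
```

Here deduplicating dfA first keeps two pairs, while deduplicating dfB first keeps only one: the first pass has already discarded a pair that the second pass would otherwise have kept. That would be consistent with results changing when the data frames are swapped.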