popbr / data-integration

Implementing Levenshtein Distance #18

Open MNSleeper opened 1 year ago

MNSleeper commented 1 year ago

There are two main ways to implement Levenshtein distance (LD), listed below:

The latter option seems best.

On top of this, there's one small issue: what if a misspelling gets accepted as the proper spelling, and every correct spelling of that word then gets "corrected" to the improper one? My only thought is to keep an internal dictionary of all proper spellings.

Finally, what should an acceptable LD be? Should we use a static number, like 3, or base it off some metric like the length of the string being used?
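To make the static-versus-relative question concrete, here is a minimal sketch in Python. The language, the function names, and the 0.25 ratio are assumptions chosen for illustration, not anything decided in this issue; only the static value 3 comes from the question above.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def within_static(a: str, b: str, max_distance: int = 3) -> bool:
    """Static rule: accept when the LD is at most a fixed number."""
    return levenshtein(a, b) <= max_distance


def within_relative(a: str, b: str, max_ratio: float = 0.25) -> bool:
    """Relative rule: accept when LD / longest length is small enough.

    The 0.25 default is a hypothetical value, not a tuned one.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return True  # two empty strings are identical
    return levenshtein(a, b) / longest <= max_ratio
```

With a static threshold of 3, `within_static("cat", "dog")` accepts the pair even though every character differs, while `within_relative("cat", "dog")` rejects it. That suggests short strings need a length-aware rule, though the right ratio would have to be found on real data.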

MNSleeper commented 1 year ago

@aubertc

aubertc commented 1 year ago

Yes, running LD after the data has been imported is best. It also has two other advantages:

> On top of this, there's one small issue: what if a misspelling gets accepted as the proper spelling, and every correct spelling of that word then gets "corrected" to the improper one? My only thought is to keep an internal dictionary of all proper spellings.

Yes, we should weight those things: if the LD of, say, the first and last name is very small, but the email addresses are different, then we should probably not match them. As a first approximation, we should match things (= fix spelling) only if other elements support the conclusion that those spellings refer to the same entity.

> Finally, what should an acceptable LD be? Should we use a static number, like 3, or base it off some metric like the length of the string being used?

We should use a multivariate score: if the LDs on the attributes used for identification (say, first name + last name + email + affiliation) are 0, 0, 0, 6, then we may be good (this would be "Univ. of Massachusetts" vs. "University of Massachusetts"), but anything with no 0 at all (that is, no exact match) is probably suspicious. Finding the right balance will be hard, but we should definitely take more than one LD into account before making a decision.
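A rough sketch of what such a multi-attribute check could look like, in Python. The field names, the sample records, and both thresholds (`min_exact`, `max_total`) are hypothetical placeholders. The only rule taken from the discussion is that a pair with no exact match on any attribute is treated as suspicious.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


# Hypothetical identifying attributes; the real schema may differ.
FIELDS = ("first_name", "last_name", "email", "affiliation")


def attribute_distances(rec_a: dict, rec_b: dict, fields=FIELDS) -> list:
    """One LD per identifying attribute, in the order given by fields."""
    return [levenshtein(rec_a[f], rec_b[f]) for f in fields]


def probably_same_entity(rec_a: dict, rec_b: dict, fields=FIELDS,
                         min_exact: int = 1, max_total: int = 6) -> bool:
    """Accept only if enough attributes match exactly AND the summed
    distance stays under a budget. Both thresholds are assumptions."""
    distances = attribute_distances(rec_a, rec_b, fields)
    exact = sum(1 for d in distances if d == 0)
    return exact >= min_exact and sum(distances) <= max_total


# Made-up records for illustration only.
a = {"first_name": "Jane", "last_name": "Doe",
     "email": "jdoe@example.edu", "affiliation": "Univ. of Massachusetts"}
b = dict(a, affiliation="University of Massachusetts")
c = {"first_name": "Jone", "last_name": "Do",
     "email": "jd@example.org", "affiliation": "University of Massachusets"}

print(attribute_distances(a, b))   # [0, 0, 0, 6]
print(probably_same_entity(a, b))  # True
print(probably_same_entity(a, c))  # False: no attribute matches exactly
```

A summed budget is only one possible way to combine the distances. Per-attribute weights, or requiring an exact match on the email specifically, would be equally plausible starting points.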