Implementing Levenshtein Distance

MNSleeper commented 1 year ago

There are two main ways to implement LD, listed below:

- Run an LD test on every entry of the array of strings that gets passed to SQL once a file is parsed. This would correct any misspellings in the file list before it gets put into SQL. However, two different datasets could have two different spellings of the same entity, so the LD test would have to be run again.
- Run an LD test after a dataset has been put into SQL but before running the Linkage method on that particular table, likely as part of the LinkTable method itself. This would ensure that all misspellings across all tables are corrected.

The latter option seems best.
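
For reference, here is a minimal sketch of what an "LD test" computes, written in Python purely for illustration (the thread does not say which language the project uses, and the function name `levenshtein` is an assumption, not existing project code). It uses the standard two-row dynamic-programming formulation.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn one string into the other."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    # previous[j] holds the distance between the empty prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # drop ca
                current[j - 1] + 1,      # add cb
                previous[j - 1] + cost,  # substitute ca by cb (or keep it)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("Univ. of Massachusetts", "University of Massachusetts"))
```

Whichever option is chosen, the comparison itself stays the same; only the place where it is called changes.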

On top of this, there is one small issue: what if a misspelling gets accepted as the proper spelling, and every correct spelling of a word gets "corrected" to the improper one? My only thought is to keep an internal dictionary of all proper spellings.

Finally, what should an acceptable LD be? Should we use a static number, like 3, or base it off some metric like the length of the string being used?
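
To make the two alternatives concrete, here is a rough Python sketch of a static limit next to a length-based one. The function names, the default limit of 3 (taken from the example above), and the ratio of 0.25 are all placeholders for discussion, not values anyone has agreed on.

```python
def within_static_limit(distance: int, limit: int = 3) -> bool:
    """Accept any pair whose LD is at most a fixed number (hypothetical default: 3)."""
    return distance <= limit


def within_relative_limit(distance: int, a: str, b: str, ratio: float = 0.25) -> bool:
    """Accept a pair whose LD is at most a fraction of the longer string's length.

    The ratio 0.25 is an arbitrary placeholder.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return True  # two empty strings are identical
    return distance / longest <= ratio
```

The difference shows up at the extremes: an LD of 3 between two three-letter names passes the static limit even though every character differs, while the relative check rejects it. Conversely, an LD of 6 between two long affiliation strings fails a static limit of 3 but can pass a relative one.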

MNSleeper commented 1 year ago

@aubertc

aubertc commented 1 year ago

Yes, running LD after the data has been imported is best. It also has two other advantages:

- We should not edit the data that is imported: we should import it "raw" so that it always remains accessible for future improvements.
- Fixing errors should be done when we are matching things. That is, we could have two entries (possibly in separate tables) whose identifying attributes (say, first name + last name + email) differ but that are still considered to represent a single entity because … and then we would need to store the reason, which could be:
  - a possible misspelling,
  - a change of email address,
  - an alternative spelling of one of the names,
  - etc.
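
As a rough illustration of "store the reason", the sketch below keeps the raw rows untouched and records each link between two rows separately. It is in Python only for illustration, and every name in it (`MatchReason`, `EntityMatch`, the table names and row ids) is hypothetical rather than part of the existing schema.

```python
from dataclasses import dataclass
from enum import Enum


class MatchReason(Enum):
    """Why two differing entries are treated as the same entity."""
    POSSIBLE_MISSPELLING = "possible misspelling"
    EMAIL_CHANGE = "change of email address"
    ALTERNATIVE_SPELLING = "alternative spelling of a name"


@dataclass(frozen=True)
class EntityMatch:
    """A link between two raw rows; the rows themselves are never edited."""
    left_table: str
    left_row_id: int
    right_table: str
    right_row_id: int
    reason: MatchReason


# Hypothetical example: row 12 of one table and row 40 of another
# are linked because one of them looks like a misspelling of the other.
match = EntityMatch("dataset_a", 12, "dataset_b", 40, MatchReason.POSSIBLE_MISSPELLING)
```

Keeping matches in their own records means a wrong match can be removed later without having to restore any imported data.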

> On top of this, there is one small issue: what if a misspelling gets accepted as the proper spelling, and every correct spelling of a word gets "corrected" to the improper one? My only thought is to keep an internal dictionary of all proper spellings.

Yes, we should weigh those things: if the LD of, say, the first and last names is very small, but the email addresses are different, then we should probably not match them. As a first approximation, we should match things (= fix spelling) only if we have other elements supporting that those spellings refer to the same entity.

> Finally, what should an acceptable LD be? Should we use a static number, like 3, or base it off some metric like the length of the string being used?

We should use a multivariate score: if the LDs on the attributes used for identification (say, first name + last name + email + affiliation) are 0, 0, 0, 6, then we may be good (this would be "Univ. of Massachusetts" vs. "University of Massachusetts"), but anything with no 0 at all (that is, no exact match) is probably suspicious. Finding the right balance will be hard, but we should definitely take more than one LD into account before making a decision.
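
One possible shape for such a rule, sketched in Python as a starting point only: it takes the per-attribute LDs as input and returns a coarse verdict. The function name `classify`, the attribute keys, the verdict strings, and the `max_total` budget of 6 are all assumptions made for this example; the actual weights would need tuning.

```python
from typing import Mapping


def classify(distances: Mapping[str, int], max_total: int = 6) -> str:
    """Combine per-attribute Levenshtein distances into one verdict.

    distances maps an attribute name to the LD between two entries on
    that attribute. max_total is a placeholder budget for the summed LDs.
    """
    exact_matches = sum(1 for d in distances.values() if d == 0)
    if exact_matches == 0:
        return "suspicious"  # nothing matches exactly
    if sum(distances.values()) <= max_total:
        return "match"       # some exact matches, small differences elsewhere
    return "no match"        # some attribute differs too much


# The 0, 0, 0, 6 case: only the affiliation spelling differs.
abbreviated = {"first_name": 0, "last_name": 0, "email": 0, "affiliation": 6}

# Names are close, but the email addresses are quite different.
different_email = {"first_name": 1, "last_name": 0, "email": 12, "affiliation": 0}

# No attribute matches exactly.
nothing_exact = {"first_name": 1, "last_name": 2, "email": 9, "affiliation": 6}
```

With these placeholder settings, `abbreviated` is accepted, `different_email` is rejected despite the near-identical names, and `nothing_exact` is flagged for review.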

popbr / data-integration

Implementing Levenshtein Distance #18