Open MNSleeper opened 1 year ago
@aubertc
Yes, running LD after the data has been imported is best. It also has two other advantages:
On top of this, there is one small risk: if a misspelling ever gets accepted as the proper way to spell something, then every proper spelling of that word would be "corrected" to the improper one. My only thought is to keep an internal dictionary of all proper spellings.
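The dictionary guard could be sketched in a few lines. Everything here is illustrative: `PROPER_SPELLINGS`, `safe_correct`, and the sample entries are hypothetical names and placeholder data, not part of any existing codebase:

```python
# Hypothetical guard: never let an accepted misspelling overwrite a
# known-proper spelling. The entries below are placeholders.
PROPER_SPELLINGS = {"University of Massachusetts", "Cambridge"}

def safe_correct(value: str, proposed: str) -> str:
    """Apply a proposed correction only when the current value is not
    already a known-proper spelling."""
    if value in PROPER_SPELLINGS:
        return value  # protected: do not "correct" a proper spelling
    return proposed
```

With this in place, a proper spelling can still serve as the correction target, but can never itself be overwritten.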
Yes, we should weight those things: if the LD of, say, the first and last name is very small, but the email addresses are different, then we should probably not match them. As a first approximation, we should match things (= fix spelling) only if we have other elements supporting that those spellings refer to the same entity.
Finally, what should an acceptable LD be? Should we use a static number, like 3, or base it on some metric like the length of the strings being compared?
We should use a multi-variate score: if the LDs on the attributes used for identification (say, first name + last name + email + affiliation) are 0, 0, 0, 6, then we may be good (this would be "Univ. of Massachusetts" vs. "University of Massachusetts"), but anything with no 0 at all (that is, no exact match on any attribute) is probably suspicious. Finding the right balance will be hard, but we should definitely take more than one LD into account before making a decision.
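A minimal sketch of such a multi-variate score, matching the example above. The field names, the "at least one exact match" rule, and the total-distance budget of 6 are illustrative assumptions taken from this discussion, not a settled design:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,           # deletion
                           cur[j - 1] + 1,        # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Assumed identification attributes, per the example in the thread.
FIELDS = ("first", "last", "email", "affiliation")

def plausible_match(a: dict, b: dict, budget: int = 6) -> bool:
    """Match only if at least one attribute is an exact match (LD 0)
    and the total distance across all attributes stays within budget."""
    dists = [levenshtein(a[f], b[f]) for f in FIELDS]
    return 0 in dists and sum(dists) <= budget
```

On the thread's own example, "Univ. of Massachusetts" vs. "University of Massachusetts" gives LD 6, so two records agreeing exactly on name and email would match, while records with no exact field anywhere would be rejected as suspicious.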
There are two main ways to implement LD, listed below:
The latter option seems best.
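Whichever route is taken, the distance itself is the standard dynamic-programming recurrence, and a length-relative cutoff (one possible answer to the static-vs-relative question above) can be layered on top. A sketch; the full-matrix version and the `max_ratio` value of 0.2 are assumptions for illustration:

```python
def levenshtein(a: str, b: str) -> int:
    """Full-matrix dynamic-programming edit distance."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of a's prefix
    for j in range(n + 1):
        d[0][j] = j  # insert all of b's prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[m][n]

def within_threshold(a: str, b: str, max_ratio: float = 0.2) -> bool:
    """Accept the pair if the distance is small relative to the longer
    string, instead of using a static cutoff like 3."""
    limit = max(1, int(max_ratio * max(len(a), len(b))))
    return levenshtein(a, b) <= limit
```

A relative threshold like this scales with the data: one typo in a short name passes, while the same absolute distance on a short string would be rejected if the strings share almost nothing.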