Closed bengoehring closed 1 year ago
Disclaimer: I am a regular fastLink user, not a developer.
I suggest that you need at least one more linkage variable in addition to Gender and Name.
One of several possible recommendations is to add Address or Date_of_birth; the conceptual algorithm is known as ADGN (Ansolabehere and Hersh 2017). As another example, I often find the 9-digit Social Security Number (SSN) to be very useful as a linkage variable.
I do not see fiscal_year used in the linkage, so it does not seem to be needed to illustrate your issue.
Reference: Stephen Ansolabehere & Eitan D. Hersh (2017). ADGN: An Algorithm for Record Linkage Using Address, Date of Birth, Gender, and Name. Statistics and Public Policy, 4(1), 1-10. DOI: 10.1080/2330443X.2017.1389620
Anders
Thanks for getting back to me!
Yes, additional linkage variables would be great, but unfortunately I am often working with just names. If I just need to increase cut.p when I am using only names as linkage variables, that is fine. I just wanted to be sure I was not missing something obvious.
Ben
Is your purpose to have as few duplicates as possible? Then you could combine the names into a single, more discriminating variable, which would result in fewer duplicates than now. For example, you could create one name variable from first_name + middle_name_initial + last_name_initial. In the example, the first four names would be "jamesea", "jameseb", "jameseb", and "jamesed".
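A minimal sketch of that combined key, written in plain Python rather than with fastLink itself. The function name composite_name_key and the last names in the sample records are hypothetical; the last names were picked only so that the keys come out as the four values quoted above.

```python
def composite_name_key(first_name, middle_name, last_name):
    """Build one key: first name + middle initial + last initial, lowercased."""
    first = first_name.strip().lower()
    # Slicing with [:1] keeps at most one character and is safe on empty strings.
    middle_initial = middle_name.strip().lower()[:1]
    last_initial = last_name.strip().lower()[:1]
    return first + middle_initial + last_initial


# Hypothetical records (not from the original data set).
records = [
    ("James", "E", "Adams"),
    ("James", "E", "Baker"),
    ("James", "Edward", "Brown"),
    ("James", "E", "Davis"),
]

keys = [composite_name_key(*record) for record in records]
print(keys)  # ['jamesea', 'jameseb', 'jameseb', 'jamesed']
```

The second and third records collapse to the same key, so a key like this narrows the candidate set but still leaves some pairs to be resolved by the other linkage variables.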
Thanks @aalexandersson for always providing great advice!
The problem is that for most observations in your sample data, the middle initial is just one letter, so it is basically a categorical variable that can take around 26 possible values (22 in your sample data if you trim middle names to be represented by just one letter).
@bengoehring did you try removing middle_name_initial from the list of variables that will be compared using a string similarity comparator? If you do so, then the comparison for the middle initial will be made in terms of exact matching for that variable.
Keep us posted!
Ted
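To illustrate why a one-letter middle initial behaves like a categorical variable, here is a small Python sketch. It uses difflib.SequenceMatcher from the standard library purely as a stand-in similarity measure; that is an assumption for illustration and not the comparator fastLink applies. The helper names similarity and exact_agreement are hypothetical.

```python
from difflib import SequenceMatcher
from string import ascii_lowercase


def similarity(a, b):
    """Stand-in string similarity in [0, 1]; not fastLink's comparator."""
    return SequenceMatcher(None, a, b).ratio()


def exact_agreement(a, b):
    """Exact comparison: the two values either agree or they do not."""
    return a == b


# For every pair of single letters, collect the similarity scores that occur.
scores = {similarity(x, y) for x in ascii_lowercase for y in ascii_lowercase}
print(sorted(scores))  # [0.0, 1.0]

# On one-letter values the similarity score carries the same information
# as an exact comparison, so there is no meaningful "partial" agreement.
same_information = all(
    (similarity(x, y) == 1.0) == exact_agreement(x, y)
    for x in ascii_lowercase
    for y in ascii_lowercase
)
print(same_information)  # True
```

Under this stand-in measure, two initials are either identical or completely dissimilar, which is why treating the field as an exact-match variable loses nothing.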
Thanks everybody. I really appreciate all of the suggestions. I will try exact matching on the middle name/initial variable and see how that looks.
Hi there,
Thank you for writing, and especially maintaining, such a great package. I worry this is going to be a silly question -- and I apologize if that is the case.
I am trying to assign unique ids to a roster of names. It seems, however, that some of the matches are too inclusive and include, as in the example below, strings that seem too distinct from one another to be considered matches. At the bottom I am including a screenshot of an example of this on the full dataset so you can get a better sense of the range of different names that are considered matches.
I am guessing this behavior can be fixed by tweaking some parameters (even though the threshold matching level is > .9?), but I wanted to bring it up here too in case something else is going awry.
Thanks again, Ben