Hi @msghankinson,
I would suggest playing with the cut.a option in fastLink. It sets the lower bound on string similarity at which two values of a string-valued variable count as agreeing. In the code below I lowered it to 0.92; the default is 0.94. Lowering the threshold lets the match capture nicknames and close misspellings. However, I would not recommend lowering it too much, as you will start finding a lot of false positives among names that are merely similar, e.g., Anna and Annabelle.
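library(fastLink)

# Link on first name only, with a slightly lower string-agreement threshold
fl_out <- fastLink(
  dfA = test_pool,
  dfB = test_key,
  varnames = c("first_name"),
  cut.a = 0.92,
  stringdist.match = c("first_name"),
  dedupe.matches = FALSE)

# Retrieve the matched records
matches_out <- getMatches(
  dfA = test_pool,
  dfB = test_key,
  fl.out = fl_out)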
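For a sense of what the cut.a threshold means for the names discussed here, the similarity scores can be checked by hand. This is only a sketch: it assumes fastLink's default string comparison (Jaro-Winkler with jw.weight = 0.10) and uses the stringdist package directly rather than fastLink's own internals, so treat the numbers as illustrative. The helper jw_sim is a made-up name for this example.

```r
library(stringdist)

# Jaro-Winkler similarity with prefix weight 0.10.
# Assumption: this mirrors fastLink's default settings
# (stringdist.method = "jw", jw.weight = 0.10).
jw_sim <- function(a, b) stringsim(a, b, method = "jw", p = 0.1)

sim_joe  <- jw_sim("Jose", "Joe")        # roughly 0.933
sim_anna <- jw_sim("Anna", "Annabelle")  # roughly 0.889

# Which pairs would count as agreeing at different thresholds?
for (cut in c(0.94, 0.92, 0.88)) {
  cat("cut.a =", cut,
      "| Jose/Joe agrees:", sim_joe >= cut,
      "| Anna/Annabelle agrees:", sim_anna >= cut, "\n")
}
```

Under these assumptions, Jose/Joe falls just below the default 0.94 but clears 0.92, while Anna/Annabelle only starts agreeing once the threshold drops to around 0.88.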
Note that I changed the order in which the datasets are passed to the functions, i.e., test_pool first (as dfA) and test_key second (as dfB). The getMatches function is written using dfA as the target. Alternatively, you can pull the matched rows directly by typing:
test_pool[fl_out$matches$inds.a, ]
test_key[fl_out$matches$inds.b, ]
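The two index vectors are parallel: the i-th entry of inds.a is paired with the i-th entry of inds.b. A minimal sketch of putting the pairs side by side follows. The data frames pool_demo and key_demo and the list matches_demo are made up for illustration (they are not the data from this thread, and the indices are hand-written rather than real fastLink output).

```r
# Hypothetical toy data, NOT the data from the original question
pool_demo <- data.frame(
  first_name = c("Jose", "Joe", "Maria", "Jose"),
  year = c(2010, 2012, 2013, 2014),
  stringsAsFactors = FALSE
)
key_demo <- data.frame(first_name = "Jose", stringsAsFactors = FALSE)

# Stand-in for fl_out$matches: hand-written indices for illustration only
matches_demo <- list(inds.a = c(1, 2, 4), inds.b = c(1, 1, 1))

# Row i of the result pairs pool row inds.a[i] with key row inds.b[i]
paired <- data.frame(
  pool_name = pool_demo$first_name[matches_demo$inds.a],
  pool_year = pool_demo$year[matches_demo$inds.a],
  key_name  = key_demo$first_name[matches_demo$inds.b],
  stringsAsFactors = FALSE
)
print(paired)
```

With dedupe.matches = FALSE, the same key row can appear several times in inds.b, as it does in this sketch.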
I hope this helps! If anything comes up, let us know.
All my best,
Ted
This worked perfectly! Thank you so much for the quick response and detailed feedback. - Michael
I have two datasets:
Here is simplified example data:
I would like to use fastLink to find every occurrence of "Jose" in test_pool, regardless of the year. I would also like to capture the times that "Jose" is accidentally written as "Joe". To do so, I have written the following:
However, this code only returns the record for "Jose" from 2014:
How can I use fastLink to find the records for c("Jose", 2010) and c("Joe", 2012)? Any help at all would be greatly appreciated.