kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage
How to return multiple matches #68

Closed msghankinson closed 1 year ago

msghankinson commented 1 year ago

I have two datasets:

  1. test_pool is a panel of names across years. But sometimes the names change slightly, as they are hand-coded.
  2. test_key contains a sample of the names in test_pool, but only from the most recent year.

Here is simplified example data:

test_pool <- structure(list(year = c(2010, 2012, 2014, 2014),
                            first_name = c("Jose", "Joe", "Jose", "Todd")),
                       row.names = c(NA, -4L),
                       class = c("tbl_df", "tbl", "data.frame"))
test_key <- structure(list(year = c(2014),
                           first_name = c("Jose")),
                      row.names = c(NA, -1L),
                      class = c("tbl_df", "tbl", "data.frame"))

I would like to use fastLink to find every occurrence of "Jose" in test_pool, regardless of the year. I would also like to capture the times that "Jose" is accidentally written as "Joe". To do so, I have written the following:

library(fastLink)

fl_out <- fastLink(
  dfA = test_key,
  dfB = test_pool,
  varnames = c("first_name"),
  stringdist.match = c("first_name"),
  dedupe.matches = FALSE)

matches_out <- getMatches(
  dfA = test_key, 
  dfB = test_pool,
  fl.out = fl_out)

However, this code only returns the records for "Jose" from 2014:

> matches_out
    year first_name gamma.1          posterior
1   2014       Jose       2 0.9694313510032212
1.1 2014       Jose       2 0.9694313510032212

How can I use fastLink to find the records for c("Jose", 2010) and c("Joe", 2012)? Any help at all would be greatly appreciated.

tedenamorado commented 1 year ago

Hi @msghankinson,

I would suggest playing with the cut.a option in fastLink. It sets the similarity threshold above which two values of a string-valued variable count as agreeing. In the code below, I lowered it to 0.92; by default it is set at 0.94. Why lower the threshold? To capture nicknames and small spelling variations. However, I would not recommend lowering it too much, as you will start finding a lot of false positives for names that are similar, e.g., Anna and Annabelle.

library(fastLink)

fl_out <- fastLink(
  dfA = test_pool,
  dfB = test_key,
  varnames = c("first_name"),
  cut.a = 0.92,  # lowered from the default of 0.94
  stringdist.match = c("first_name"),
  dedupe.matches = FALSE)

matches_out <- getMatches(
  dfA = test_pool, 
  dfB = test_key,
  fl.out = fl_out)
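
To see why 0.92 catches "Joe" while the default of 0.94 does not, here is a rough base-R sketch of Jaro-Winkler similarity. It assumes fastLink is running with its default string comparison (Jaro-Winkler with a prefix weight of 0.1). jaro_winkler() is a hypothetical helper written for illustration, not a function from fastLink, and it may differ in small details from what the package computes internally.

```r
# Illustrative Jaro-Winkler similarity in base R (hypothetical helper,
# not part of fastLink). p is the prefix weight, assumed to be 0.1.
jaro_winkler <- function(a, b, p = 0.1) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  if (n == 0 || m == 0) return(as.numeric(n == m))

  # Characters match if equal and no further apart than this window.
  window <- max(0, floor(max(n, m) / 2) - 1)
  x_hit <- rep(FALSE, n)
  y_hit <- rep(FALSE, m)
  for (i in seq_len(n)) {
    lo <- max(1, i - window)
    hi <- min(m, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!y_hit[j] && x[i] == y[j]) {
        x_hit[i] <- TRUE
        y_hit[j] <- TRUE
        break
      }
    }
  }
  k <- sum(x_hit)
  if (k == 0) return(0)

  # Transpositions: matched characters that appear in a different order.
  transp <- sum(x[x_hit] != y[y_hit]) / 2
  jaro <- (k / n + k / m + (k - transp) / k) / 3

  # Winkler bonus for a shared prefix of up to four characters.
  max_pre <- min(4, n, m)
  same <- x[seq_len(max_pre)] == y[seq_len(max_pre)]
  prefix <- if (all(same)) max_pre else which(!same)[1] - 1
  jaro + prefix * p * (1 - jaro)
}

sim_joe  <- jaro_winkler("Jose", "Joe")
sim_anna <- jaro_winkler("Anna", "Annabelle")
print(round(c(Jose_Joe = sim_joe, Anna_Annabelle = sim_anna), 3))
```

In this sketch, "Jose" vs. "Joe" scores about 0.933, which sits between the two thresholds, while "Anna" vs. "Annabelle" scores about 0.889, so pushing cut.a much below 0.9 would start to pair those as well.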

Note that I changed the order in which the datasets were entered in the functions, i.e., test_pool first and test_key second. The function getMatches is written using dfA as the target, so the rows it returns come from dfA. Alternatively, you can retrieve the matched rows directly by indexing:

test_pool[fl_out$matches$inds.a, ]
test_key[fl_out$matches$inds.b, ]
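
The two index vectors line up position by position, so they can also be used to put each matched pair on one row. The sketch below runs without fastLink: the matches object is a hand-written, hypothetical stand-in for fl_out$matches (not real output), chosen to show the case where rows 1 to 3 of test_pool are all paired with the single row of test_key.

```r
test_pool <- data.frame(
  year = c(2010, 2012, 2014, 2014),
  first_name = c("Jose", "Joe", "Jose", "Todd"),
  stringsAsFactors = FALSE
)
test_key <- data.frame(year = 2014, first_name = "Jose",
                       stringsAsFactors = FALSE)

# Hypothetical stand-in for fl_out$matches (hand-written, not real output):
# position i pairs row inds.a[i] of dfA with row inds.b[i] of dfB.
matches <- data.frame(inds.a = c(1, 2, 3), inds.b = c(1, 1, 1))

# One row per matched pair, pool record on the left, key record on the right.
paired <- data.frame(
  pool_year = test_pool$year[matches$inds.a],
  pool_name = test_pool$first_name[matches$inds.a],
  key_year  = test_key$year[matches$inds.b],
  key_name  = test_key$first_name[matches$inds.b],
  stringsAsFactors = FALSE
)
print(paired)
```

With a real fl_out, replacing matches with fl_out$matches should give the same kind of table, assuming the datasets were passed in the same order (test_pool as dfA, test_key as dfB).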

I hope this helps! If anything, let us know.

All my best,

Ted

msghankinson commented 1 year ago

This worked perfectly! Thank you so much for the quick response and detailed feedback. - Michael