kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

Dealing with aliases in FastLink #75

Open jkafka opened 11 months ago

jkafka commented 11 months ago

I need to conduct a linkage in R using both deterministic and probabilistic methods. The identifying fields we are using are first name, last name, and date of birth. One of our datasets includes information about civil legal system involvement, and the other involves information about arrests. Especially in the arrest data, a single person might have multiple aliases or different dates of birth recorded. It's hard to know which of those are legally correct, and sometimes only a combination of information across alias records provides the full picture about a person's identity. We do have a "fingerprint ID" that allows us to see how a person's identity has been recorded across time in the arrest data.

Is there a way to use the fastLink package that lets us keep (and leverage) all the nuanced information provided across these aliases when we undertake the linkage? Or is it necessary to de-duplicate the arrest data first and choose a single name for each person? That feels arbitrary, and it would inevitably lose important data that could critically improve the validity of the linkage.

An example dataset is shown below. As you'll notice, the first person has 4 different entries with minor variations in first name, last name, and DOB.

arrests <- data.frame(
  fingerprint_id = c("123321", "123321", "123321", "123321", "431940", "532523"),
  first = c("Joseph", "Johan", "Johan", "Johan", "Kristn", "Adam"),
  last = c("Shmo", "Shomseff", "Shomseff", "Shomsef", "Mickleson", "Gregerson"),
  dob = c("05/25/1987", "05/25/1987", "02/25/1987", "02/25/1987", "01/17/1955", "06/05/1995")
)

Thanks for any guidance on this. I'm new to the linkage world.

aalexandersson commented 11 months ago

Disclaimer: I am a fastLink user, not a fastLink developer

The accuracy of the linkage results will greatly improve if you have more and better linkage variables, such as Social Security Number. But I would like to focus my reply on fastLink.

I am not aware of an always-best answer. The fastLink article (2019, page 362) found no distinguishable difference between a one-to-one match and a one-to-many match. Therefore, personally, I almost always use a one-to-one match. But in some other cases it is better to de-duplicate before doing a record linkage. It depends on the context, for example what you know about the data, how much time you have for doing the record linkage, including a manual ("clerical") review of uncertain record pairs, and what is an acceptable threshold matching probability.
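To make the two options concrete, here is a minimal base R sketch using the example data from the question. It is only an illustration under assumptions: the "keep the most frequent value" rule, the helper `most_common`, and the `matched` index pairs are all hypothetical, and none of this is part of fastLink itself.

```r
# Example data copied from the question above.
arrests <- data.frame(
  fingerprint_id = c("123321", "123321", "123321", "123321", "431940", "532523"),
  first = c("Joseph", "Johan", "Johan", "Johan", "Kristn", "Adam"),
  last  = c("Shmo", "Shomseff", "Shomseff", "Shomsef", "Mickleson", "Gregerson"),
  dob   = c("05/25/1987", "05/25/1987", "02/25/1987", "02/25/1987",
            "01/17/1955", "06/05/1995"),
  stringsAsFactors = FALSE
)

# Option 1: de-duplicate before linking.
# Hypothetical rule: keep the most frequent value of each field per person.
# Ties go to the alphabetically first value, which is an arbitrary choice.
most_common <- function(x) names(which.max(table(x)))
one_per_person <- aggregate(
  arrests[c("first", "last", "dob")],
  by = arrests["fingerprint_id"],
  FUN = most_common
)

# Option 2: keep every alias row, link at the alias level, and collapse the
# matched rows to persons afterwards.
# `matched` is a made-up stand-in for the index pairs a linkage would return
# (inds.a = row in the other file, inds.b = row in `arrests`).
matched <- data.frame(inds.a = c(1, 1, 2), inds.b = c(2, 4, 5))
matched$fingerprint_id <- arrests$fingerprint_id[matched$inds.b]
person_links <- unique(matched[c("inds.a", "fingerprint_id")])
```

In this sketch, Option 1 has to break a tie between the two recorded dates of birth for the first person, which is exactly the arbitrariness raised in the question. Option 2 avoids that choice, but it only makes sense if the linkage is allowed to return more than one arrest row per person.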

Regardless, I recommend paying attention to the de-duplication results, especially if your main concern is wrong matches (false positives) rather than missed matches (false negatives). The default is dedupe.matches = TRUE, which has two algorithms: a default "greedy" one (faster but simple) and an optional "linear programming" one (slower but recommended for accuracy). If you have a large linkage, the default de-duplication algorithm might not be accurate enough and the optional algorithm might be too slow. In that situation, you can still get good results by running fastLink a second time with dedupe.matches = FALSE. This gives you two sets of results that differ only in the de-duplication. You can then manually review the uncertain record pairs on which the two runs differ.
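A rough sketch of that two-run comparison follows. The wrapper `link_twice` and the helper `pairs_dropped_by_dedupe` are hypothetical names of mine. I am assuming the linear programming option is switched on with the `linprog.dedupe` argument and that each result stores its matched row indices in `$matches` as `inds.a` and `inds.b`, so please check both against the package documentation. The toy index tables are made up so the comparison step can be run without fastLink installed.

```r
# Hypothetical wrapper: run fastLink with and without de-duplication.
# Only defined here, not called; it needs the fastLink package and real data.
link_twice <- function(dfA, dfB, vars, use_lp = FALSE) {
  list(
    dedup = fastLink::fastLink(
      dfA = dfA, dfB = dfB, varnames = vars,
      dedupe.matches = TRUE,
      linprog.dedupe = use_lp  # assumed switch for the linear programming option
    ),
    all = fastLink::fastLink(
      dfA = dfA, dfB = dfB, varnames = vars,
      dedupe.matches = FALSE
    )
  )
}

# Pairs present in the run without de-duplication but absent from the
# de-duplicated run: candidates for manual review.
pairs_dropped_by_dedupe <- function(m_all, m_dedup) {
  key_all   <- paste(m_all$inds.a, m_all$inds.b)
  key_dedup <- paste(m_dedup$inds.a, m_dedup$inds.b)
  m_all[!(key_all %in% key_dedup), , drop = FALSE]
}

# Made-up index pairs standing in for res$all$matches and res$dedup$matches.
m_all   <- data.frame(inds.a = c(1, 1, 2), inds.b = c(1, 2, 5))
m_dedup <- data.frame(inds.a = c(1, 2),    inds.b = c(1, 5))
review  <- pairs_dropped_by_dedupe(m_all, m_dedup)
```

With real output you would pass `res$all$matches` and `res$dedup$matches` from `link_twice()` to the helper and then inspect the corresponding rows of both files by hand.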

Also, fastLink does not require exact matching of an individual linkage variable. In fact, you are encouraged to try out different and more flexible matching options depending on the data and what you know about the data. A few examples are "Jaro-Winkler" and "Double Metaphone" for names, and using age instead of dob.
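As a small illustration of those flexible options, the sketch below derives age from the date-of-birth strings in the question and measures how far apart two of the recorded last names are. The helper `age_at`, its fixed reference date, and the wrapper `link_fuzzy` are my own hypothetical choices. The wrapper assumes both data frames have `first`, `last`, and `age` columns and uses the `stringdist.match`, `stringdist.method`, `partial.match`, and `numeric.match` arguments as I understand them from the fastLink documentation.

```r
# Whole-year age at a fixed reference date (the date is an arbitrary choice).
age_at <- function(dob_chr, ref = as.Date("2020-01-01")) {
  dob <- as.Date(dob_chr, format = "%m/%d/%Y")
  # TRUE when the birthday has not yet occurred in the reference year
  before_birthday <- format(ref, "%m%d") < format(dob, "%m%d")
  as.numeric(format(ref, "%Y")) - as.numeric(format(dob, "%Y")) - before_birthday
}
ages <- age_at(c("05/25/1987", "02/25/1987", "01/17/1955", "06/05/1995"))

# Levenshtein edit distance between two spellings of the same last name
# (base R; Jaro-Winkler itself is not available in base R).
d <- drop(adist("Shomseff", "Shomsef"))

# Hypothetical wrapper: string-distance matching on names, numeric matching on age.
# Only defined here, not called; it needs the fastLink package and real data.
link_fuzzy <- function(dfA, dfB) {
  fastLink::fastLink(
    dfA = dfA, dfB = dfB,
    varnames = c("first", "last", "age"),
    stringdist.match = c("first", "last"),
    stringdist.method = "jw",            # Jaro-Winkler
    partial.match = c("first", "last"),
    numeric.match = "age"
  )
}
```

Note that the two conflicting dates of birth for the first person (05/25/1987 and 02/25/1987) give the same age at this reference date, which is one reason age can be more forgiving than an exact dob comparison.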

In summary, the fastLink defaults are usually good enough that there is nothing special that you have to do, except maybe if you need blocking and a confusion table. If you run into any such issues, see my comment in issue #63. The fastLink developers are working on updating fastLink for easier, faster, and more accurate results, so stay tuned!