kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

Using partial.match argument results in different matches #27

Closed sysilviakim closed 6 years ago

sysilviakim commented 6 years ago

I was attempting a simple record linkage as follows, and I noticed something very strange. When I specify the partial.match argument, the outcome of the match is different from when it is not specified.

In the following made-up data, there are four entities in each dataframe. Entities 2, 3, and 4 are the same people. Records 2 and 4 are exactly the same; for 3, the address changes.

library(tidyverse)
library(fastLink)

dfA_synthetic <- tibble(  # tibble() replaces the deprecated data_frame()
  NameLast = c("Kim", "Lee", "Park", "Choi"), 
  NameFirst = c("Julie", "Joanna", "Jessica", "Jennifer"), 
  Address = c("500 E 6th St",  "400 W 5th Rd", 
              "100 S Main St", "200 N Main St"), 
  City = c("Santa Ana", "Laguna Hills", "Fullerton", "Pasadena"), 
  StreetName = c("6th", "5th", "Main", "Main"), 
  StreetSuffix = c("St", "Rd", "St", "St"), 
  MailAddress1 = c("500 E 6th St",  "400 W 5th Rd", 
                   "100 S Main St", "200 N Main St"), 
  MailAddress2 = c("Santa Ana CA 92701", "Laguna Hills CA 92653", 
                   "Fullerton CA 92831", "Pasadena CA 91106"), 
  Phone = c("", "", "(626)395-4701", "(626)529-3219"), 
  BirthDate = c("01/01/1988", "02/02/1977", "03/03/1999", "04/04/2000")
)

dfB_synthetic <- tibble(  # tibble() replaces the deprecated data_frame()
  NameLast = c("Hong", "Lee", "Park", "Choi"), 
  NameFirst = c("Jean", "Joanna", "Jessica", "Jennifer"), 
  Address = c("600 S Catalina St",  "400 W 5th Rd", 
              "100 S Main St", "200 N Main St"), 
  City = c("Los Angeles", "Laguna Hills", "Fullerton", "Pasadena"), 
  StreetName = c("6th", "5th", "Main", "Main"), 
  StreetSuffix = c("Dr", "Rd", "St", "St"), 
  MailAddress1 = c("600 S Catalina Dr",  "400 W 5th Rd", 
                   "PO Box 3000", "200 N Main St"), 
  MailAddress2 = c("Pasadena 91125", "Laguna Hills CA 92653", 
                   "Anaheim CA 92800", "Pasadena CA 91106"), 
  Phone = c("", "", "(626)395-4701", "(626)529-3219"), 
  BirthDate = c("09/09/1966", "02/02/1977", "03/03/1999", "04/04/2000")
)

m_synthetic_1 <- fastLink(
  dfA = dfA_synthetic, dfB = dfB_synthetic, 
  varnames = names(dfA_synthetic), 
  stringdist.match = names(dfA_synthetic), 
  partial.match = names(dfA_synthetic)
)

m_synthetic_2 <- fastLink(
  dfA = dfA_synthetic, dfB = dfB_synthetic, 
  varnames = names(dfA_synthetic), 
  stringdist.match = names(dfA_synthetic) ## , 
  ## partial.match = names(dfA_synthetic)
)

Then,

> m_synthetic_1$matches$inds.a
[1] 3 4
> m_synthetic_2$matches$inds.a
[1] 2 3 4

From what I read from the manual and the GitHub README.md,

partial.match is another vector of variable names present in both stringdist.match and varnames. A variable included in partial.match will have a partial agreement category calculated in addition to disagreement and absolute agreement, as a function of Jaro-Winkler distance.

I understood partial.match to return extra summary statistics of some sort that show the degree of agreement on the specified variables---something that should not change the match itself. Am I misunderstanding the function?

p.s. This is something separate, and just a small suggestion, but I feel that data(samplematch) might not be the best data to demonstrate the strength of the package for those who first check out the package, because with the samplematch's dfA and dfB, you can simply call inner_join and get the same output much faster. Maybe a revised dataset, as follows:

library(magrittr)  # provides %<>%, which library(tidyverse) does not attach
dfA %<>%
  dplyr::mutate_if(is.factor, as.character) %>%
  dplyr::mutate(middlename = ifelse(lastname == "weatherspoon", NA, middlename))
dfB %<>% dplyr::mutate_if(is.factor, as.character)
matches.out.revised <- fastLink(
  dfA = dfA, dfB = dfB, 
  varnames = c("firstname", "middlename", "lastname", "housenum", "streetname", "city", "birthyear"),
  stringdist.match = c("firstname", "middlename", "lastname", "streetname", "city"),
  partial.match = c("firstname", "lastname", "streetname"), 
  cut.p = 0.8, threshold.match = 0.8
)

tedenamorado commented 6 years ago

Hi,

Thanks a lot for your interest in fastLink. I apologize it took me a bit to reply, I have been out of town.

As you rightly point out, the test data we released with fastLink are not ideal: they contain no misspellings, so a simple merge does just as well. We are planning to add some noise to the data so that it does justice to what fastLink does best.

Now, regarding your example: partial matching across linkage fields is something you want when typographical errors are present in your data. When you list variables in partial.match, every comparison on those variables takes one of three values: agreement, partial agreement, or disagreement. As you may have noticed, you received a warning message when using partial matches:

In gammaCKpar(dfA[, varnames[i]], dfB[, varnames[i]], cut.a = cut.a, : There are no partial matches. We suggest either changing the value of cut.p or using gammaCK2par() instead

This warning arises because typographical mistakes are not pervasive in your data, so a model with two agreement categories performs better.
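
To make the three agreement categories concrete, here is a minimal sketch of how a Jaro-Winkler similarity could be cut into agreement, partial agreement, and disagreement. It is an illustration, not fastLink's internal code: jw_similarity and classify_similarity are hypothetical helper names, the stringdist package is used directly, and the thresholds 0.94 and 0.88 are assumed to correspond to the defaults of cut.a and cut.p.

```r
library(stringdist)  # assumption: installed (fastLink itself depends on it)

# Jaro-Winkler similarity in [0, 1]. stringdist() returns a distance,
# so subtract it from 1 to get a similarity.
jw_similarity <- function(a, b, w = 0.10) {
  1 - stringdist::stringdist(a, b, method = "jw", p = w)
}

# Hypothetical helper: map similarities onto three agreement categories.
# cut.a and cut.p mirror fastLink's argument names; the values 0.94 and
# 0.88 are assumed defaults, so check them against your installed version.
classify_similarity <- function(sim, cut.a = 0.94, cut.p = 0.88) {
  ifelse(sim >= cut.a, "agree",
         ifelse(sim >= cut.p, "partial", "disagree"))
}

a <- c("Lee", "Jessica", "Kim")
b <- c("Lee", "Jesica", "Hong")
sim <- jw_similarity(a, b)
data.frame(a, b, similarity = sim, category = classify_similarity(sim))
```

If no pair of values falls between cut.p and cut.a, the partial category is empty, which is the situation the warning above describes.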

Please, if anything is unclear, just let us know.

All my best,

Ted

sysilviakim commented 6 years ago

Hi Ted, thank you for your reply.

I should have added more details to my first post, and I accidentally compared m_synthetic_1$matches$inds.a and m_synthetic_1$matches$inds.b while I meant to contrast m_synthetic_2$matches$inds.a---I apologize. I updated my initial post.

In my fake data,

> inner_join(dfA_synthetic, dfB_synthetic)
Joining, by = c("NameLast", "NameFirst", "Address", "City", "StreetName", "StreetSuffix", "MailAddress1", "MailAddress2", "Phone", "BirthDate")
# A tibble: 2 x 10
  NameLast NameFirst Address  City   StreetName StreetSuffix MailAddress1 MailAddress2 Phone BirthDate
  <chr>    <chr>     <chr>    <chr>  <chr>      <chr>        <chr>        <chr>        <chr> <chr>    
1 Lee      Joanna    400 W 5~ Lagun~ 5th        Rd           400 W 5th Rd Laguna Hill~ ""    02/02/19~
2 Choi     Jennifer  200 N M~ Pasad~ Main       St           200 N Main ~ Pasadena CA~ (626~ 04/04/20~

However, when you specify partial.match, as in m_synthetic_1, fastLink tells me that observation 2 (Joanna) is not the same entity, while it correctly shows that observation 4 (Jennifer) is. What I was asking was why m_synthetic_1$matches$inds.a (wrongly) gives me 3 and 4, while m_synthetic_2$matches$inds.a, from a fastLink call without partial.match, correctly gives me 2, 3, and 4 as matches.

I understand better now what partial.match does, but it still seems to me that the matches should be the same whether we specify it or not.

tedenamorado commented 6 years ago

Hi @sysilviakim,

Thanks for the clarification! I see what you mean. Still, when you set partial.match you are fitting a different model: although most of the variables you are using have no partial matches, two of them do. What is tricky in your example, however, is not the partial matches but the redundant fields in your data. Redundant fields violate a key assumption of the model we use, namely that fields are independent conditional on matching status.

For example, the following lines present an example that fixes the discrepancy:

m_synthetic_1a <- fastLink(
  dfA = dfA_synthetic, dfB = dfB_synthetic, 
  varnames = c("NameLast", "NameFirst", "Address", "City", "Phone", "BirthDate"), 
  stringdist.match = c("NameLast", "NameFirst", "Address", "City"), 
  partial.match = c("NameLast", "NameFirst", "Address", "City")
)

m_synthetic_2a <- fastLink(
  dfA = dfA_synthetic, dfB = dfB_synthetic, 
  varnames = c("NameLast", "NameFirst", "Address", "City", "Phone", "BirthDate"), 
  stringdist.match = c("NameLast", "NameFirst", "Address", "City") ## ,
  ## partial.match = c("NameLast", "NameFirst", "Address", "City")
)

I finally started working on a paper on best practices. Issues like the one you raised will be at the core of such a project.

If anything is still unclear, please let me know. We are here to help!

All my best,

Ted

sysilviakim commented 6 years ago

Hi @tedenamorado,

Thank you for the second post. I see---I think I initially misunderstood your first post. I understand now that partial.match results in a different model. And I completely overlooked the conditional independence assumption in Section 2.2.1 of the paper! I guess I'll manually discard redundant variables for now because there aren't that many columns.

Thank you again.

aalexandersson commented 6 years ago

The functions compare.dedupe and compare.linkage in the R package RecordLinkage have the argument exclude, which excludes fields/columns/variables. I prefer the fastLink syntax of only having to specify what you need for the linkage. But perhaps fastLink should return a warning message if there is a redundant field in the specified syntax?
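
As a rough illustration of what such a check might look like, the sketch below flags pairs of columns where one value is contained in (or equal to) the other in most rows. It is not part of fastLink: find_redundant_fields and its threshold argument are hypothetical, and string containment is only one crude proxy for redundancy. The example data reuse a few columns from the synthetic data earlier in this thread.

```r
# Hypothetical helper (not a fastLink function): flag pairs of linkage
# fields where one column largely duplicates, or is contained in, another.
find_redundant_fields <- function(df, varnames = names(df), threshold = 0.9) {
  pairs <- utils::combn(varnames, 2, simplify = FALSE)
  flagged <- Filter(function(p) {
    x <- as.character(df[[p[1]]])
    y <- as.character(df[[p[2]]])
    # Only compare rows where both fields are present and non-empty.
    ok <- !is.na(x) & !is.na(y) & nzchar(x) & nzchar(y)
    if (!any(ok)) return(FALSE)
    # TRUE when x is contained in y or y in x (equality is a special case).
    contained <- mapply(function(a, b) {
      grepl(a, b, fixed = TRUE) || grepl(b, a, fixed = TRUE)
    }, x[ok], y[ok])
    mean(contained) >= threshold
  }, pairs)
  vapply(flagged, paste, character(1), collapse = " ~ ")
}

# A few columns from dfA_synthetic in the original post.
dfA_addr <- data.frame(
  NameLast     = c("Kim", "Lee", "Park", "Choi"),
  Address      = c("500 E 6th St", "400 W 5th Rd", "100 S Main St", "200 N Main St"),
  StreetName   = c("6th", "5th", "Main", "Main"),
  MailAddress1 = c("500 E 6th St", "400 W 5th Rd", "100 S Main St", "200 N Main St"),
  stringsAsFactors = FALSE
)

find_redundant_fields(dfA_addr)
```

Any pair flagged this way is a candidate for dropping one of the two fields from varnames before calling fastLink, as in the m_synthetic_1a example above.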