nameReweight NA issue - GitHub Issues

Hi @AuriantEmeric,

I hope all is well. Sorry for the late reply.

The name reweight function takes the empirical distribution of names and reweights matches according to name frequency. As a result, matches on common names are down-weighted, while matches on infrequent names get a higher matching probability.
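To make "empirical distribution of names" concrete, here is a minimal sketch in base R. The name vector is made-up toy data, and the snippet only computes name shares; it is not the reweighting formula fastLink uses internally.

```r
## Toy data (hypothetical), not taken from the fastLink sample data
first.names <- c("john", "john", "john", "maria", "maria", "zebulon")

## Empirical distribution: the share of records carrying each name
name.freq <- prop.table(table(first.names))

## Most common names first; these are the ones whose matches get down-weighted
sort(name.freq, decreasing = TRUE)
```

In this toy vector, "john" accounts for half of the records, so an agreement on "john" is weaker evidence of a true match than an agreement on "zebulon".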

Our code, as it currently stands, can reweight probabilities based on one field. For example:

## Load the package and data
library(fastLink)
data(samplematch)

## The fastLink function only allows you to reweight one field at a time
matches.out <- fastLink(
  dfA = dfA, dfB = dfB,
  varnames = c("firstname", "middlename", "lastname", "housenum", "streetname", "city", "birthyear"),
  stringdist.match = c("firstname", "middlename", "lastname", "streetname", "city"),
  partial.match = c("firstname", "lastname", "streetname"),
  reweight.names = TRUE,
  firstname.field = c("firstname")
)

## You can also reweight by last name: pass "lastname" to firstname.field
matches.out <- fastLink(
  dfA = dfA, dfB = dfB,
  varnames = c("firstname", "middlename", "lastname", "housenum", "streetname", "city", "birthyear"),
  stringdist.match = c("firstname", "middlename", "lastname", "streetname", "city"),
  partial.match = c("firstname", "lastname", "streetname"),
  reweight.names = TRUE,
  firstname.field = c("lastname")
)

Now, to reweight by two fields, you would need to make further assumptions about the joint prevalence of first names and last names. For example, if you assume that first and last names are independent, you can simply multiply, for each pair, the matching probability adjusted for first name frequency by the one adjusted for last name frequency.
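As a rough sketch of that multiplication, assuming you have already extracted the two sets of reweighted probabilities for the same candidate pairs in the same order: the vectors `p.first` and `p.last`, their values, and the threshold below are all hypothetical and are not fastLink output.

```r
## Toy illustration only: made-up probabilities, one element per candidate pair
p.first <- c(0.98, 0.90, 0.75)  # after reweighting by first name frequency
p.last  <- c(0.95, 0.60, 0.80)  # after reweighting by last name frequency

## Under the independence assumption, combine by element-wise multiplication
p.both <- p.first * p.last

## Keep pairs whose combined probability clears a (hypothetical) threshold
threshold <- 0.85
keep <- p.both >= threshold
```

Note that the product can never exceed either of its inputs, so a threshold that worked for a single reweighted field may be too strict for the combined value.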

If anything is unclear, please do not hesitate to reach out.

All my best,

Ted

kosukeimai / fastLink

nameReweight NA issue #54