kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

nameReweight NA issue #54

Open EmericA570 opened 3 years ago

EmericA570 commented 3 years ago

Hello everyone,

Nice work with the package. It works well for me.

I just have a few questions about reweighting posterior probabilities. After using nameReweight, or just fastLink with reweight.names and firstname.field, I only get NA in zeta.name, and I don't understand why. I looked in the function and it seems to come from this line: 'matches.names.A$zeta.j.names[matches.names.A[,ind] != 2] <- NA'. But I don't understand what it is doing.

I would also need to reweight using more than one field. I have already made some modifications to do this, but I wanted to know whether there is a reason you didn't implement it.

In fact, I realized that I'm not really sure how to use the nameReweight function. Could you explain it to me?

Best,

Emeric

tedenamorado commented 3 years ago

Hi @AuriantEmeric,

I hope all is well. Sorry for the late reply.

The name reweight function takes the empirical distribution of names and reweights the match probabilities according to name frequency. Matches on common names are down-weighted, while matches on infrequent names are up-weighted.
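To make "empirical distribution of names" concrete, here is a minimal base R sketch. The `first_names` vector is made up for illustration (it stands in for a column such as a first name field), and this only shows the frequency table, not the adjustment formula that fastLink applies internally.

```r
# Hypothetical first names (illustrative data, not from samplematch)
first_names <- c("maria", "maria", "maria", "john", "john", "zenobia")

# Empirical distribution: share of records carrying each name
name_freq <- prop.table(table(first_names))

# Sort from most to least common; names at the top are the ones whose
# matches would be down-weighted, names at the bottom up-weighted
name_freq <- sort(name_freq, decreasing = TRUE)
print(name_freq)
```

In this toy data, a match on "maria" is less informative than a match on "zenobia", because half of the records share the former name.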

Our code, as it currently stands, can reweight probabilities based on one field. For example:

## Load the package and data
library(fastLink)
data(samplematch)

## The fastLink function only allows you to reweight one field at a time
matches.out <- fastLink(
  dfA = dfA, dfB = dfB,
  varnames = c("firstname", "middlename", "lastname", "housenum", "streetname", "city", "birthyear"),
  stringdist.match = c("firstname", "middlename", "lastname", "streetname", "city"),
  partial.match = c("firstname", "lastname", "streetname"),
  reweight.names = TRUE,
  firstname.field = c("firstname")
)

## You can also reweight by last name by passing it to firstname.field
matches.out <- fastLink(
  dfA = dfA, dfB = dfB,
  varnames = c("firstname", "middlename", "lastname", "housenum", "streetname", "city", "birthyear"),
  stringdist.match = c("firstname", "middlename", "lastname", "streetname", "city"),
  partial.match = c("firstname", "lastname", "streetname"),
  reweight.names = TRUE,
  firstname.field = c("lastname")
)

Now, to reweight by two fields, you would need to make further assumptions about the joint prevalence of first names and last names. For example, if you were to assume that first and last names are independent, then you can just multiply the matching probability adjusted for first name frequency by the one adjusted for last name frequency.
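As a rough sketch of that multiplication, assuming independence: the two data frames below are made-up stand-ins for the reweighted probabilities from a first name run and a last name run. The column names `inds.a`, `inds.b`, `zeta.first`, and `zeta.last` are hypothetical and would need to be mapped to whatever your fastLink output actually contains.

```r
# Hypothetical reweighted probabilities from the run on first names
first <- data.frame(
  inds.a = c(1, 2, 3),
  inds.b = c(11, 12, 13),
  zeta.first = c(0.99, 0.90, 0.60)
)

# Hypothetical reweighted probabilities from the run on last names
last <- data.frame(
  inds.a = c(1, 2, 4),
  inds.b = c(11, 12, 14),
  zeta.last = c(0.98, 0.95, 0.70)
)

# Keep only the record pairs that appear in both runs
both <- merge(first, last, by = c("inds.a", "inds.b"))

# Under the independence assumption, combine by multiplying
both$zeta.both <- both$zeta.first * both$zeta.last
print(both)
```

Note that any pair with NA in either column yields NA in the product, so you would have to decide separately how to treat those pairs.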

If anything, please do not hesitate to reach out.

All my best,

Ted