kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

Relaxing the conditional independence assumption? #82

Open zmbc opened 2 months ago

zmbc commented 2 months ago

S4 of the appendix of the fastLink paper describes two methods for relaxing the conditional independence assumption. It looks like a simulation study of these was done, but I don't immediately see any documentation on how to do this in fastLink. Does the code exist somewhere?

tedenamorado commented 2 months ago

Hi,

Sure thing. To implement the second approach discussed in that Appendix, the following code should work:

## Load the package and the example data (provides dfA and dfB)
library(fastLink)
data(samplematch)

## cond.indep = FALSE relaxes the conditional independence assumption;
## the default (cond.indep = TRUE) assumes conditional independence
matches.out <- fastLink(
  dfA = dfA, dfB = dfB,
  varnames = c("firstname", "middlename", "lastname", "housenum", "streetname", "city", "birthyear"),
  stringdist.match = c("firstname", "middlename", "lastname", "streetname", "city"),
  partial.match = c("firstname", "lastname", "streetname"),
  cond.indep = FALSE
)

Because the two approaches described in the Appendix produce similar results, and the second is a standard extension of the Fellegi-Sunter model, we implemented only the second approach in fastLink.

If anything is unclear, please do not hesitate to let us know.

All my best,

Ted

zmbc commented 1 month ago

Thank you @tedenamorado!

I have a conceptual question about this that I'm wondering if you can shed some light on. It seems to me that S4.1 describes an approach that would modify both EM (parameter estimation) and prediction. It looks like the implementation of S4.2 affects only the parameter estimation. Is my understanding correct? Have you considered including interaction effects at prediction time?

tedenamorado commented 2 weeks ago

Hi @zmbc,

Both methods impact parameter estimation and, consequently, the predictions based on the Fellegi-Sunter model. The approach in S4.1 treats an interaction as a single linkage field comparison by combining the information from two different linkage fields. In contrast, the approach in S4.2 allows each linkage field and each interaction to contribute to parameter estimation. S4.2 complicates the EM algorithm because its updates are no longer based on closed-form solutions (though the approach is standard in the literature), whereas S4.1 keeps the EM algorithm simple but does not account for all possible two-way interactions.

If anything is unclear, please do not hesitate to let us know.

Ted

zmbc commented 1 week ago

@tedenamorado Apologies if I haven't been clear here. When I say "modify prediction" I mean the structure of the prediction model, not only the estimates of the parameters in it.

I'll give a concrete example. In S4.1 the agreement values of first name and date of birth are combined, so instead of having two agreement variables with 2 levels each, there is a single agreement variable with 4 levels (0-0, 0-1, 1-0, 1-1). If I understand correctly, this would result in m and u probabilities being independently estimated for each of the 4 levels. There is nothing forcing the match weight of 1-1 to equal the match weight of 1-0 + the match weight of 0-1. In fact it is probably quite a bit lower if the two agreements are correlated.
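To make the combined variable concrete, here is a minimal base R sketch. The agreement vectors are made up for illustration, and this is not how fastLink builds its comparison data internally; it only shows two binary agreement indicators collapsing into one 4-level field.

```r
# Hypothetical agreement indicators for five record pairs
# (1 = agree, 0 = disagree); values are illustrative only.
agree_first <- c(1, 1, 0, 0, 1)  # first name
agree_dob   <- c(1, 0, 1, 0, 1)  # date of birth

# Combine the two binary fields into a single 4-level comparison field.
combined <- factor(
  paste(agree_first, agree_dob, sep = "-"),
  levels = c("0-0", "0-1", "1-0", "1-1")
)

# Each level would get its own m and u probability under this setup.
counts <- table(combined)
print(counts)
```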

Then, when predictions are made, match weights are applied according to the actual interaction pattern of each pair being scored. That is, a pair with 1-0 will get a fairly high match weight, as will a pair with 0-1, but a pair with 1-1 will get a match weight that is lower than the sum of those.

If I understand correctly, S4.2 does not do this. Instead, it would help when estimating the parameters for a match on first name or date of birth, likely lowering the match weight of both if they are correlated, but at prediction time the structure still dictates that the match weight of 1-1 = the match weight of 0-1 + the match weight of 1-0.
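The structural difference in question can be sketched numerically in base R. All probabilities below are made-up assumptions, and nothing here calls fastLink or reproduces its estimation. The sketch only contrasts match weights computed per joint pattern (one weight per cell of the 4-level field) with weights that are forced to be additive across the two fields, as they are under conditional independence.

```r
# Hypothetical joint agreement probabilities for (first name, date of birth).
# Rows: first name, columns: date of birth; "0" = disagree, "1" = agree.
lv <- c("0", "1")
# m: among true matches; u: among non-matches. Illustrative values only.
m_joint <- matrix(c(0.04, 0.06,
                    0.06, 0.84),
                  nrow = 2, byrow = TRUE, dimnames = list(first = lv, dob = lv))
u_joint <- matrix(c(0.970, 0.010,
                    0.015, 0.005),
                  nrow = 2, byrow = TRUE, dimnames = list(first = lv, dob = lv))

# Joint structure: one log-likelihood-ratio weight per agreement pattern.
w_joint <- log(m_joint / u_joint)

# Additive structure: per-field weights from the marginals, summed per pattern.
w_first <- log(rowSums(m_joint) / rowSums(u_joint))
w_dob   <- log(colSums(m_joint) / colSums(u_joint))
w_additive <- outer(w_first, w_dob, "+")

print(round(w_joint, 2))
print(round(w_additive, 2))
```

With these assumed numbers, agreement on both fields is far more common among non-matches than the marginals alone would imply, so the joint weight for 1-1 comes out smaller than the additive one. An additive prediction structure cannot express that gap, whatever the values of its field-level parameters.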