kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

Looking for a way to feed threshold cutoffs to individual variables #66

Open ajw5296 opened 1 year ago

ajw5296 commented 1 year ago

Is there a way to set different cutoff values for certain variables? For instance, if the DOB value for a potential match isn't above 0.9, the pair wouldn't be considered a match, while all other variables have a cutoff of 0.8.

tedenamorado commented 1 year ago

@ajw5296 if you are using the fastLink wrapper function, it is not possible (those cutpoints are global).

If you have any other questions, let us know.

All my best,

Ted

tedenamorado commented 1 year ago

@ajw5296 can you provide an example of what you have in mind here? Is your question about how we compare variables, or about the weight each variable receives when predicting the probability that two records are the same?

Looking forward to hearing from you!

Ted

ajw5296 commented 1 year ago

Hey @tedenamorado, my question is more about cutoffs and whether they can be set at the variable level. More precisely:

  1. Are individual matching probabilities calculated within the fastLink method? For example, a match might be something like .98 (fname), .98 (lname), .83 (dob), and these are then combined with their weights into the final overall posterior.

  2. Can we set threshold cutoffs for those individual variables, either in the method or through other methods? That is, despite fname and lname having a high probability, we would eliminate the potential match because the dob is below .9 (its respective cutoff).

I suppose this is kind of a question about weights in a way, but I think setting a higher weight for dob is methodologically different from setting a cutoff for dob. If setting parameters for weights is easier, though, I'm interested in looking into it.

And just as a note, we looked into the stringSubset method, but since DOBs are shared values, it didn't really help us much.

Let me know if I can provide more info, thanks for your help!

aalexandersson commented 1 year ago

I do not think it is possible in fastLink other than maybe to create ad hoc linkage variables and then work directly with the corresponding gammas. A similar open issue is https://github.com/kosukeimai/fastLink/issues/49.
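To make the "work directly with the gammas" idea concrete, here is a minimal sketch of a post hoc deterministic rule applied to candidate pairs after linkage. It is not a fastLink feature and calls no fastLink functions: the record layout, the names `meets_field_rules` and `filter_pairs`, and all values are hypothetical. It assumes each candidate pair carries its posterior match probability and a per-field agreement level coded 0 (disagree), 1 (partial), 2 (agree), with `None` for a missing comparison.

```python
def meets_field_rules(gamma, required_levels):
    """Return True if every listed field reaches its minimum agreement level."""
    for field, min_level in required_levels.items():
        level = gamma.get(field)
        # A missing comparison cannot satisfy a hard rule in this sketch.
        if level is None or level < min_level:
            return False
    return True


def filter_pairs(pairs, posterior_cutoff, required_levels):
    """Keep pairs that pass the global posterior cutoff AND the per-field rules."""
    return [
        p for p in pairs
        if p["posterior"] >= posterior_cutoff
        and meets_field_rules(p["gamma"], required_levels)
    ]


# Hypothetical candidate pairs (made-up values, for illustration only).
candidate_pairs = [
    {"id": ("a1", "b7"), "posterior": 0.97, "gamma": {"fname": 2, "lname": 2, "dob": 2}},
    {"id": ("a2", "b3"), "posterior": 0.91, "gamma": {"fname": 2, "lname": 2, "dob": 0}},
    {"id": ("a5", "b9"), "posterior": 0.60, "gamma": {"fname": 1, "lname": 2, "dob": 2}},
    {"id": ("a6", "b2"), "posterior": 0.93, "gamma": {"fname": 2, "lname": 2, "dob": None}},
]

# Global cutoff of 0.85 plus a hard requirement of full agreement on dob.
kept = filter_pairs(candidate_pairs, posterior_cutoff=0.85, required_levels={"dob": 2})
kept_ids = [p["id"] for p in kept]
```

Note that the rule operates on agreement levels rather than on a per-field probability, since the model does not produce a separate match probability for each field of a given pair. Treating a missing dob as a failure is a design choice of this sketch; a different policy may suit other data.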

The Python-based splink has a similar open issue: https://github.com/moj-analytical-services/splink/issues/434. The proprietary Match*Pro has a "Classification Tab" with a user-friendly GUI for creating similar deterministic criteria.

For what it is worth, to me this seems of little use compared with other promised features under development such as probabilistic blocking and active learning.

tedenamorado commented 1 year ago

Hi @ajw5296,

As @aalexandersson mentions, it is not possible to set deterministic rules based on the probability of observing a specific agreement value for field k given that a pair of records is a match. The model learns these probabilities from the data.

Our focus is on the probability that a pair of records is a match given the agreement pattern and the parameters of the model, which is a composite measure of the field-specific probabilities of observing an agreement value given that a pair of records is a match.
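As a rough illustration of how such a composite measure is formed, the sketch below computes the standard Fellegi-Sunter posterior under conditional independence across fields. It is not fastLink's code: the function name `match_posterior` and every parameter value are made up for illustration, and missing values are ignored for simplicity. Here `m[field][level]` is the probability of an agreement level given a match, `u[field][level]` is the same given a non-match, and `lam` is the prior share of matches among all pairs.

```python
from math import prod


def match_posterior(gamma, m, u, lam):
    """Posterior probability of a match given the pair's agreement pattern.

    gamma: dict mapping field -> agreement level (0 disagree, 1 partial, 2 agree)
    m, u:  dicts mapping field -> list of level probabilities (match / non-match)
    lam:   prior probability that a randomly chosen pair is a match
    """
    p_match = lam * prod(m[f][g] for f, g in gamma.items())
    p_nonmatch = (1 - lam) * prod(u[f][g] for f, g in gamma.items())
    return p_match / (p_match + p_nonmatch)


# Hypothetical parameters (not estimates from any real data).
m = {
    "fname": [0.02, 0.08, 0.90],
    "lname": [0.02, 0.08, 0.90],
    "dob":   [0.05, 0.10, 0.85],
}
u = {
    "fname": [0.98, 0.015, 0.005],
    "lname": [0.98, 0.015, 0.005],
    "dob":   [0.98, 0.01, 0.01],
}
lam = 0.001

all_agree = match_posterior({"fname": 2, "lname": 2, "dob": 2}, m, u, lam)
dob_disagrees = match_posterior({"fname": 2, "lname": 2, "dob": 0}, m, u, lam)
```

With these made-up numbers, a disagreement on dob lowers the posterior substantially but does not by itself rule the pair out. That is the difference between a field's weight in the model and a hard per-field cutoff.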

However, an alternative would be to pass your own set of parameters to fastLink. For example, we discuss how to pass parameters from a random sample of observations to a larger dataset here.

If you feel we can be of further assistance, please let us know.

All my best,

Ted