kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

Blocking strategy with millions of rows #59

Closed: lucasmalherbe closed this issue 2 years ago

lucasmalherbe commented 2 years ago

I’m trying to link files with millions of lines (10M, up to 50M), using blocking to reduce the number of pairs.

The fastLink tutorial handles blocking by applying the algorithm independently on each block. However, in my case that’s not possible because my blocking variables have many different values and some of the blocks are small. Moreover, if I choose to use several blocking variables and consider the union of all remaining pairs, the blocks will not necessarily be disjoint.

The other strategy shown in the tutorial consists of estimating the parameters on a sample of the whole Cartesian product and then applying these estimates on each block. But the literature highlights that parameter estimation via the EM algorithm is biased when the proportion of true matches becomes very small, so I’m not comfortable doing this either.
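For concreteness, this is the strategy I mean, as a rough sketch (the estimate.only and em.obj arguments are assumed from the fastLink documentation; the data frame and variable names are illustrative, not my real data):

library(fastLink)

## Estimate the EM parameters on random samples of each file
set.seed(123)
sampA <- dfA[sample(nrow(dfA), 100000), ]
sampB <- dfB[sample(nrow(dfB), 100000), ]
em.est <- fastLink(sampA, sampB,
                   varnames = c("givenname", "surname", "dob", "postcode"),
                   stringdist.match = c("givenname", "surname"),
                   estimate.only = TRUE)

## Re-use the sampled estimates within one block without re-running EM
## (blockA / blockB stand for the subsets of dfA / dfB in a given block)
fl.block <- fastLink(blockA, blockB,
                     varnames = c("givenname", "surname", "dob", "postcode"),
                     stringdist.match = c("givenname", "surname"),
                     em.obj = em.est)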

I would like to block and then apply the algorithm to all the remaining pairs at once, but that doesn’t seem compatible with the way blocking is implemented in fastLink.

Can you think of any other blocking strategy here?

aalexandersson commented 2 years ago

Disclaimer: I (Anders) am a regular fastLink user, not a fastLink developer.

As far as I know, fastLink currently allows four different blocking strategies:

  1. exact,
  2. window,
  3. k-means clustering,
  4. "approximate probabilistic", meaning a separate run of fastLink with a "loose" threshold of your choice, such as 0.5 or 0.6.

The developers are very much aware of the blocking limitations in fastLink and have made some progress in improving the blocking capabilities. In general, the most promising blocking strategy seems to be probabilistic blocking; for a reference, see Enamorado and Steorts (2020). At the same time, "approximate probabilistic" blocking is currently the most difficult to achieve, especially when you have small blocks.
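To make strategy 4 above concrete, here is a minimal sketch (the threshold.match argument and the matches$inds.a / matches$inds.b fields are assumed from the fastLink documentation; variable names are illustrative, and it assumes the loose pass itself is computationally feasible on your data):

library(fastLink)

## First pass with a deliberately loose posterior threshold; the pairs it
## returns serve as candidate pairs (the "block") for a stricter pass.
loose.out <- fastLink(dfA, dfB,
                      varnames = c("givenname", "surname", "birthyear"),
                      stringdist.match = c("givenname", "surname"),
                      threshold.match = 0.5)

## Row indices of the candidate pairs retained by the loose run
candA <- loose.out$matches$inds.a
candB <- loose.out$matches$inds.b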

In my experience, you can easily combine exact, window, and k-means clustering blocking. You can also create your own blocking variables in R to use with fastLink; for example, you can block on a subset or coarser version of a variable to avoid blocks that are too small (see the sketch below). Therefore, you have plenty of blocking options. I have never found blocking to be a real limitation in fastLink in terms of enabling a fast enough run; you just need to block enough.

That said, I admit that blocking in fastLink is ad hoc and can be a hassle. A small overlap (and a lot of missingness, etc.) can be a problem, as you mention. Therefore, I hope there will be a fastLink update soon with user-friendly probabilistic blocking (and with active learning for faster clerical review). In the meantime, until the next update is available, there is reason to be optimistic that blocking in fastLink will also be good enough for you. Which blocking and linkage variables do you have, and which combination did you try?
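As a sketch of the "create your own blocking variable" idea (my own example; the window.block and window.size arguments are assumed from the blockData documentation, and the variable names are illustrative):

library(fastLink)

## Derive a coarser blocking variable so exact blocking does not create
## tiny blocks, then add a window block on birthyear within each block.
dfA$postcode2 <- substr(dfA$postcode, 1, 2)
dfB$postcode2 <- substr(dfB$postcode, 1, 2)

blocks <- blockData(dfA, dfB,
                    varnames = c("postcode2", "birthyear"),
                    window.block = "birthyear",
                    window.size = 1)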

lucasmalherbe commented 2 years ago

Thanks for your reply.

Can you elaborate on probabilistic blocking? On which datasets do you run fastLink with the loose threshold? It won’t run on the whole datasets.

I have the following linkage variables: given name, surname, date of birth, postcode, and address. I’d like to use postcode and birthyear as blocking variables and keep the union of all pairs from these two blocking steps, but I didn’t find a way to do that using fastLink’s blockData function.

aalexandersson commented 2 years ago

For now, I think you must use exact, window, or k-means blocking here to reduce the number of pairs for fastLink.

How many blocks do you get if you block exactly on postcode and birthyear? How large is the largest block, and how small is the smallest? Can fastLink run on the smallest block? Can it run on the largest?

## Exact block on postcode and birthyear
library(fastLink)
blockdata_out <- blockData(dfA, dfB, varnames = c("postcode", "birthyear"))
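To check the block sizes, something like this should work (the dfA.inds / dfB.inds element names are assumed from the blockData output structure):

length(blockdata_out)                                    ## number of blocks
sizesA <- sapply(blockdata_out, function(b) length(b$dfA.inds))
sizesB <- sapply(blockdata_out, function(b) length(b$dfB.inds))
summary(sizesA * sizesB)                                 ## candidate pairs per block
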
lucasmalherbe commented 2 years ago

The issue I'm facing is that the blockData function only allows blocking together records for which postcode AND birthyear are equal, but I want to block records for which postcode OR birthyear is equal; otherwise I will get too many false negatives due to blocking. My understanding is that the blockData function doesn't allow this type of blocking, but it would be useful.
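To illustrate what I mean, here is a rough sketch of the union of candidate pairs from two separate exact-blocking passes (my own code, not a blockData feature; the dfA.inds / dfB.inds element names are assumed from the blockData output, and materializing pairs like this would be very memory-heavy at 10M+ rows):

library(fastLink)

pairs_from_blocks <- function(blocks) {
  ## Enumerate all within-block candidate pairs as (row in dfA, row in dfB)
  do.call(rbind, lapply(blocks, function(b)
    expand.grid(idA = b$dfA.inds, idB = b$dfB.inds)))
}

blocks.pc <- blockData(dfA, dfB, varnames = "postcode")
blocks.by <- blockData(dfA, dfB, varnames = "birthyear")

## "OR blocking": union of the two candidate-pair sets
union.pairs <- unique(rbind(pairs_from_blocks(blocks.pc),
                            pairs_from_blocks(blocks.by)))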

aalexandersson commented 2 years ago

Yes, "OR blocking", which also known as "indexing by disjunction", is not available in fastLink. I agree it would be useful. It is available in several well-known record linkage software such as AutoMatch, Link Plus and Match*Pro. But probabilistic blocking would be more useful because it can propagate uncertainty into the record linkage.

Maybe you can first aggregate rare values of birthyear and postcode? Alternatively, maybe you can block only on common values of birthyear and postcode? Either way should result in fewer but larger blocks, which should reduce the false negatives.
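A rough sketch of the first idea (my own example; the count cutoff and variable names are illustrative):

## Lump rare postcode values into a single "RARE" category so that exact
## blocking on the recoded variable does not create tiny blocks.
pc.counts <- table(c(dfA$postcode, dfB$postcode))
rare.pc   <- names(pc.counts[pc.counts < 50])

dfA$postcode.blk <- ifelse(dfA$postcode %in% rare.pc, "RARE", dfA$postcode)
dfB$postcode.blk <- ifelse(dfB$postcode %in% rare.pc, "RARE", dfB$postcode)

## Then block on the recoded variable instead of raw postcode
blockdata_out <- blockData(dfA, dfB, varnames = c("postcode.blk", "birthyear"))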

lucasmalherbe commented 2 years ago

I'll keep these ideas in mind, thank you!