
R package fastLink: Fast Probabilistic Record Linkage

Run times for field comparison variables #58

Closed: marialma closed this issue 2 years ago

marialma commented 2 years ago

Hi! I've been using fastLink to do some matching projects for work. I've noticed that some field comparisons take way longer than others (and some may not finish at all).

For example, when matching a dataset with ~300k entries against another dataset with ~1.4 million entries, name comparison takes ~10 minutes, but address comparison takes hours, both using gammaCKpar with the same cutoffs. Sometimes I just have to kill the address comparison completely.

Do you have any suggestions on why this might be taking so long, or what I could do to make it go faster?

aalexandersson commented 2 years ago

Disclaimer: I am a regular fastLink user, not a developer.

That's a huge record linkage, which takes a long time to compute. Partial matching takes longer than exact matching (the default), and variables with many missing values require more time to compare.

I recommend reducing the computational burden by blocking, perhaps exact blocking on gender and k-means blocking on first name with 2-5 clusters. For example, see this code from the fastLink GitHub page:

## Exact block on gender, k-means block on first name with 2 clusters
blockdata_out <- blockData(dfA, dfB, varnames = c("gender", "firstname"),
                           kmeans.block = "firstname", nclusters = 2)
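
blockData() returns the indices of each block; you then subset both datasets and run fastLink() separately within each block. A minimal sketch along the lines of the fastLink README (the varnames here are placeholders for whatever fields you are matching on):

## Subset each dataset by the indices that blockData() returns
dfA_block1 <- dfA[blockdata_out$block.1$dfA.inds, ]
dfB_block1 <- dfB[blockdata_out$block.1$dfB.inds, ]
dfA_block2 <- dfA[blockdata_out$block.2$dfA.inds, ]
dfB_block2 <- dfB[blockdata_out$block.2$dfB.inds, ]

## Run the linkage within each block
fl_block1 <- fastLink(dfA_block1, dfB_block1,
                      varnames = c("firstname", "lastname", "address"))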

Simply add more blocking until the linkage runs "fast enough". My guess is that with nclusters = 5, fastLink will run in less than one hour for your two datasets, which would be fully acceptable to me.

If you still have a problem with the matching project, please provide more details, such as the fastLink code you used.

Best, Anders

marialma commented 2 years ago

Thanks for the suggestion! I think I tried blocking earlier, but some of the implementation was a bit confusing (mostly, my existing gender tags are not reliable, as one dataset records gender at birth and the other is just "M/F/Other"). I'll give it another shot!

aalexandersson commented 2 years ago

Typically, gender is a reliable variable and is therefore often used for blocking. But you might achieve enough blocking without using gender as a blocking variable.

You want to block only as much as necessary to run the linkage sufficiently fast. Another blocking idea is window blocking on age instead of k-means blocking on first name. Again, the best blocking variable and method of blocking depend on which variable is more reliable.
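
For example (a sketch based on the blockData() arguments, assuming your data have a numeric age variable):

## Window block on age, with a window size of +/- 1 year
blockdata_out <- blockData(dfA, dfB, varnames = "age",
                           window.block = "age", window.size = 1)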

The k-means and window blocking methods are often comparable, but finding the best combination of blocking can be tedious and ad hoc.

Partial matching is not always needed. That is, partial matching might require more blocking, which could result in more missed matches than exact matching with less blocking. For exact name comparison, consider using phonetic encodings such as NYSIIS or Double Metaphone. For exact address comparison, consider using standardized components such as "street number" and "street name" rather than the full address as one variable; a sketch of both ideas follows below. Admittedly, the additional pre-processing requires more work, can be imperfect and tedious, and does not always improve the results. Another idea is to remove observations that are very unlikely to match, but in my experience blocking is what works best in most cases for probabilistic record linkage on large datasets.
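
A minimal sketch of that pre-processing (this assumes the CRAN package phonics, which is separate from fastLink, and hypothetical column names):

## Phonetic encoding: compare NYSIIS codes exactly instead of raw names
## (phonics expects alphabetic strings, so clean the names first if needed)
library(phonics)
dfA$lastname_nysiis <- nysiis(dfA$lastname)
dfB$lastname_nysiis <- nysiis(dfB$lastname)

## Address components: split "123 Main St" into street number and name
dfA$street_num  <- sub("^(\\d+)\\s.*$", "\\1", dfA$address)
dfA$street_name <- sub("^\\d+\\s+", "", dfA$address)
## ... do the same for dfB, then pass the new columns to fastLink() as
## exact-match variables (i.e., leave them out of stringdist.match)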

The fastLink developers are working on making fastLink faster and more accurate in a couple of different ways, such as "active learning" and "probabilistic blocking". In most cases, standard deterministic blocking still works well -- even for much larger datasets than you have here.

marialma commented 2 years ago

Thank you for your help! Since I was looking for the entries that didn't match, I ended up blocking on birth month, then taking all of the entries that didn't match and running those against the full dataset. My initial concern with blocking was that there were missing values and typos galore, but the second round of matching took care of that. Processing time went from ~5 hours total to ~1.5!
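
In case it helps anyone else, here is roughly what that looked like (a sketch with hypothetical column names, assuming blockData() returns a list of blocks with dfA.inds/dfB.inds and that the fastLink output holds matched indices in fl$matches$inds.a):

library(fastLink)

## Pass 1: exact block on birth month, link within each block
blocks <- blockData(dfA, dfB, varnames = "birthmonth")
matched_a <- integer(0)
for (b in blocks) {
  fl <- fastLink(dfA[b$dfA.inds, ], dfB[b$dfB.inds, ],
                 varnames = c("firstname", "lastname", "address"))
  matched_a <- c(matched_a, b$dfA.inds[fl$matches$inds.a])
}

## Pass 2: rerun only the still-unmatched dfA rows against all of dfB
dfA_rest <- dfA[setdiff(seq_len(nrow(dfA)), matched_a), ]
fl_rest <- fastLink(dfA_rest, dfB,
                    varnames = c("firstname", "lastname", "address"))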