moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
https://moj-analytical-services.github.io/splink/
MIT License

Explore blocking rules performance metrics #1054

Open RossKen opened 1 year ago

RossKen commented 1 year ago

Is your proposal related to a problem?

In general, it is difficult to assess the quality of your blocking rules. Beyond looking at the number of comparisons generated by each blocking rule, we largely just have to hope that we have caught all of the potential matches.
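As a starting point, that comparison count can be pulled per rule. A minimal sketch, assuming a splink 3 `linker` has already been constructed (setup omitted) and using two illustrative rules:

```python
# Count how many comparisons each candidate blocking rule would generate.
# Assumes an existing splink 3 `linker`; the rules below are examples only.
candidate_rules = [
    "l.first_name = r.first_name",
    "l.surname = r.surname",
]
for rule in candidate_rules:
    n = linker.count_num_comparisons_from_blocking_rule(rule)
    print(f"{rule}: {n:,} comparisons generated")
```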

The ONS Data Linking Journal Club had a session exploring the paper below:

2021_Dasylva__Estimating_the_false_negatives_due_to_blocking_in_record_linkage.pdf

This looks at a method for estimating the number of true matches excluded by a set of blocking rules, which could be a useful metric when deciding how tight blocking rules should be.
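For intuition about the quantity being estimated: when a sample of labelled true-match pairs is available, it can be measured directly as the share of true matches that no blocking rule generates (the paper's contribution is estimating this without such labels). A hypothetical pandas sketch, with invented column names and simple equality rules:

```python
import pandas as pd

# Labelled true-match pairs; column names and values are illustrative only.
labelled_matches = pd.DataFrame({
    "first_name_l": ["ann", "ann", "bob"],
    "first_name_r": ["ann", "anne", "rob"],
    "surname_l":    ["smith", "smith", "jones"],
    "surname_r":    ["smith", "smith", "jone"],
})

# Equality blocking rules expressed as (left column, right column) pairs.
blocking_rules = [("first_name_l", "first_name_r"), ("surname_l", "surname_r")]

# A pair survives blocking if at least one rule would generate it.
captured = pd.Series(False, index=labelled_matches.index)
for left_col, right_col in blocking_rules:
    captured |= labelled_matches[left_col] == labelled_matches[right_col]

blocked_out = int((~captured).sum())
print(f"{blocked_out} of {len(labelled_matches)} true matches "
      f"({blocked_out / len(labelled_matches):.0%}) are excluded by blocking")
```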

Describe the solution you'd like

This issue is not intended to result in a specific feature; rather, it is a prompt to explore the ideas in the paper above more thoroughly. If, after investigation, this seems worth implementing, create a new issue with a specific output defined.

Questions to consider:

lamaeldo commented 4 months ago

A slightly lower-cost way of evaluating this would be to compute the number of comparisons created by each individual blocking rule and then, once `.predict()` has been run, to see how many pairs were accepted as a proportion of the pairs created. This would give the user a cheap metric indicating the direction in which the blocking rules should be developed; a sketch is below.
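A possible sketch of that ratio, assuming a trained splink 3 `linker` (construction omitted), a `blocking_rules` list matching the settings, and the `match_key` column that splink writes to the predict output to record which blocking rule generated each pair:

```python
# Score all blocked pairs; 0.9 is an arbitrary acceptance threshold.
df_predict = linker.predict(threshold_match_probability=0.9).as_pandas_dataframe()

for i, rule in enumerate(blocking_rules):
    # Comparisons this rule generates on its own.
    created = linker.count_num_comparisons_from_blocking_rule(rule)
    # Accepted pairs attributed to this rule. Caveat: splink attributes a
    # pair caught by several rules to the first one only, so the per-rule
    # ratios are approximate.
    accepted = int((df_predict["match_key"].astype(str) == str(i)).sum())
    print(f"{rule}: {accepted:,} accepted / {created:,} created "
          f"= {accepted / created:.2%}")
```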