Open RossKen opened 1 year ago
A slightly lower-cost way of evaluating this would be to compute the number of comparisons created by each blocking rule individually and, once we have run .predict(), to see how many pairs were accepted as a ratio of the number of pairs created. This would give the user a cheap metric indicating the directions in which the blocking rules should be developed.
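The per-rule metric described above could be sketched roughly as follows. This is not the Splink API — the record fields, rule names, and the `accepted` set standing in for the output of `.predict()` are all illustrative assumptions:

```python
import itertools

# Toy records; in practice these would come from the input dataframe.
records = [
    {"id": 1, "city": "Leeds", "surname": "Smith"},
    {"id": 2, "city": "Leeds", "surname": "Smith"},
    {"id": 3, "city": "Leeds", "surname": "Jones"},
    {"id": 4, "city": "York", "surname": "Smith"},
]

# Blocking rules expressed as predicates over a record pair (assumed form;
# in Splink they would be SQL expressions).
blocking_rules = {
    "same_city": lambda a, b: a["city"] == b["city"],
    "same_surname": lambda a, b: a["surname"] == b["surname"],
}

# Stand-in for the set of pairs that .predict() accepted as matches.
accepted = {(1, 2)}


def blocking_rule_stats(records, blocking_rules, accepted):
    """For each rule, return (pairs created, accepted/created ratio)."""
    stats = {}
    for name, rule in blocking_rules.items():
        pairs = [
            (a["id"], b["id"])
            for a, b in itertools.combinations(records, 2)
            if rule(a, b)
        ]
        n_accepted = sum(1 for p in pairs if p in accepted)
        stats[name] = (len(pairs), n_accepted / len(pairs))
    return stats


for name, (n_pairs, ratio) in blocking_rule_stats(
    records, blocking_rules, accepted
).items():
    print(f"{name}: {n_pairs} pairs created, {ratio:.0%} accepted")
```

A rule that creates many pairs but contributes few accepted matches is a candidate for tightening; a rule whose pairs are almost all accepted may be too tight to catch fuzzier matches.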
Is your proposal related to a problem?
In general, it is difficult to assess the quality of your blocking rules. Beyond looking at the number of record pairs introduced by each blocking rule, we simply hope that we have caught all of the potential matches.
The ONS Data Linking Journal Club had a session exploring the paper below:
2021_Dasylva__Estimating_the_false_negatives_due_to_blocking_in_record_linkage.pdf
This looks at a method to estimate the number of true matches being excluded by a set of blocking rules, which could be a useful metric when defining how tight blocking rules should be.
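As a much simpler point of comparison than the paper's estimator: where a sample of known true matches is available, one can directly measure the share of them retained by at least one blocking rule — the remainder are the false negatives due to blocking. The fields, rules, and labelled pairs below are illustrative assumptions:

```python
# A labelled sample of known true-match record pairs (illustrative).
known_matches = [
    ({"city": "Leeds", "surname": "Smith"}, {"city": "Leeds", "surname": "Smyth"}),
    ({"city": "York", "surname": "Brown"}, {"city": "Leeds", "surname": "Brown"}),
    ({"city": "Hull", "surname": "Patel"}, {"city": "York", "surname": "Singh"}),
]

# Blocking rules as predicates over a record pair (assumed form).
rules = [
    lambda a, b: a["city"] == b["city"],
    lambda a, b: a["surname"] == b["surname"],
]

# A pair survives blocking if any rule retains it.
retained = sum(1 for a, b in known_matches if any(rule(a, b) for rule in rules))
recall = retained / len(known_matches)
missed = len(known_matches) - retained  # false negatives due to blocking
print(f"blocking recall: {recall:.2f} ({missed} true match(es) excluded)")
```

The value of the Dasylva approach is precisely that it estimates this quantity *without* requiring a labelled sample, which is what makes it worth exploring here.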
Describe the solution you'd like
This issue is not intended to result in a specific feature, but to act as a prompt to explore the ideas in the paper above more thoroughly. If, after investigation, it feels like this is worth implementing, then create a new issue with a specific output defined.
Questions to consider: