ipums / hlink

Hierarchical record linkage at scale
Mozilla Public License 2.0

Support OR conditions in blocking #138

Closed riley-harper closed 4 months ago

riley-harper commented 4 months ago

Closes #137.

This PR adds a new feature to the blocking section in the config file. Previously, the conditions from all of the blocking tables were joined with ANDs to form the final blocking condition, for example:

a.BPL = b.BPL AND a.SEX = b.SEX

Now there is a new or_group attribute available on the blocking tables, which users can set to group some blocking tables together into OR groups. The conditions within an OR group are joined by ORs, and the OR groups themselves are still joined by ANDs. This is especially helpful when multiple variables may contain the same information:

(a.BPL1 = b.BPL1 OR a.BPL2 = b.BPL2) AND (a.SEX = b.SEX)
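
As a rough illustration of how a list of OR groups maps to a condition like the one above, here is a minimal Python sketch. It is not the code from this PR (the real rendering happens in the potential_matches.sql template), and the function name blocking_condition is hypothetical.

```python
def blocking_condition(or_groups: list[list[str]]) -> str:
    """Render OR groups as a SQL join condition (illustrative only)."""
    clauses = []
    for group in or_groups:
        # Columns inside one group are alternatives, so join them with OR.
        ors = " OR ".join(f"a.{col} = b.{col}" for col in group)
        clauses.append(f"({ors})")
    # Every group must be satisfied, so join the groups with AND.
    return " AND ".join(clauses)


print(blocking_condition([["BPL1", "BPL2"], ["SEX"]]))
# (a.BPL1 = b.BPL1 OR a.BPL2 = b.BPL2) AND (a.SEX = b.SEX)
```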

By default, every blocking table gets put into its own OR group, so that the blocking condition is the same as it would have been before this PR. The matching.link_step_match.extract_or_groups_from_blocking() function has the logic for determining the OR groups from the input configuration. It returns a list[list[str]], where each sublist is an OR group. The potential_matches.sql template file has changed slightly to allow blocking_columns to be the new list of lists instead of a flat list.
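
The grouping behavior described above could be sketched roughly as follows. This is an assumption-laden illustration, not the actual body of extract_or_groups_from_blocking(): the function name group_blocking_columns is hypothetical, and it assumes each blocking table is a dict with a column_name key and an optional or_group key.

```python
def group_blocking_columns(blocking: list[dict]) -> list[list[str]]:
    """Collect blocking columns into OR groups (hypothetical sketch)."""
    groups: dict[tuple, list[str]] = {}
    for index, table in enumerate(blocking):
        if "or_group" in table:
            # Tables that share an or_group name land in the same group.
            key = ("named", table["or_group"])
        else:
            # No or_group given: the table gets a group of its own,
            # which reproduces the old AND-only behavior.
            key = ("default", index)
        groups.setdefault(key, []).append(table["column_name"])
    # Dicts preserve insertion order, so groups appear in config order.
    return list(groups.values())


blocking = [
    {"column_name": "BPL1", "or_group": "BPL"},
    {"column_name": "BPL2", "or_group": "BPL"},
    {"column_name": "SEX"},
]
print(group_blocking_columns(blocking))
# [['BPL1', 'BPL2'], ['SEX']]
```

Using tuple keys keeps the automatically assigned default groups from colliding with any name a user might pick for or_group.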

riley-harper commented 4 months ago

Thanks for the review, Colin. I was also a little surprised that it didn't allow for OR. I think since you can do ORs in comparisons after blocking, we haven't needed this till now.