ipums / hlink

Hierarchical record linkage at scale
Mozilla Public License 2.0
11 stars 2 forks source link

Support blocking sections with multiple exploded columns #143

Closed riley-harper closed 1 month ago

riley-harper commented 1 month ago

Fixes #142.

This PR fixes a bug in the matching explode step which seems like it has been around for a long time. When a user set explode = true for more than one blocking column, they got an error like this:

[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `<exploded_column>` cannot be resolved.

This error happened in the loop in the _explode() function. The loop constructed each exploded column one-by-one and then selected it and the other columns out of exploded_df. The problem was that each iteration tried to select all of the blocking columns out of exploded_df, even though the loop hadn't run for all of the exploded columns yet. This is why the first iteration of the loop threw an unresolved column error.

To fix this, I've switched the loop to add the exploded columns to exploded_df one-by-one with withColumn(). This actually ended up simplifying the loop pretty substantially because we can focus on a single column at a time instead of handling all of the columns to select out with list comprehensions. The results are the same.

I found another possible bug when working on this. In the previous implementation, we selected all_column_names out of exploded_df if and only if there was at least one exploded column. The tests depend on this behavior. So I've added some logic to replicate the behavior and a comment explaining it. Changing this behavior is probably technically a breaking change because it changes the columns of the exploded_df_a and exploded_df_b tables. This might have ramifications for later tasks as well. One thing I did change is the order of the columns in exploded_df_a and exploded_df_b, since previously we were selecting with an unordered set. I've sorted the columns so that they aren't in a random order in the output tables.