ipums / hlink

Hierarchical record linkage at scale
Mozilla Public License 2.0

Configurations with multiple exploded blocking columns cause errors in Matching step 0 - explode #142

Closed riley-harper closed 2 months ago

riley-harper commented 2 months ago

In a configuration file with multiple blocking columns marked as "explode", hlink hits an error in Matching step 0 when it tries to explode the columns. The code appears to be written to support this case, but it does not handle it correctly. The problem may be in the loop in the _explode() function in hlink/linking/matching/link_step_explode.py.

The error looks like this:

pyspark.errors.exceptions.captured.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `<blocking_column>` cannot be resolved.

Notably, the code works when there is just a single exploded column, and the Spark query plan includes the first exploded column even when there are multiple exploded columns. It's the second exploded column that is missing and causes the error.

riley-harper commented 2 months ago

I took a closer look at this. The error comes from inside the loop in the _explode() function. On line 169 of hlink/linking/matching/link_step_explode.py, we run exploded_df.select(explode_selects). But explode_selects contains all of the output column names, including exploded columns that we haven't created yet. So when there are multiple exploded columns, the first iteration of the loop throws an unresolved column error. Everything works as expected when there's only one exploded column, since we construct the column before selecting it out on line 169.
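
To illustrate the failure described above, here is a toy model in plain Python, with lists of dicts standing in for Spark DataFrames. It is only a sketch: the function names, the column names (birthyr_1, age_1), and the "fixed" variant are all hypothetical and do not reflect hlink's actual code or the fix that was eventually applied. The point is that selecting the full list of output columns inside the loop fails as soon as a second exploded column exists that has not been created yet.

```python
# Toy model of the failure, using plain lists of dicts instead of Spark.
# All names here are hypothetical; this is not hlink's actual code.


def select(rows, columns):
    """Mimic DataFrame.select(): fail if a requested column does not exist."""
    for row in rows:
        missing = [c for c in columns if c not in row]
        if missing:
            raise KeyError(f"column(s) cannot be resolved: {missing}")
    return [{c: row[c] for c in columns} for row in rows]


def explode_column(rows, source, target, window):
    """Emit one row per value in [source - window, source + window]."""
    return [
        {**row, target: value}
        for row in rows
        for value in range(row[source] - window, row[source] + window + 1)
    ]


def explode_all_buggy(rows, base_columns, specs):
    """Select every output column inside the loop (the pattern that fails)."""
    output_columns = base_columns + [target for _, target, _ in specs]
    for source, target, window in specs:
        rows = explode_column(rows, source, target, window)
        # With 2+ specs, later targets do not exist yet on the first pass.
        rows = select(rows, output_columns)
    return rows


def explode_all_fixed(rows, base_columns, specs):
    """One possible alternative: select only after every column exists."""
    for source, target, window in specs:
        rows = explode_column(rows, source, target, window)
    return select(rows, base_columns + [target for _, target, _ in specs])


if __name__ == "__main__":
    rows = [{"id": 1, "birthyr": 1900, "age": 30}]
    base = ["id", "birthyr", "age"]
    specs = [("birthyr", "birthyr_1", 1), ("age", "age_1", 1)]

    # A single exploded column works in the buggy version: 3 rows.
    print(len(explode_all_buggy(rows, base, specs[:1])))

    # Two exploded columns fail on the first loop iteration.
    try:
        explode_all_buggy(rows, base, specs)
    except KeyError as err:
        print("error:", err)

    # Deferring the select produces 3 * 3 = 9 rows.
    print(len(explode_all_fixed(rows, base, specs)))
```

In this model, the single-column case succeeds because the one exploded column is created before the select runs, which matches the behavior seen in hlink. Deferring the select until after the loop is just one way to avoid the problem; restricting each iteration's select to the columns created so far would be another.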

riley-harper commented 2 months ago

I checked out versions 3.5.0 and 3.4.0 and copied the failing test over into the same test file in those versions. The test fails with the same error message. So this seems to be a long-standing hidden bug, not something that has changed with our recent modifications to blocking (OR groups and multi_jaro_winkler_search).