Closed: riley-harper closed this issue 2 months ago
I took a closer look at this. The error comes from inside the loop in the _explode() function. On line 169 of hlink/linking/matching/link_step_explode.py, we run exploded_df.select(explode_selects). But explode_selects contains all of the output column names, including exploded columns that we haven't created yet. So when there are multiple exploded columns, the first iteration of the loop throws an unresolved column error. Everything works as expected when there's only one exploded column, since we construct the column before selecting it out on line 169.
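Here's a minimal sketch of that failure mode. The column names and explode expressions are made up for illustration; this is not hlink's actual code, just the same select-inside-the-loop pattern:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import array, col, explode

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 1900, "SMITH"), (2, 1901, "JONES")],
    ["id", "birthyr", "namelast"],
)

# All of the output column names, including both exploded columns.
explode_selects = ["id", "birthyr_exploded", "namelast_exploded"]

# Hypothetical expressions producing the arrays to explode.
explode_exprs = {
    "birthyr_exploded": array(col("birthyr") - 1, col("birthyr"), col("birthyr") + 1),
    "namelast_exploded": array(col("namelast")),
}

exploded_df = df
for output_col, expr in explode_exprs.items():
    # On the first iteration, explode_selects still references
    # "namelast_exploded", which hasn't been created yet, so Spark
    # raises an unresolved-column AnalysisException here.
    exploded_df = exploded_df.withColumn(output_col, explode(expr)).select(explode_selects)
```

With only one entry in explode_exprs, the select resolves fine, which matches the behavior described above.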
I checked out versions 3.5.0 and 3.4.0 and copied the failing test over into the same test file in those versions. The test fails with the same error message. So this seems to be a long-standing hidden bug, not something that has changed with our recent modifications to blocking (OR groups and multi_jaro_winkler_search).
In a configuration file with multiple blocking columns marked as "explode", hlink hits an error in Matching step 0 when it tries to explode the columns. It looks like the code is not handling this case correctly, although it's written in a way that's intended to do so. Maybe there is a problem with the loop in the _explode() function in hlink/linking/matching/link_step_explode.py. The error is an unresolved column error from Spark.
Notably, this code is working when there is just a single exploded column, and it looks like the Spark query plan includes the first exploded column even when there are multiple exploded columns. It's the second exploded column that is missing and causes the error.
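A plausible fix, continuing the sketch above (still hypothetical names, not hlink's actual code), is to add every exploded column with withColumn first and only run the select once, after all of the names in explode_selects exist:

```python
# Build every exploded column before selecting anything.
exploded_df = df
for output_col, expr in explode_exprs.items():
    exploded_df = exploded_df.withColumn(output_col, explode(expr))

# Now every name in explode_selects resolves, no matter how many
# columns were exploded.
exploded_df = exploded_df.select(explode_selects)
```

That lines up with the observation that the single-column case works because the column is constructed before it is selected.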