ipums / hlink

Hierarchical record linkage at scale
Mozilla Public License 2.0

Wrong input column for exploded blocking columns when expand_length not set #145

Open riley-harper opened 2 months ago


When working on #142, I noticed that in hlink/linking/matching/link_step_explode.py, if expand_length is not set for a blocking column, we run the following code:

explode_col_expr = explode(col(exploding_column_name))

However, the rest of the code treats exploding_column_name as the output column name and derived_from_column as the input column name. So I think there is a bug here. This should be

explode_col_expr = explode(col(derived_from_column))

instead, unless I am misunderstanding something. This is probably a low-impact bug, since you only hit it when blocking on an array-typed input column without expand_length set. I believe that most exploded columns are integer columns with expand_length set.
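
To make the input/output mix-up concrete, here is a minimal sketch in plain Python rather than Spark. The helper `explode_rows`, the sample rows, and the column names `nicknames` and `nickname_exploded` are all made up for illustration and are not taken from hlink; only `exploding_column_name` and `derived_from_column` mirror the variable names quoted above. It assumes the output column does not already exist on the input data.

```python
def explode_rows(rows, input_col, output_col):
    """Plain-Python stand-in for an explode: one output row per array element."""
    out = []
    for row in rows:
        for value in row[input_col]:  # fails if input_col is not in the row
            new_row = dict(row)
            new_row[output_col] = value
            out.append(new_row)
    return out


# Hypothetical input data: "nicknames" is an array-typed column.
rows = [
    {"id": 1, "nicknames": ["bob", "rob"]},
    {"id": 2, "nicknames": ["liz"]},
]

exploding_column_name = "nickname_exploded"  # output column (hypothetical name)
derived_from_column = "nicknames"            # input column (hypothetical name)

# Proposed fix: read from the input column, write to the output column.
fixed = explode_rows(rows, derived_from_column, exploding_column_name)

# Behavior described in this issue: the output name is used as the input.
# In this sketch that is a KeyError, because no such column exists yet.
try:
    explode_rows(rows, exploding_column_name, exploding_column_name)
    buggy_error = None
except KeyError as exc:
    buggy_error = exc
```

In this sketch the bug only stays hidden when the output name happens to match an existing array column, which would be consistent with it going unnoticed so far.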