ipums / hlink

Hierarchical record linkage at scale
Mozilla Public License 2.0

Allow 1 or 3+ columns as input in the array feature selection #134

Closed: riley-harper closed this issue 5 months ago

riley-harper commented 5 months ago

Right now, the array feature selection can only combine exactly two input columns into one output column. To make it more flexible, we could accept any number of input columns, with a minimum of 1. This should be a small change in hlink/linking/core/transforms.py, where we currently unpack feature_selection["input_columns"] with

col1, col2 = feature_selection["input_columns"]

The pyspark.sql.functions.array() function, which we already use here, accepts a variable number of column arguments, so it does not need exactly two.
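A rough sketch of what the change could look like, assuming feature_selection["input_columns"] is a list of column names. The helper names get_array_input_columns and build_array_column are hypothetical and only for illustration; they are not the actual code in transforms.py.

```python
from typing import Any


def get_array_input_columns(feature_selection: dict[str, Any]) -> list[str]:
    """Return the input column names for an array feature selection.

    Hypothetical helper: accepts one or more columns instead of exactly two.
    Assumes "input_columns" is a list of column name strings.
    """
    input_columns = list(feature_selection["input_columns"])
    if len(input_columns) < 1:
        raise ValueError(
            "the array feature selection requires at least one input column"
        )
    return input_columns


def build_array_column(feature_selection: dict[str, Any]):
    """Build the combined array column expression (hypothetical helper).

    Calling this requires PySpark and an active Spark session.
    """
    # Imported here so the validation helper above works without Spark.
    from pyspark.sql.functions import array

    input_columns = get_array_input_columns(feature_selection)
    # array() is variadic, so unpacking the list covers 1, 2, or 3+ columns.
    return array(*input_columns)
```

Replacing the two-name tuple unpacking with argument unpacking like this would keep the existing two-column behavior and also cover the 1 and 3+ column cases.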