ipums / hlink

Hierarchical record linkage at scale
Mozilla Public License 2.0

Fix a bug in Training step 3 for categorical features #107

Closed riley-harper closed 10 months ago

riley-harper commented 10 months ago

Fixes #105. Thanks to @jrbalch543 for the fix.

This PR fixes a bug in Training step 3 (save model metadata). Previously we extracted a single coefficient for each feature, even when the feature was categorical. But categorical features are one-hot encoded, so each of their categories (or "levels") gets its own coefficient. We now explode the categorical features in step 3 and save the correct coefficient for each category of the feature. The actual columns are still one-hot encoded as normal. To keep this sort of issue from coming back, we've set strict=True on the zip of feature names and coefficients, which raises an error if the two lists differ in length.

To be able to calculate the number of categories for each feature, we now save the training_features_prepped table in the previous training step. This table includes all comparison features along with the results of the pre-pipeline, which are imputed and one-hot encoded columns. This table is hidden because it is generally not useful to users unless they know what they're looking for and really want to dig into what hlink is doing in the pre-pipeline.
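
As a rough sketch of why the prepped table is enough to recover these counts: if each one-hot encoded column holds a fixed-length vector, the length of that vector in any row is the number of coefficient slots for the feature. The example below uses a plain dict and lists in place of a real table row, and the helper `slot_counts` and the column name `region_onehotencoded` are hypothetical, not names taken from hlink.

```python
def slot_counts(row, categorical_columns):
    """Map each one-hot column to the length of its vector (hypothetical helper)."""
    return {col: len(row[col]) for col in categorical_columns}


# A made-up row: two numeric columns and one one-hot encoded column.
row = {
    "namefrst_jw": 0.92,
    "region_onehotencoded": [0.0, 1.0, 0.0],
    "age_diff": 2.0,
}

counts = slot_counts(row, ["region_onehotencoded"])
```

Note that some encoders drop a reference level, so the vector length (and not the raw number of distinct values in the source column) is what has to line up with the model's coefficients.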