ipums / hlink

Hierarchical record linkage at scale
Mozilla Public License 2.0

Simplify the `substring` column mapping transform configuration #146

Open riley-harper opened 2 months ago

riley-harper commented 2 months ago

The substring column mapping transform lets you extract substrings from a string column. Currently, its configuration looks like this:

transforms = [
  {type = "substring", values = [0, 4]}
]

The `values` array must have length 2. The first element is the starting index of the substring, and the second element is its length. This would be simpler and more readable as two separate TOML attributes, like this:

transforms = [
  {type = "substring", start_index = 0, length = 4}
]
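
To make the equivalence of the two forms concrete, here is a small plain-Python illustration. It is only a sketch: the `substring` function below is a hypothetical stand-in, not hlink's implementation, and it assumes 0-based indexing as in Python slices, which may not match the convention hlink uses internally.

```python
def substring(text: str, start_index: int, length: int) -> str:
    """Hypothetical stand-in: take `length` characters starting at `start_index` (0-based here)."""
    return text[start_index:start_index + length]


# The same transform written in the current style and in the proposed style.
old_style = {"type": "substring", "values": [0, 4]}
new_style = {"type": "substring", "start_index": 0, "length": 4}

# In the current style, the meaning of each element depends on its position.
start, length = old_style["values"]
old_result = substring("Washington", start, length)

# In the proposed style, each attribute names its own meaning.
new_result = substring("Washington", new_style["start_index"], new_style["length"])
```

Both calls extract the same four characters; the only difference is that the proposed attributes are self-describing.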

The first step would be to deprecate the old `values` array and support the new attributes. During this step, we would still accept the `values` array, but we would internally convert it to `start_index` and `length` and print a deprecation message. Then, in hlink v4.0 or whenever we next make breaking changes, we could drop support for the `values` array entirely and support only the separate attributes.
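
The deprecation step could look roughly like the sketch below. The helper name `normalize_substring_transform` is hypothetical and does not exist in hlink, and using Python's `warnings` module for the deprecation message is just one option. Rejecting a transform that sets both the old and the new attributes is also an assumption, since that case isn't covered above.

```python
import warnings


def normalize_substring_transform(transform: dict) -> dict:
    """Hypothetical helper: return a copy of the transform that uses start_index and length."""
    normalized = dict(transform)
    if "values" in normalized:
        # Assumption: mixing the old and new attributes is treated as an error.
        if "start_index" in normalized or "length" in normalized:
            raise ValueError("use either 'values' or 'start_index' and 'length', not both")
        values = normalized.pop("values")
        if len(values) != 2:
            raise ValueError("'values' must have exactly 2 elements: [start_index, length]")
        warnings.warn(
            "the 'values' attribute of the substring transform is deprecated; "
            "use 'start_index' and 'length' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # Convert the positional array into the two named attributes.
        normalized["start_index"], normalized["length"] = values
    return normalized
```

With something like this in place, the rest of the transform code would only need to read `start_index` and `length`, and dropping `values` later would mean deleting the conversion branch.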