ipums / hlink

Hierarchical record linkage at scale
Mozilla Public License 2.0

Make configuration more reliable with stronger typing #35

Open riley-harper opened 2 years ago

riley-harper commented 2 years ago

We've thought about using Rust for this, because the way it handles typed parsing and validation is absurdly cool. Python's more permissive parsing may make this change hard to do in a completely backwards-compatible manner: some TOML formatting issues that the current Python parser tolerates might cause errors when parsed with Rust.

The general idea is to use maturin to build the Rust crate as a Python module, then call that module from the existing Python code to parse the hlink config file. We could use the Rust toml crate, which supports defining the configuration as a Rust struct with derive macros. With some magic from the Rust serde crate, we can parse an enumeration like comparison types without any changes to the current configuration format.

riley-harper commented 1 year ago

I recently discovered the dacite Python package, which seems like just the tool for doing this parsing and validation in Python without adding a Rust component. The typing isn't as strict as Rust's, but we get to keep everything in Python. I think that dacite is a better choice than serde + Rust at the moment.

This issue is tied to the fact that we're using an old, out-of-date TOML parser, but it can be worked on separately. The two are separate layers: the TOML parser goes from a file to a dictionary, and this layer goes from that dictionary to an internal representation of the configuration.