open-contracting / cardinal-rs

Measure red flags and procurement indicators using OCDS data
https://cardinal.readthedocs.io
MIT License

R044: More robust address matching #33

Open jpmckinney opened 1 year ago

jpmckinney commented 1 year ago

For example, dedupe (as I remember) applies address normalization (for at least US addresses). If we follow the same approach, we'd need to implement appropriate normalization for different jurisdictions. This strategy uses equality tests, but allows for some address components to be missing (e.g. "Main" vs "Main St"). I know Roberto Rocha recently evaluated a few different strategies when merging Canadian political donation datasets.
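
As a rough sketch of the "equality tests, with missing components allowed" idea: the code below assumes addresses have already been split into components. The component names (`number`, `street`, `suffix`, `locality`), the `SUFFIXES` table and both function names are hypothetical, not taken from dedupe or Cardinal. A real implementation would need a parser and an abbreviation table per jurisdiction.

```python
import re

# Hypothetical abbreviation table for US-style street suffixes.
# Each jurisdiction would need its own table.
SUFFIXES = {"street": "st", "st": "st", "avenue": "ave", "ave": "ave", "road": "rd", "rd": "rd"}


def normalize(address: dict) -> dict:
    """Lowercase each component, strip punctuation and canonicalize suffixes."""
    normalized = {}
    for key, value in address.items():
        if not value:
            continue  # treat empty components as missing
        tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
        normalized[key] = " ".join(SUFFIXES.get(token, token) for token in tokens)
    return normalized


def same_address(a: dict, b: dict, required=("number", "street")) -> bool:
    """Compare components by equality, tolerating components missing on one side."""
    a, b = normalize(a), normalize(b)
    # The required components must be present on both sides.
    if any(key not in a or key not in b for key in required):
        return False
    # Every component present on both sides must be equal.
    return all(a[key] == b[key] for key in a.keys() & b.keys())
```

With this sketch, `{"number": "1", "street": "Main", "suffix": "St."}` matches `{"number": "1", "street": "main"}` because the suffix is simply missing on one side, while a different street number is a hard mismatch.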

I think naive fuzzy matching will yield too many false positives (e.g. 1 Main St, Podunk, New York, USA 12345 and 100 Main St, ... are very close typographically, but are not at all the same address).
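
To illustrate the concern, here is a quick check using `difflib.SequenceMatcher` from the Python standard library as a stand-in for "naive fuzzy matching". The second address is completed with the same trailing text as the first purely for illustration, and the 0.9 threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher


def naive_similarity(a: str, b: str) -> float:
    """Character-level similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a, b).ratio()


first = "1 Main St, Podunk, New York, USA 12345"
second = "100 Main St, Podunk, New York, USA 12345"

ratio = naive_similarity(first, second)
# An arbitrary threshold of 0.9 would treat these two different addresses as a match.
is_false_positive = ratio > 0.9 and first != second
```

The two strings differ only by two characters, so a character-level score rates them as near-identical even though the street numbers refer to different buildings.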

The first implementation could just do simple equality.
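
A minimal sketch of what "simple equality" could look like, assuming each tenderer is a dict with a hypothetical `id` and a single-string `address`. The shape of the data and the function name are assumptions. Case folding and whitespace collapsing go slightly beyond strict equality and could be dropped.

```python
from collections import defaultdict


def group_by_address(tenderers: list) -> dict:
    """Return addresses shared by more than one tenderer, mapped to the tenderer IDs."""
    groups = defaultdict(list)
    for tenderer in tenderers:
        address = tenderer.get("address")
        if address:
            # Fold case and collapse whitespace, then compare by plain equality.
            key = " ".join(address.casefold().split())
            groups[key].append(tenderer["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```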

The metadata for this indicator should include a measure of similarity (percentage or otherwise).
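
One possible measure, sketched under the assumption that addresses are already normalized and split into components: the percentage of components that are present in both addresses and equal, out of all components present in either. The function name and the component names are hypothetical, and other measures would be equally valid.

```python
def component_similarity(a: dict, b: dict) -> float:
    """Percentage of components (present in either address) that are present in both and equal."""
    keys = {key for key in a.keys() | b.keys() if a.get(key) or b.get(key)}
    if not keys:
        return 0.0
    equal = sum(1 for key in keys if a.get(key) and a.get(key) == b.get(key))
    return 100.0 * equal / len(keys)
```

For example, two addresses that agree on three of four components, with the fourth missing on one side, would score 75.0, and that number could be reported alongside the indicator result.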

jpmckinney commented 1 year ago

The prepare command could perhaps do address normalization.