For example, dedupe (as I remember) applies address normalization (at least for US addresses). If we follow the same approach, we'd need to implement appropriate normalization for different jurisdictions. This strategy uses equality tests, but allows some address components to be missing (e.g. "Main" vs "Main St"). I know Roberto Rocha recently evaluated a few different strategies when merging Canadian political donation datasets.
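To make that concrete, here's a rough Python sketch of what normalization plus component-tolerant equality could look like. The suffix map and function names are placeholders of my own, not dedupe's actual API, and a real version would need per-jurisdiction rules:

```python
# Minimal sketch: normalize, then do an equality test that tolerates a
# missing trailing component (e.g. "Main" vs "Main St"). Illustrative only.
import re

# Tiny US-only suffix map; a real implementation would need a much larger,
# jurisdiction-specific table.
US_SUFFIXES = {"street": "st", "avenue": "ave", "road": "rd", "boulevard": "blvd"}

def normalize_address(raw: str) -> list[str]:
    """Lowercase, strip punctuation, collapse common US street suffixes."""
    tokens = re.sub(r"[^\w\s]", " ", raw.lower()).split()
    return [US_SUFFIXES.get(tok, tok) for tok in tokens]

def addresses_match(a: str, b: str) -> bool:
    """Equality after normalization, allowing trailing components to be absent."""
    ta, tb = normalize_address(a), normalize_address(b)
    shorter, longer = sorted((ta, tb), key=len)
    return shorter == longer[: len(shorter)]

print(addresses_match("123 Main", "123 Main Street"))   # True
print(addresses_match("1 Main St", "100 Main St"))      # False
```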
I think naive fuzzy matching will yield too many false positives (e.g. "1 Main St, Podunk, New York, USA 12345" and "100 Main St, ..." are very close typographically, but are not at all the same address).
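To illustrate with a quick check using Python's stdlib (the exact number doesn't matter, just the magnitude):

```python
# Character-level similarity rates these two different addresses as
# near-identical, which is the false-positive risk with naive fuzzy matching.
from difflib import SequenceMatcher

a = "1 Main St, Podunk, New York, USA 12345"
b = "100 Main St, Podunk, New York, USA 12345"

ratio = SequenceMatcher(None, a, b).ratio()
print(f"{ratio:.2f}")  # roughly 0.97: very similar strings, very different addresses
```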
The first implementation could just do simple equality.
The metadata for this indicator should include a measure of similarity (percentage or otherwise).
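Something like this minimal sketch, where the equality test drives the match and the similarity score only goes into the metadata (the key names are placeholders, not a settled schema):

```python
# First pass: exact equality decides the match; a similarity percentage is
# recorded in the indicator metadata for later, fuzzier strategies.
from difflib import SequenceMatcher

def compare_addresses(a: str, b: str) -> dict:
    similarity = SequenceMatcher(None, a, b).ratio()
    return {
        "match": a == b,                               # simple equality for now
        "similarity_pct": round(similarity * 100, 1),  # kept in metadata only
    }

print(compare_addresses("1 Main St, Podunk, New York, USA 12345",
                        "100 Main St, Podunk, New York, USA 12345"))
# {'match': False, 'similarity_pct': 97.4}
```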