tplagge / benefice

6 stars 2 forks source link

Address matching algorithm #4

Open derekeder opened 11 years ago

derekeder commented 11 years ago

Determine a quick, accurate way of matching addresses across multiple datasets.

This library seems like a potentially useful start: https://github.com/jjensenmike/python-streetaddress

From Forest: From what I remember reading in this area, there is no better approach than using a gazetteer (if available). For Chicago, we know all the street names and the their address ranges. https://data.cityofchicago.org/Transportation/Chicago-Street-Names/i6bp-fvbx

Taking that as the gazetteer, the task is to find the standardized street name that is most similar to our query address.

That would standardize the street name, and often the direction.

If we had a source of trusted address or smaller resolution address ranges (maybe the building footprints?), then matching against that gazetteer is the best way to go.

For comparing the similarity of a query address to a target address I would recommend the Levenshtein distance or a modification like the affine-gap distance we use for dedupe.

This is more flexible and will tend to be much more accurate than regexp or similar tricks.

derekeder commented 11 years ago

see this thread for more info: https://github.com/maschinenmensch/edifice/issues/23