Determine a quick, accurate way of matching addresses across multiple datasets.
This library seems like a potentially useful start: https://github.com/jjensenmike/python-streetaddress
From Forest: From what I remember reading in this area, there is no better approach than using a gazetteer (if one is available). For Chicago, we know all the street names and their address ranges: https://data.cityofchicago.org/Transportation/Chicago-Street-Names/i6bp-fvbx
Taking that as the gazetteer, the task is to find the standardized street name that is most similar to our query address.
That would standardize the street name, and often the direction.
If we had a source of trusted addresses or finer-resolution address ranges (maybe the building footprints?), then matching against that gazetteer would be the best way to go.
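Just to make that concrete, here's a rough sketch of building the gazetteer from a local CSV export of that dataset. The file name and the column names (Direction / Street / Suffix) are my guesses, not something I've checked against the actual download:

```python
import csv

def load_gazetteer(path="chicago_street_names.csv"):
    """Build a set of standardized street names from a Chicago Street Names export.

    Column names below are assumptions about the CSV export; adjust them to
    whatever the download actually contains.
    """
    gazetteer = set()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            parts = [row.get("Direction", ""), row.get("Street", ""), row.get("Suffix", "")]
            # Join the non-empty pieces into one upper-cased standardized name,
            # e.g. "N WESTERN AVE".
            name = " ".join(p.strip().upper() for p in parts if p and p.strip())
            if name:
                gazetteer.add(name)
    return gazetteer
```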
For comparing the similarity of a query address to a target address, I would recommend the Levenshtein distance or a modification like the affine-gap distance we use for dedupe.
This is more flexible and will tend to be much more accurate than regexps or similar tricks.
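Here's a self-contained sketch of that comparison: a plain Levenshtein distance plus a lookup that picks the closest standardized name. The toy gazetteer at the bottom is just for illustration; in practice you'd use the set built from the export above, and the affine-gap implementation from the affinegap package that dedupe uses could be dropped in as the distance function instead.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance: insertions, deletions,
    and substitutions all cost 1."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def best_street_match(query_street, gazetteer):
    """Return the standardized street name with the smallest edit distance
    to the upper-cased, stripped query street."""
    query = query_street.upper().strip()
    return min(gazetteer, key=lambda name: levenshtein(query, name))


# Toy gazetteer for illustration only.
gazetteer = {"N WESTERN AVE", "W DIVISION ST", "S MICHIGAN AVE"}
print(best_street_match("W. Divison Street", gazetteer))  # -> W DIVISION ST
```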