maschinenmensch / edifice

A database of the built environment in Chicago

Address matching algorithm #23

Open derekeder opened 11 years ago

derekeder commented 11 years ago

Determine a quick, accurate way of matching addresses across multiple datasets.

matthewgee commented 11 years ago

I'll do it! Pick me!

derekeder commented 11 years ago

related to #22

mccc commented 11 years ago

Out of ignorance, I'll chime in and just note that there are a wide variety of commercial applications, bordering on cheap (I've seen anywhere from $100-$1000), for standardizing (and thereby matching) addresses. While I assume this kind of thing has never been an open-source project due to boring maintenance aspects (keeping track of new construction, changed zip codes, etc.) I wonder if it wouldn't be possible to achieve a high degree of accuracy if we just limited ourselves to Chicago. This would of course depend on what address standardization tools the city departments are using internally -- does anyone know about this?

jpvelez commented 11 years ago

When he first assembled edifice, Cory used a bag of tricks to get addresses to match. Getting these from him would save a ton of time.


mccc commented 11 years ago

Yes, that's true; Cory's brain and/or command history is likely a rich resource here. I also assume he knows which datasets have the most reliable addresses (i.e. ones that have likely already gone through a CASS-certified converter) and which are really unreliable. This may affect the order in which we import data as well (e.g. is the address data in building_footprints good or bad?)

fgregg commented 11 years ago

From what I remember reading in this area, there is no better approach than using a gazetteer (if available). For Chicago, we know all the street names and their address ranges. https://data.cityofchicago.org/Transportation/Chicago-Street-Names/i6bp-fvbx

Taking that as the gazetteer, the task is to find the standardized street name that is most similar to our query address.

That would standardize the street name, and often the direction.

If we had a source of trusted addresses or smaller-resolution address ranges (maybe the building footprints?), then matching against that gazetteer would be the best way to go.

For comparing the similarity of a query address to a target address I would recommend the Levenshtein distance or a modification like the affine-gap distance we use for dedupe.

This is more flexible and will tend to be much more accurate than regexes or similar tricks.
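To make the gazetteer idea concrete, here's a minimal sketch in Python: it standardizes a query street name by picking the closest entry in the city's street-name export. The CSV column name is an assumption (check the actual export), and dedupe's affine-gap distance could be swapped in for the plain edit distance used here.

```python
import csv

def levenshtein(a, b):
    """Plain edit distance; an affine-gap metric would be a drop-in upgrade."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            insert = current[j - 1] + 1
            delete = previous[j] + 1
            substitute = previous[j - 1] + (ca != cb)
            current.append(min(insert, delete, substitute))
        previous = current
    return previous[-1]

def load_gazetteer(path):
    # 'Street' as the column name is a guess at the portal's CSV layout.
    with open(path, newline='') as f:
        return [row['Street'].upper() for row in csv.DictReader(f)]

def standardize_street(query, gazetteer):
    """Return the gazetteer street name closest to the query string."""
    query = query.upper().strip()
    return min(gazetteer, key=lambda name: levenshtein(query, name))

# e.g. standardize_street('MILWUAKEE', streets) -> 'MILWAUKEE'
```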

danxoneil commented 11 years ago

fwiw, the gazetteer approach was what we used at EveryBlock when we introduced address pages: http://blog.everyblock.com/2009/oct/14/addresspages/. I did a lot of user feedback monitoring/response, and I never had anybody complain of mismatching. It worked really well.

jpvelez commented 11 years ago

Via @daguar: Code for America's New Orleans team built a site called BlightStatus last year. It lets you look up whether a NOLA property is blighted and what the city is doing about it. The data originally came from a mess of disparate, super-dirty spreadsheets, so they spent a ton of time writing scripts to do address normalization. Check them out. CfA fellow @eddietejeda knows more.

Also here's a Ruby port of a Perl library that's supposed to be pretty good at address normalization.

fgregg commented 11 years ago

@danxoneil, do you know if any code from the EveryBlock gazetteer ever made it across? If not, do you know if the person who worked on it would be willing to chat with me about the approach they took?

danxoneil commented 11 years ago

@fgregg https://github.com/paulsmith is yer man on dat.

fgregg commented 11 years ago

@paulsmith gave us some pointers.

The EveryBlock code lives on in a project called OpenBlock. I would start with the geocoder we wrote:

https://github.com/openplans/openblock/tree/master/ebpub/ebpub/geocoder

especially the parser:

https://github.com/openplans/openblock/blob/master/ebpub/ebpub/geocoder/parser/parsing.py

The approach we took was roughly to parse the location into a set of possible matches, and then query the database to see which ones actually exist in the data.

https://github.com/openplans/openblock/blob/master/ebpub/ebpub/geocoder/base.py#L423

That should be a good starting point.
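A rough sketch of that parse-then-verify idea (not the OpenBlock code itself; the token categories and the blocks table schema below are made up for illustration):

```python
import re
import sqlite3

DIRECTIONS = {'N', 'S', 'E', 'W'}
SUFFIXES = {'AVE': 'AVE', 'AVENUE': 'AVE', 'ST': 'ST', 'STREET': 'ST', 'BLVD': 'BLVD'}

def candidate_parses(raw):
    """Yield plausible (number, direction, street, suffix) readings of an address string."""
    tokens = raw.upper().split()
    number = tokens[0] if tokens and tokens[0].isdigit() else None
    rest = tokens[1:] if number else tokens
    direction = rest[0] if rest and rest[0] in DIRECTIONS else None
    if direction:
        rest = rest[1:]
    suffix = SUFFIXES.get(rest[-1]) if rest else None
    if suffix:
        rest = rest[:-1]
    street = ' '.join(rest)
    yield (number, direction, street, suffix)   # strictest reading first
    yield (number, direction, street, None)     # fall back to ignoring the suffix

def geocode(conn, raw):
    """Return the first candidate parse that actually exists in a hypothetical blocks table."""
    for number, direction, street, suffix in candidate_parses(raw):
        row = conn.execute(
            "SELECT * FROM blocks WHERE street = ? "
            "AND (? IS NULL OR direction = ?) "
            "AND (? IS NULL OR suffix = ?) "
            "AND from_num <= ? AND ? <= to_num",
            (street, direction, direction, suffix, suffix,
             int(number or 0), int(number or 0)),
        ).fetchone()
        if row:
            return row
    return None
```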

There's significant overlap between the parser and this Python port of http://search.cpan.org/~timb/Geo-StreetAddress-US-1.03/US.pm

I think it's pretty interesting that the EveryBlock code did not do any on-the-fly fuzzy matching (there's a table of street misspellings that it looks at). We should follow up on why. Maybe EveryBlock just wanted to be very conservative and was willing to trade off recall for precision.
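For contrast with the fuzzy-matching sketch above, a static misspellings table is just a dictionary lookup, so it can only fix errors someone has already catalogued and never maps an unknown string to the wrong street. The entries here are illustrative, not taken from OpenBlock:

```python
# Hand-curated misspelling map: conservative, favors precision over recall.
MISSPELLINGS = {
    'MILWUAKEE': 'MILWAUKEE',
    'STONY ISL': 'STONY ISLAND',
}

def conservative_standardize(street):
    street = street.upper().strip()
    # Unknown strings pass through unchanged rather than being guessed at.
    return MISSPELLINGS.get(street, street)
```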