openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

US state abbreviations #8

Closed boshkins closed 8 years ago

boshkins commented 8 years ago

Al,

Does the library recognize the US state abbreviations? Apparently not:

> ./libpostal "1 Main Street Reston VA 20333" en
1 main street reston vale 20333
> ./libpostal "1 Main Street Reston Virginia 20333" en
1 main street reston virginia 20333

However, I believe I have seen handling of state abbreviations mentioned in your commit comments. Could you clarify?

Thanks! Anatoly

albarrentine commented 8 years ago

Ah. The state abbreviations commit (https://github.com/openvenues/libpostal/commit/89208120550cace14cee164464f3cff9a6f4faca) has to do with constructing training data for the address parser. Since OSM almost always uses state abbreviations (and country abbreviations), I expand them randomly with certain probabilities, add country names in the local language and other popular languages, etc.
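For illustration only, here is a minimal Python sketch of that kind of probabilistic expansion. It is not the code from the commit: `STATE_NAMES`, `maybe_expand_state`, `training_variant`, and the `expand_prob` parameter are hypothetical names, and the probability is an arbitrary placeholder.

```python
import random

# Hypothetical sample dictionary; the real training pipeline reads its own data files.
STATE_NAMES = {"VA": "Virginia", "NY": "New York", "CA": "California"}


def maybe_expand_state(abbreviation, expand_prob=0.5, rng=random):
    """Return the full state name with probability expand_prob, else the input."""
    full_name = STATE_NAMES.get(abbreviation.upper())
    if full_name is None:
        # Unknown token: leave it untouched.
        return abbreviation
    return full_name if rng.random() < expand_prob else abbreviation


def training_variant(components, expand_prob=0.5, rng=random):
    """Copy an OSM-style component dict, randomly expanding its state field."""
    variant = dict(components)
    if "state" in variant:
        variant["state"] = maybe_expand_state(variant["state"], expand_prob, rng)
    return variant


if __name__ == "__main__":
    record = {"house_number": "1", "road": "Main Street",
              "city": "Reston", "state": "VA", "postcode": "20333"}
    for _ in range(3):
        print(training_variant(record))
```

Sampling this way means the parser sees both "VA" and "Virginia" in the state position during training, even though the source data mostly contains the abbreviation.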

For libpostal expand, toponym/place abbreviations will start to be included with the introduction of GeoDB to the build (includes state abbreviations, city alternate names, everything in GeoNames). I've held off on that thus far as GeoDB in its current form takes up a lot of space on-disk, but as it's a requirement for address parsing, I'm currently investigating a much more compact representation.

riordan commented 8 years ago

@thatdatabaseguy have you looked at Who's on First as an analog to GeoNames? Could be a more compact representation since there's a lot less cruft.

albarrentine commented 8 years ago

@riordan indeed, I have been following WoF and spoke with Nathaniel and Aaron about it when I was last out in SF.

At some point WoF may be the backing store for libpostal's geo disambiguation but there are a few things that are missing/sporadic in WoF:

As soon as WoF is up-to-par with GeoNames on these particular points, it should be worth the (not-so-trivial) effort to switch.

In the meantime, if we can label phrases in a string with GeoNames ids, it should be relatively easy to join to WoF for the interested user.

Also re: size, I've refactored the GeoDB and reduced its size to something more reasonable, such that it can be downloaded with libpostal for the address parser. It wasn't large because of cruft in GeoNames but because of how many keys we were storing in the on-disk db. I've now converted that to a memory-efficient trie for the keys (which share long prefixes) plus a sparse matrix for the potential resolutions of an entity feature, and only store an id lookup table on-disk. Clocks in at ~500M, which is still hefty, but much more tolerable than the previous size (11G).
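To show why a trie helps when keys share long prefixes, here is a toy dict-based trie in Python. It is only a sketch of the idea, not libpostal's C data structure, and the `state:...` key format and the ids are made up for the example.

```python
class Trie:
    """Toy trie: each shared prefix character is stored once."""

    def __init__(self):
        self.root = {}
        self.node_count = 0  # number of character nodes created

    def insert(self, key, value):
        node = self.root
        for ch in key:
            if ch not in node:
                node[ch] = {}
                self.node_count += 1
            node = node[ch]
        node[None] = value  # None marks the end of a complete key

    def get(self, key):
        node = self.root
        for ch in key:
            node = node.get(ch)
            if node is None:
                return None
        return node.get(None)


if __name__ == "__main__":
    trie = Trie()
    keys = ["state:virginia", "state:vermont"]  # hypothetical key format
    for value, key in enumerate(keys):
        trie.insert(key, value)
    print("characters in keys:", sum(len(k) for k in keys))
    print("trie nodes:", trie.node_count)
```

The node count grows only with the characters that differ between keys, which is the property that matters when millions of keys start with the same prefixes.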

riordan commented 8 years ago

I've done some preliminary work integrating Who's on First with Wikidata entries, so we may be able to build a pipeline to cross-reference some of the missing elements you've identified from there, particularly for administrative regions.

albarrentine commented 8 years ago

awesome! :star2: Wikidata is the jam. That would certainly help with multilingual toponyms and authoritative names, assuming coverage is good. We only need the postal codes themselves, not necessarily the lat/lons, so doesn't matter if those come from GeoNames.

You may be interested in libpostal's name deduping (Python): https://github.com/openvenues/libpostal/blob/master/scripts/geodata/names/deduping.py. An example concrete implementation for neighborhoods can be found in https://github.com/openvenues/libpostal/blob/master/scripts/geodata/polygons/reverse_geocode.py.

After reviewing the literature and some trial-and-error, I found Soft-TFIDF (from this paper: https://www.cs.cmu.edu/~pradeepr/papers/ijcai03.pdf) to be quite useful for doing approximate deduping of short named entities, including geographic entities. My implementation in libpostal has been used successfully to effectively create a concordance between OSM neighborhoods (where we reliably have localized/local script names) and Zetashapes/Quattroshapes, which is being used to augment the address parser.
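As a rough illustration of the Soft-TFIDF idea, here is a small Python sketch. It is not the implementation in `deduping.py`: the function names are hypothetical, the IDF is smoothed as a simplification, and `difflib.SequenceMatcher` stands in for the Jaro-Winkler secondary similarity used in the paper.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def tfidf_vector(tokens, doc_freq, num_docs):
    """Unit-length TF-IDF weights for one tokenized name (smoothed IDF, an assumption)."""
    counts = Counter(tokens)
    weights = {
        tok: math.log(tf + 1.0) * math.log(1.0 + num_docs / doc_freq.get(tok, 1))
        for tok, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {tok: w / norm for tok, w in weights.items()}


def token_similarity(a, b):
    """Stand-in secondary similarity; the paper uses Jaro-Winkler."""
    return SequenceMatcher(None, a, b).ratio()


def soft_tfidf(tokens_a, tokens_b, doc_freq, num_docs, threshold=0.9):
    """Sum weight products over tokens of A whose closest token in B is similar enough."""
    vec_a = tfidf_vector(tokens_a, doc_freq, num_docs)
    vec_b = tfidf_vector(tokens_b, doc_freq, num_docs)
    score = 0.0
    for tok_a, weight_a in vec_a.items():
        best_tok, best_sim = None, 0.0
        for tok_b in vec_b:
            sim = token_similarity(tok_a, tok_b)
            if sim > best_sim:
                best_tok, best_sim = tok_b, sim
        if best_tok is not None and best_sim >= threshold:
            score += weight_a * vec_b[best_tok] * best_sim
    return score


if __name__ == "__main__":
    # Hypothetical document frequencies over a corpus of 100 names.
    doc_freq = {"upper": 5, "west": 20, "side": 10}
    print(soft_tfidf(["uper", "west", "side"], ["upper", "west", "side"],
                     doc_freq, 100, threshold=0.8))
```

Unlike plain TF-IDF cosine similarity, a misspelled or transliterated token can still contribute to the score as long as its closest counterpart clears the threshold.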

This method could be useful for doing similar entity matching between WoF and Wikidata, or between GeoNames and any of the above. To reduce the number of comparisons that need to be made (so it's not N^2; this is sometimes known as "blocking"), I currently use R-tree and geohash-based indices since we're dealing with points and polygons. For names alone, one would need to do something like MinHashing shingles, or making spellchecker-style edits a la Norvig and checking all the variations against the index, which is still O(1) in terms of the cardinality of the candidate set.
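A minimal sketch of spatial blocking in Python, for illustration. It uses a plain latitude/longitude grid as a simplified stand-in for the R-tree and geohash indices mentioned above; the function names, the `cell_size` default, and the record ids are all hypothetical.

```python
from collections import defaultdict
from itertools import product


def cell_of(lat, lon, cell_size=0.01):
    """Map a point to an integer grid cell (a crude substitute for a geohash prefix)."""
    return (int(lat // cell_size), int(lon // cell_size))


def build_index(records, cell_size=0.01):
    """Group record ids by grid cell. records is an iterable of (id, lat, lon)."""
    index = defaultdict(list)
    for rec_id, lat, lon in records:
        index[cell_of(lat, lon, cell_size)].append(rec_id)
    return index


def candidates(index, lat, lon, cell_size=0.01):
    """Return ids in the query point's cell and its 8 neighbours."""
    row, col = cell_of(lat, lon, cell_size)
    found = []
    for d_row, d_col in product((-1, 0, 1), repeat=2):
        found.extend(index.get((row + d_row, col + d_col), []))
    return found


if __name__ == "__main__":
    records = [
        ("osm:1", 40.7851, -73.9683),   # hypothetical ids and coordinates
        ("osm:2", 40.7855, -73.9690),
        ("osm:3", 34.0522, -118.2437),
    ]
    index = build_index(records)
    print(candidates(index, 40.7853, -73.9685))
```

Only the candidates returned for a point need an expensive name comparison, so the work per record depends on local density rather than on the size of the whole dataset. Checking the neighbouring cells avoids missing pairs that straddle a cell boundary.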

I'll take a look at concordances again at some point (and contribute them to WoF), but if this code is helpful in the meantime, feel free to use it.

albarrentine commented 8 years ago

@boshkins I decided this was better implemented in the standard expansion dictionaries.

If you pull latest, state abbreviations for the US, Canada (English and French) and Australia are handled by the expand API. Since many of the state abbreviations are potentially ambiguous with legitimate tokens, it should also return one version of the string which leaves the abbreviation alone.
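To make that behaviour concrete, here is a toy Python sketch of dictionary-based expansion that keeps the original token as one of the alternatives. It is not libpostal's implementation or API; `EXPANSIONS` and `expand` are hypothetical, and the dictionary holds only two sample entries.

```python
from itertools import product

# Hypothetical dictionary entries for illustration; not libpostal's data files.
EXPANSIONS = {
    "va": ["virginia"],
    "st": ["street", "saint"],
}


def expand(address):
    """Return every combination of original and expanded tokens."""
    tokens = address.lower().split()
    # Each token keeps itself as the first option, so the all-original
    # string is always among the results.
    options = [[tok] + EXPANSIONS.get(tok, []) for tok in tokens]
    return [" ".join(choice) for choice in product(*options)]


if __name__ == "__main__":
    for variant in expand("1 Main Street Reston VA 20333"):
        print(variant)
```

Keeping the unexpanded token matters for cases where "va" is a real word in the address rather than a state code.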

riordan commented 8 years ago

Awesome!

On Dec 9, 2015, at 7:29 PM, Al Barrentine notifications@github.com wrote:

Closed #8.
