US state abbreviations

boshkins commented 8 years ago

Al,

Does the library recognize the US state abbreviations? Apparently not:

> ./libpostal "1 Main Street Reston VA 20333" en
1 main street reston vale 20333
> ./libpostal "1 Main Street Reston Virginia 20333" en
1 main street reston virginia 20333

However, I believe I have seen handling of state abbreviations mentioned in your commit comments. Could you clarify?

Thanks! Anatoly

albarrentine commented 8 years ago

Ah. The state abbreviations commit (https://github.com/openvenues/libpostal/commit/89208120550cace14cee164464f3cff9a6f4faca) has to do with constructing training data for the address parser. Since OSM almost always uses state abbreviations (and country abbreviations), I expand them randomly with certain probabilities, add country names in the local language and other popular languages, etc.
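
To make the training-data idea concrete, here is a minimal sketch of randomly expanding state abbreviations with a fixed probability. The lookup table `STATE_NAMES`, the function `maybe_expand_state`, and the default probability are all hypothetical; the probabilities libpostal actually uses are not given in this thread.

```python
import random

# Hypothetical lookup table, for illustration only.
STATE_NAMES = {"VA": "Virginia", "NY": "New York", "CA": "California"}


def maybe_expand_state(tokens, expand_prob=0.5, rng=random):
    """Replace known state abbreviations with full names, each with
    probability expand_prob; all other tokens pass through unchanged."""
    out = []
    for tok in tokens:
        full = STATE_NAMES.get(tok.upper())
        if full is not None and rng.random() < expand_prob:
            out.append(full)
        else:
            out.append(tok)
    return out
```

Calling this repeatedly on the same OSM-style address would yield a mix of abbreviated and spelled-out training examples, so a parser trained on the result sees both forms.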

For libpostal expand, toponym/place abbreviations will start to be included with the introduction of GeoDB to the build (includes state abbreviations, city alternate names, everything in GeoNames). I've held off on that thus far as GeoDB in its current form takes up a lot of space on-disk, but as it's a requirement for address parsing, I'm currently investigating a much more compact representation.

riordan commented 8 years ago

@thatdatabaseguy have you looked at Who's on First as an analog to GeoNames? Could be a more compact representation since there's a lot less cruft.

albarrentine commented 8 years ago

@riordan indeed, I have been following WoF and spoke with Nathaniel and Aaron about it when I was last out in SF.

At some point WoF may be the backing store for libpostal's geo disambiguation but there are a few things that are missing/sporadic in WoF:

- Correct names: there are many instances in GeoPlanet, upon which Quattroshapes and WoF rely heavily, where the names are pretty far off from what a human would be expected to type (many of these instances are documented in whosonfirst-data). I think GeoNames currently has more standard names and is easier to edit/update.
- Language-namespaced toponyms: this is on the roadmap at WoF, but GeoNames has better coverage, so we would need to join to GeoNames to get at the alternate names.
- Wikipedia links: again, on the WoF roadmap, but this helps significantly in determining whether a name needs to be qualified to stand in for an entity, and there are ~500K names resolving to Wikipedia in GeoNames currently.
- Postal codes: GeoNames has them for many countries. The lat/lons aren't great, but all we need are the postal codes themselves to help the address parser. Generally for numbers we replace all digits with a capital D in feature extraction, so 10013-1234 normalizes to DDDDD-DDDD, and we effectively count the "mask" as a feature instead of the more idiosyncratic digits. In some countries like South Africa, the postal codes are 4-digit numbers, so the same token shape can appear as a house number or as a postal code. A postal code lookup before normalization allows us to use more discriminative features like "word i=DDDD and matches postal code gazetteer" in the model.
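
The digit-mask idea from the postal code point can be sketched as follows. The gazetteer contents and the names `POSTAL_CODES`, `digit_mask`, and `token_features` are hypothetical, and the feature strings only imitate the wording above; this is not libpostal's feature extractor.

```python
import re

# Hypothetical postal code gazetteer, for illustration only.
POSTAL_CODES = {"8001", "2000", "10013"}


def digit_mask(token):
    """Replace every ASCII digit with a capital D, e.g. 10013-1234 -> DDDDD-DDDD."""
    return re.sub(r"[0-9]", "D", token)


def token_features(token, postal_codes=POSTAL_CODES):
    """Return the mask feature, plus a gazetteer feature when the raw token
    (looked up before normalization) is a known postal code."""
    mask = digit_mask(token)
    features = {"word=" + mask}
    if token in postal_codes:
        features.add("word=" + mask + " and matches postal code gazetteer")
    return features
```

Under these assumptions, `token_features("8001")` carries the extra gazetteer feature while `token_features("1234")` does not, even though both share the mask `DDDD`.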

As soon as WoF is up-to-par with GeoNames on these particular points, it should be worth the (not-so-trivial) effort to switch.

In the meantime, if we can label phrases in a string with GeoNames ids, it should be relatively easy to join to WoF for the interested user.

Also re: size, I've refactored the GeoDB and reduced its size down to something more reasonable such that it can be downloaded with libpostal for the address parser. It wasn't large because of cruft in GeoNames but because of how many keys we were storing in the on-disk db. I've now converted that to a memory-efficient trie for the keys, which share long prefixes, a sparse matrix for the potential resolutions of an entity feature and only store an id lookup table on-disk. Clocks in at ~500M, which is still hefty, but much more tolerable than the previous size (11G).
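
As a rough illustration of why a trie helps when keys share long prefixes, here is a toy dict-based trie that maps keys to integer ids. The class `Trie` and its methods are hypothetical, and a Python dict-of-dicts is not itself memory-efficient; the sketch only shows the prefix sharing, not the compact layout used in libpostal.

```python
_END = ""  # sentinel key for "a value is stored here"; never equals a single character


class Trie:
    def __init__(self):
        self.root = {}
        self.num_nodes = 1  # count the root

    def insert(self, key, value):
        node = self.root
        for ch in key:
            if ch not in node:
                node[ch] = {}
                self.num_nodes += 1  # only unshared characters add nodes
            node = node[ch]
        node[_END] = value

    def get(self, key, default=None):
        node = self.root
        for ch in key:
            node = node.get(ch)
            if node is None:
                return default
        return node.get(_END, default)
```

Inserting "new york", "new jersey", and "newark" stores 24 characters of keys in 18 nodes, because the common prefix "new" is stored once.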

riordan commented 8 years ago

I've done some preliminary work integrating Who's on First with Wikidata entries, so we may be able to build a pipeline to cross-reference some of the missing elements you've identified from there, particularly for administrative regions.

albarrentine commented 8 years ago

awesome! :star2: Wikidata is the jam. That would certainly help with multilingual toponyms and authoritative names, assuming coverage is good. We only need the postal codes themselves, not necessarily the lat/lons, so it doesn't matter if those come from GeoNames.

You may be interested in libpostal's name deduping (Python): https://github.com/openvenues/libpostal/blob/master/scripts/geodata/names/deduping.py. An example concrete implementation for neighborhoods can be found in https://github.com/openvenues/libpostal/blob/master/scripts/geodata/polygons/reverse_geocode.py.

After reviewing the literature and some trial-and-error, I found Soft-TFIDF (from this paper: https://www.cs.cmu.edu/~pradeepr/papers/ijcai03.pdf) to be quite useful for doing approximate deduping of short named entities, including geographic entities. My implementation in libpostal has been used successfully to effectively create a concordance between OSM neighborhoods (where we reliably have localized/local script names) and Zetashapes/Quattroshapes, which is being used to augment the address parser.
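
For readers who have not seen Soft-TFIDF, here is a minimal sketch of the scoring idea: TF-IDF cosine similarity where tokens may match approximately instead of exactly. It is not the libpostal implementation. The paper uses Jaro-Winkler as the token-level similarity; this sketch swaps in `difflib.SequenceMatcher.ratio()` from the standard library, and the `log(1 + N/df)` smoothing and all function names are assumptions made for illustration.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def document_frequencies(corpus):
    """Count how many tokenized names each token appears in."""
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))
    return df


def tfidf_vector(tokens, doc_freq, num_docs):
    """L2-normalized TF-IDF weights for one tokenized name."""
    counts = Counter(tokens)
    # log(1 + N/df) is a smoothing choice made here so weights never reach zero;
    # unseen tokens are treated as having document frequency 1.
    raw = {w: math.log(tf + 1.0) * math.log(1.0 + num_docs / doc_freq.get(w, 1))
           for w, tf in counts.items()}
    norm = math.sqrt(sum(x * x for x in raw.values()))
    return {w: x / norm for w, x in raw.items()} if norm else {}


def token_sim(a, b):
    """Stand-in for Jaro-Winkler: difflib's similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def soft_tfidf(s_tokens, t_tokens, doc_freq, num_docs, theta=0.9):
    """Sum weight(w in S) * weight(closest v in T) * sim(w, v) over tokens w
    whose closest match in T has similarity above theta."""
    vs = tfidf_vector(s_tokens, doc_freq, num_docs)
    vt = tfidf_vector(t_tokens, doc_freq, num_docs)
    if not vs or not vt:
        return 0.0
    score = 0.0
    for w, w_weight in vs.items():
        best_v, best_sim = max(((v, token_sim(w, v)) for v in vt),
                               key=lambda pair: pair[1])
        if best_sim > theta:
            score += w_weight * vt[best_v] * best_sim
    return score
```

With a small corpus of neighborhood names, identical names score 1.0, a name with a one-letter typo still scores well when `theta` is lowered a little, and unrelated names score 0.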

This method could be useful for doing similar entity matching between WoF and Wikidata, or between GeoNames and any of the above. To reduce the number of comparisons that need to be made (so it's not N^2; this is sometimes known as "blocking"), I currently use R-tree and geohash-based indices since we're dealing with points and polygons. For names alone, we would need to do something like MinHashing shingles, or making spellchecker-style edits a la Norvig and checking all the variations against the index, which is still O(1) in terms of the cardinality of the candidate set.
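
A minimal sketch of geohash-based blocking for point data follows: records are bucketed by geohash prefix, and only records in the same bucket become candidate pairs. The encoder follows the standard geohash bit-interleaving scheme; the function names, the `(id, lat, lon)` record layout, and the default precision are assumptions. Points that sit close together but on opposite sides of a cell boundary land in different buckets, so a fuller version would also check neighboring cells.

```python
from collections import defaultdict
from itertools import combinations

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=5):
    """Standard geohash: interleave longitude/latitude bisection bits,
    starting with longitude, and emit one base32 character per 5 bits."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def candidate_pairs(records, precision=5):
    """records: iterable of (id, lat, lon). Returns sorted id pairs that
    share a geohash cell, instead of all N^2 pairs."""
    buckets = defaultdict(list)
    for rec_id, lat, lon in records:
        buckets[geohash_encode(lat, lon, precision)].append(rec_id)
    pairs = []
    for ids in buckets.values():
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)
```

Only the pairs returned here would then be passed to the more expensive name-similarity comparison.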

I'll take a look at concordances again at some point (and contribute them to WoF), but if this code is helpful in the meantime, feel free to use it.

albarrentine commented 8 years ago

@boshkins I decided this was better implemented in the standard expansion dictionaries.

If you pull the latest, state abbreviations for the US, Canada (English and French) and Australia are handled by the expand API. Since many of the state abbreviations are potentially ambiguous with legitimate tokens, it should also return one version of the string which leaves the abbreviation alone.
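
The behavior described here, returning expanded strings alongside one that leaves the abbreviation untouched, can be illustrated with a toy function. The dictionary `ABBREVIATIONS` and the function `expand_variants` are hypothetical and are not libpostal's expand API or its dictionaries.

```python
from itertools import product

# Hypothetical expansion dictionary, for illustration only.
ABBREVIATIONS = {"va": ["virginia"], "st": ["street", "saint"]}


def expand_variants(text, abbreviations=ABBREVIATIONS):
    """Return every combination of expansions. The original token is always
    kept as the first option, so one output leaves abbreviations alone."""
    options = []
    for tok in text.lower().split():
        options.append([tok] + abbreviations.get(tok, []))
    return [" ".join(combo) for combo in product(*options)]
```

For "1 Main Street Reston VA 20333" this yields two strings, one ending in "reston va 20333" and one in "reston virginia 20333", leaving it to the caller to decide which reading is right.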

riordan commented 8 years ago

Awesome!

On Dec 9, 2015, at 7:29 PM, Al Barrentine notifications@github.com wrote:

Closed #8.

