Ah. The state abbreviations commit (https://github.com/openvenues/libpostal/commit/89208120550cace14cee164464f3cff9a6f4faca) has to do with constructing training data for the address parser. Since OSM almost always uses state abbreviations (and country abbreviations), I expand them randomly with certain probabilities, add country names in the local language and other popular languages, etc.
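A minimal sketch of that kind of probabilistic expansion, for illustration only; the lookup table and probability here are invented, and the real training pipeline does quite a bit more:

```python
import random

# Hypothetical illustration of the idea described above (not libpostal's
# actual training code): given an abbreviated OSM value, emit the expanded
# form with some probability so the parser sees both variants.
STATE_NAMES = {"NY": "New York", "CA": "California", "MA": "Massachusetts"}

def maybe_expand_state(abbrev, p_expand=0.6):
    """Return the full state name with probability p_expand, else the abbreviation."""
    full = STATE_NAMES.get(abbrev)
    if full is not None and random.random() < p_expand:
        return full
    return abbrev
```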
For libpostal expand, toponym/place abbreviations will start to be included with the introduction of GeoDB to the build (which includes state abbreviations, city alternate names, and everything in GeoNames). I've held off on that thus far because GeoDB in its current form takes up a lot of space on disk, but since it's a requirement for address parsing, I'm currently investigating a much more compact representation.
@thatdatabaseguy have you looked at Who's on First as an analog to GeoNames? Could be a more compact representation since there's a lot less cruft.
@riordan indeed, I have been following WoF and spoke with Nathaniel and Aaron about it when I was last out in SF.
At some point WoF may be the backing store for libpostal's geo disambiguation, but a few things are still missing or sporadic in WoF, notably multilingual toponyms and postal code coverage.
As soon as WoF is up to par with GeoNames on those particular points, it should be worth the (not-so-trivial) effort to switch.
In the meantime, if we can label phrases in a string with GeoNames ids, it should be relatively easy to join to WoF for the interested user.
Also re: size, I've refactored the GeoDB and reduced it to something more reasonable, such that it can be downloaded with libpostal for the address parser. It wasn't large because of cruft in GeoNames but because of how many keys we were storing in the on-disk DB. I've now converted that to a memory-efficient trie for the keys (which share long prefixes), plus a sparse matrix for the potential resolutions of an entity feature, and only an id lookup table is stored on disk. It clocks in at ~500MB, which is still hefty but much more tolerable than the previous size (11GB).
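As a rough illustration of that layout (not the actual GeoDB format), here's a sketch using the third-party marisa-trie and scipy packages; the keys and GeoNames ids are just examples:

```python
import marisa_trie
from scipy.sparse import csr_matrix

# Keys (normalized place-name phrases) share long prefixes, so a trie
# compresses them well and maps each key to a dense integer id.
key_to_candidates = {
    u"new york": [5128581, 5128638],   # illustrative GeoNames ids: city, state
    u"new york city": [5128581],
    u"new orleans": [4335045],
}
trie = marisa_trie.Trie(list(key_to_candidates))

# Each key can resolve to multiple candidate entities; a sparse matrix of
# (key_id, candidate_slot) -> entity_id keeps that mapping compact in memory.
rows, cols, data = [], [], []
for key, candidates in key_to_candidates.items():
    for slot, geonames_id in enumerate(candidates):
        rows.append(trie[key])
        cols.append(slot)
        data.append(geonames_id)
resolutions = csr_matrix((data, (rows, cols)))

# Only the id -> full record lookup table needs to stay on disk.
candidate_ids = resolutions.getrow(trie[u"new york"]).data
```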
I've done some preliminary work integrating Who's on First with Wikidata entries, so we may be able to build a pipeline to cross-reference some of the missing elements you've identified from there, particularly for administrative regions.
awesome! :star2: Wikidata is the jam. That would certainly help with multilingual toponyms and authoritative names, assuming coverage is good. We only need the postal codes themselves, not necessarily the lat/lons, so it doesn't matter if those come from GeoNames.
You may be interested in libpostal's name deduping (Python): https://github.com/openvenues/libpostal/blob/master/scripts/geodata/names/deduping.py. An example concrete implementation for neighborhoods can be found in https://github.com/openvenues/libpostal/blob/master/scripts/geodata/polygons/reverse_geocode.py.
After reviewing the literature and some trial and error, I found Soft-TFIDF (from this paper: https://www.cs.cmu.edu/~pradeepr/papers/ijcai03.pdf) to be quite useful for approximate deduping of short named entities, including geographic entities. My implementation in libpostal has been used to create a concordance between OSM neighborhoods (where we reliably have localized/local-script names) and Zetashapes/Quattroshapes, which is being used to augment the address parser.
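For reference, here's a condensed sketch of the Soft-TFIDF scoring idea from the Cohen/Ravikumar/Fienberg paper, not the libpostal implementation linked above. The paper uses Jaro-Winkler as the inner token similarity; difflib.SequenceMatcher is substituted here just to keep the example stdlib-only:

```python
import math
from collections import Counter
from difflib import SequenceMatcher

def sim(a, b):
    # Stand-in for Jaro-Winkler: ratio in [0, 1] of matching subsequences.
    return SequenceMatcher(None, a, b).ratio()

def tfidf_weights(tokens, doc_freq, n_docs):
    # Length-normalized log-TF * log-IDF weights, per the paper.
    tf = Counter(tokens)
    raw = {w: math.log(tf[w] + 1) * math.log(n_docs / (doc_freq.get(w, 0) + 1))
           for w in tf}
    norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
    return {w: v / norm for w, v in raw.items()}

def soft_tfidf(tokens_s, tokens_t, doc_freq, n_docs, theta=0.9):
    if not tokens_s or not tokens_t:
        return 0.0
    ws = tfidf_weights(tokens_s, doc_freq, n_docs)
    wt = tfidf_weights(tokens_t, doc_freq, n_docs)
    score = 0.0
    for w, weight_w in ws.items():
        # Each token in S is matched to its closest token in T, and the pair
        # contributes only if the similarity clears the threshold theta.
        v, s = max(((v, sim(w, v)) for v in wt), key=lambda pair: pair[1])
        if s >= theta:
            score += weight_w * wt[v] * s
    return score
```

This tolerates both token-level typos ("bushwyck" vs. "bushwick") and missing/extra tokens, which is why it works well on short place names.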
This method could be useful for doing similar entity matching between WoF and Wikidata, or between GeoNames and any of the above. To reduce the number of comparisons that need to be made (so it's not O(N^2); this is sometimes known as "blocking"), I currently use R-tree and geohash-based indices, since we're dealing with points and polygons. For names alone, one would need to do something like MinHashing shingles, or generating spellchecker-style edits a la Norvig and checking all the variants against the index, which is still O(1) in the cardinality of the candidate set.
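To illustrate the second option, a minimal Norvig-style edit generator (per http://norvig.com/spell-correct.html): probe each edit-distance-1 variant of a name against the candidate index, so the number of lookups depends on the name's length and the alphabet, not on the index size. The `name_index` below is a hypothetical set of indexed names:

```python
import string

def edits1(name, alphabet=string.ascii_lowercase + " "):
    """All strings within one edit (delete/transpose/replace/insert) of name."""
    splits = [(name[:i], name[i:]) for i in range(len(name) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

# Candidate generation against a (hypothetical) set of indexed names:
# candidates = {v for v in edits1("bushwick") if v in name_index}
```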
I'll take a look at concordances again at some point (and contribute them to WoF), but if this code is helpful in the meantime, feel free to use it.
@boshkins decided this was better implemented in the standard expansion dictionaries.
If you pull latest, state abbreviations for the US, Canada (English and French), and Australia are handled by the expand API. Since many of the state abbreviations are potentially ambiguous with legitimate tokens, it also returns one version of the string that leaves the abbreviation alone.
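A quick check of that behavior via libpostal's Python bindings (assuming pypostal is installed and built against the latest data files):

```python
# Expansions should include a form with 'massachusetts' spelled out, as well
# as a version that leaves the ambiguous 'ma' token untouched.
from postal.expand import expand_address

for variant in expand_address('123 Main St Springfield MA'):
    print(variant)
```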
Awesome!
Al,
Does the library recognize the US state abbreviations? Apparently not:
However, I believe I have seen handling of state abbreviations mentioned in your commit comments. Could you clarify?
Thanks! Anatoly