openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

Chinese toponym suffixes #343

Open chenkovsky opened 6 years ago

chenkovsky commented 6 years ago

parse_address("辽宁辽阳")
# output: [('辽宁辽', 'road'), ('阳', 'city')]
# but '辽宁' is actually the province and '辽阳' is the city
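For comparison, the same call can be made through the Python bindings (pypostal) with the fully suffixed form, which, per the maintainer's reply below, is the form that appears in the training data. The second output shown is an assumption about what should happen, not a verified result:

# Reproduction sketch using the pypostal bindings.
from postal.parser import parse_address

# The bare form from this report misparses:
print(parse_address("辽宁辽阳"))
# reported: [('辽宁辽', 'road'), ('阳', 'city')]

# ASSUMPTION: the fully suffixed form should parse correctly, since
# (per the reply below) the OSM-derived training data contains "辽宁省":
print(parse_address("辽宁省辽阳市"))
# expected, roughly: [('辽宁省', 'state'), ('辽阳市', 'city')]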

albarrentine commented 6 years ago

Related to: https://github.com/openvenues/libpostal/issues/71#issuecomment-347062545. All the names in OSM include the fully specified suffix, so our training data currently includes "辽宁省" but not "辽宁".

For toponyms, the parser relies heavily on gazetteer features, whereas for street-level components like road/house_number it works more like other sequence taggers in NLP, relying more on individual word and structural features to make predictions. Usually, if it gets a toponym wrong, that means the toponym was not found in OSM, OpenAddresses, GeoPlanet, etc., and either needs to be added manually or can be handled programmatically on libpostal's side if there's a common naming pattern.

In this case, for the next version, we can strip certain place suffixes like "省", "市", etc. some random percentage of the time, so that the training data includes more examples like the one above and the parser can include them in its dynamically built gazetteers.
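As a rough illustration of that augmentation step (a hypothetical sketch, not libpostal's actual pipeline code; the suffix list and probability are assumptions):

# Hypothetical sketch of suffix-stripping augmentation for training data.
# Not libpostal's pipeline code; suffixes and rate here are assumptions.
import random

PLACE_SUFFIXES = ("省", "市", "区", "县")  # province, city, district, county
STRIP_PROBABILITY = 0.3  # strip "some random percentage of the time"

def augment_toponym(name):
    """Randomly return the name with a known place suffix removed, so bare
    forms like '辽宁' appear alongside '辽宁省' in the training data."""
    for suffix in PLACE_SUFFIXES:
        if name.endswith(suffix) and random.random() < STRIP_PROBABILITY:
            return name[:-len(suffix)]
    return name

# e.g. augment_toponym("辽宁省") -> "辽宁" about 30% of the time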

calvinzhan commented 6 years ago

@albarrentine Is it possible to do our own training for Chinese addresses? We can collect many more addresses than OSM has, which might give a better result.

albarrentine commented 6 years ago

Hi @calvinzhan, this has been discussed in a few other issues, but generally we don't encourage training custom models (and can't provide support for models trained on proprietary data sets, as it would be difficult, if not impossible, to debug a parser that's not performing as well as expected without knowing how the training data was constructed).

The entire training pipeline is open source and part of this repo, and the data set we use is available on the Internet Archive. If you or someone on your team is familiar with NLP/sequence models, it's possible to replicate the process on a different data set.
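For anyone exploring that route, the generated parser training files are plain text with tagged tokens. Below is a sketch of emitting extra examples in that spirit; the exact TSV field layout and label names are assumptions on my part, so verify them against the actual training files from the repo and the Internet Archive before relying on this:

# Sketch of emitting additional tagged training examples.
# ASSUMPTION: rows are tab-separated (language, country, tagged address),
# with each token followed by "/" and its component label; check the real
# training files before depending on this layout.
import csv
import sys

examples = [
    ("zh", "cn", [("辽宁省", "state"), ("辽阳市", "city")]),
    ("zh", "cn", [("辽宁", "state"), ("辽阳", "city")]),  # suffix-stripped variant
]

writer = csv.writer(sys.stdout, delimiter="\t")
for language, country, components in examples:
    tagged = " ".join(f"{token}/{label}" for token, label in components)
    writer.writerow([language, country, tagged])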