**Open** · chenkovsky opened this issue 6 years ago
Related to: https://github.com/openvenues/libpostal/issues/71#issuecomment-347062545. All the names in OSM include the fully specified suffix, so our training data only includes "辽宁省" at present, not "辽宁".

For toponyms, the parser relies heavily on gazetteer features, whereas for street-level components like road/house_number it works more like other sequence taggers in NLP, using more individual word and structural features to make predictions. Usually if it gets a toponym wrong, that means the toponym was not found in OSM, OpenAddresses, GeoPlanet, etc., and either needs to be added manually or can be handled programmatically on libpostal's side if there's a common naming pattern.

In this case, for the next version, we can strip certain place suffixes like "省", "市", etc. some random percentage of the time, so that the training data includes more examples like the above and the parser can include the short forms in its dynamically-built gazetteers.
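The suffix-stripping augmentation described above could be sketched roughly like this. Note this is a hypothetical illustration, not libpostal's actual implementation; the suffix list, function name, and probability are all made up for the example:

```python
import random

# Common Chinese administrative suffixes (illustrative subset only)
PLACE_SUFFIXES = ("省", "市", "县", "区")
STRIP_PROBABILITY = 0.5  # arbitrary choice for this sketch

def maybe_strip_suffix(toponym, rng=random):
    """Randomly drop a trailing administrative suffix from a toponym,
    so training data contains both "辽宁省" and the short form "辽宁"."""
    for suffix in PLACE_SUFFIXES:
        # Keep at least one character so "省" alone is never emptied
        if toponym.endswith(suffix) and len(toponym) > len(suffix):
            if rng.random() < STRIP_PROBABILITY:
                return toponym[: -len(suffix)]
    return toponym

# With a seeded RNG, repeated draws yield both forms:
rng = random.Random(0)
forms = {maybe_strip_suffix("辽宁省", rng) for _ in range(100)}
print(forms)
```

Applied during training-data generation, this would let the parser's dynamically-built gazetteers pick up the suffix-free forms as well.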
@albarrentine Is it possible to do our own training for Chinese addresses? We can obtain many more addresses than OSM has, which might yield a better result.
Hi @calvinzhan, this has been discussed in a few other issues, but generally we don't encourage training custom models (and can't provide support for models trained on proprietary data sets, as it would be difficult to impossible to debug a parser that isn't performing as well as expected without knowing how the training data was constructed).
The entire training pipeline is open source and part of this repo, and the data set we use is available on the Internet Archive. If you or someone on your team is familiar with NLP/sequence models, it's possible to replicate the process on a different data set.
```python
parse_address("辽宁辽阳")
# output: [('辽宁辽', 'road'), ('阳', 'city')]
```

Actually, "辽宁" is the province and "辽阳" is the city.