openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

Overlapping state_district from OSM #332

Open nlehuby opened 6 years ago

nlehuby commented 6 years ago

Hi!

I was checking out the awesome libpostal :heart: , and I have some questions.

I have noticed that, in many cases, the state_district zone type is mapped to several admin_levels in the yaml country files for OSM boundaries. For instance,

But in both cases, each admin_level represents a distinct zone type that aims to cover the whole country: admin_level 7 zones in France are a subdivision of admin_level 6 zones, so it feels quite strange to make them both state_districts. The state_district level is not used in the OpenCage address formatter for France, so it does not seem to create parsing errors. But is there a reason to keep both admin_levels as state_district?

Would you accept a PR to change the overlapping mapping in some countries?

albarrentine commented 6 years ago

Bon(jour? soir?) Noémie,

Yes, at present we map everything that's between a city and a "state" (admin1) to "state_district", and it's sort of a catchall, since those admin levels tend to be used less frequently and are different around the world. The parser's machine learning model (a Conditional Random Field) has a quadratic term in the number of labels and it affects the size of the model as well, so we try to add new labels only as a last resort.
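To illustrate the quadratic term: in a first-order linear-chain CRF with dense parameters (an assumption made for this sketch; libpostal's actual model layout may differ), the label-to-label transition weights grow with the square of the number of labels, while the per-feature state weights grow only linearly. The function name and the label/feature counts below are hypothetical.

```python
def crf_parameter_counts(num_labels, num_features):
    """Rough parameter counts for a dense first-order linear-chain CRF.

    Assumption: one weight per (previous label, current label) pair and
    one weight per (feature, label) pair. Real models are often sparse.
    """
    transition_weights = num_labels * num_labels  # quadratic in labels
    state_weights = num_features * num_labels     # linear in labels
    return transition_weights, state_weights


if __name__ == "__main__":
    # Hypothetical sizes, only to show how the counts scale.
    for num_labels in (20, 21, 40):
        transitions, states = crf_parameter_counts(num_labels, 1000)
        print(num_labels, transitions, states)
```

Under these assumptions, going from 20 to 21 labels adds 41 transition weights, and Viterbi decoding cost per token also scales with the square of the label count.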

Re: OpenCage, their formats are designed more for localized display from something like OSM's addr:* key-value pairs, so most of the time a state_district tag would not be used unless there were no city information available. Our training addresses for libpostal use the OpenCage formats as the default, and then we make lots of random changes with certain probabilities to better represent the formats we'd see in the real world e.g. we might swap two components or invert the order of the address sometimes, etc. For estimating these probabilities I try to take cues from local users as much as possible, and fill in the gaps with research.
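A minimal sketch of that kind of randomized perturbation, for illustration only. This is not libpostal's training code: the function name, the choice to swap adjacent components, and the default probabilities are all assumptions.

```python
import random


def perturb_components(components, swap_prob=0.1, invert_prob=0.05, rng=None):
    """Return a copy of the ordered (label, value) pairs, randomly perturbed.

    Assumptions for this sketch: a "swap" exchanges two adjacent
    components, and an "inversion" reverses the whole address.
    """
    rng = rng or random.Random()
    result = list(components)

    # With probability swap_prob, exchange one random adjacent pair.
    if len(result) >= 2 and rng.random() < swap_prob:
        i = rng.randrange(len(result) - 1)
        result[i], result[i + 1] = result[i + 1], result[i]

    # With probability invert_prob, reverse the component order.
    if rng.random() < invert_prob:
        result.reverse()

    return result


if __name__ == "__main__":
    address = [
        ("house_number", "12"),
        ("road", "Rue de Rivoli"),
        ("postcode", "75001"),
        ("city", "Paris"),
    ]
    print(perturb_components(address, swap_prob=0.5, invert_prob=0.5))
```

In a real pipeline the probabilities would presumably be set per country or language, which is where the cues from local users come in.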

Generally there are always some issues with trying to normalize every country's admin hierarchy into a single schema. For the next version I've been considering the idea of using a single tag for place names, for the purposes of parsing, tagging each distinct place/boundary separately, and then having a separate method to resolve all of the place names in a string to something like Who's on First, GeoNames, or a user-supplied hierarchy. This would also allow us to more easily train on lots of different place name data from varying sources without having to put as much work into knowing whether the Arrondissements of Marseille are admin3 in one data set and city_district in another, etc. It would also remove some of the opinions that the parser may encode about places, make the model a little smaller, and reduce the rates of certain types of errors.

For v1.1, which I'll hopefully be able to return to soon, we have the notion of a "pseudo-component", which is where we can create an internal name like "japanese_minor_neighborhood" or "postal_city" and have special formatting for that within the Python side of libpostal that creates the training data, but when it's time to output the training example those components would alias to one of the existing tags like suburb or city (or in the proposed method above maybe we can just alias everything to place and only worry about component ordering).
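Roughly, the aliasing step could look like the sketch below. The pseudo-component names come from the description above, but which existing tag each one aliases to, the function name, and the data layout are assumptions for illustration, not the actual v1.1 code.

```python
# Hypothetical alias table: internal pseudo-component -> output tag.
# The specific pairings here are assumed, not taken from libpostal.
PSEUDO_COMPONENT_ALIASES = {
    "japanese_minor_neighborhood": "suburb",
    "postal_city": "city",
}


def alias_labels(tagged_tokens, aliases=PSEUDO_COMPONENT_ALIASES, single_tag=None):
    """Map internal labels to output tags when emitting a training example.

    tagged_tokens: list of (token, label) pairs.
    single_tag: if given (e.g. "place"), every aliased pseudo-component is
    emitted with that one tag instead, as in the single-tag idea.
    """
    output = []
    for token, label in tagged_tokens:
        if label in aliases:
            label = single_tag if single_tag is not None else aliases[label]
        output.append((token, label))
    return output


if __name__ == "__main__":
    example = [("75001", "postcode"), ("Paris", "postal_city")]
    print(alias_labels(example))
    print(alias_labels(example, single_tag="place"))
```

The internal name only matters while formatting the training address; the model itself never sees a new label, so the label count stays the same.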

Can you detail what the components should ideally be for France and/or any other countries that need more specificity, how often each is used (we usually add the state_district level only a small proportion of the time), and what kinds of parsing errors come up as a result?

A PR works, or a description here, as you prefer. Many thanks for looking these over!