openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

[German] suburbs separated by hyphen #261

Open tobwen opened 7 years ago

tobwen commented 7 years ago

In German addresses, it's typical to append the destination suburb to the city it belongs to, separated by a (hard) hyphen. Since that form is not part of OpenStreetMap and similar sources, you weren't able to train on it. But it's used on normal postal addresses.

Some near-real-life examples (I deal with a few hundred addresses a week); Unna is the city, Hemmerde is its suburb.

  1. Hemmerder Hellweg 120, 59427 Unna => acceptable

    {
    "road": "hemmerder hellweg",
    "house_number": "120",
    "postcode": "59427",
    "city": "unna"
    }
  2. Hemmerder Hellweg 120, 59427 Hemmerde => acceptable

    {
    "road": "hemmerder hellweg",
    "house_number": "120",
    "postcode": "59427",
    "suburb": "hemmerde"
    }
  3. Hemmerder Hellweg 120, 59427 Unna-Hemmerde => fail :-1:

    {
    "road": "hemmerder hellweg",
    "house_number": "120",
    "postcode": "59427",
    "country": "unna-hemmerde"
    }
  4. Hemmerder Hellweg 120, 59427 Unna Hemmerde => fail :-1:

    {
    "road": "hemmerder hellweg",
    "house_number": "120",
    "postcode": "59427",
    "city": "unna",
    "country": "hemmerde"
    }
albarrentine commented 7 years ago

Ok. For the "Unna-Hemmerde" form, we don't split any tokens on hyphens in the parser (for parsing we take more of a "do no harm" approach). We can add examples like that to the training data for v1.1, but the full token has to be labeled as either city or suburb. Let me know which one it should be.
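To make the "full token gets one label" constraint concrete, here is a minimal sketch of what such a training example could look like. The function names and the list-of-pairs representation are hypothetical illustrations, not libpostal's actual training-data format, and the choice between `city` and `suburb` is left as a parameter because it is still undecided in this thread.

```python
def label_locality(city, suburb, label):
    """Join city and suburb with a hyphen and tag the result as ONE token.

    Hypothetical helper: the hyphenated form is never split, so it can only
    carry a single label, either "city" or "suburb".
    """
    if label not in ("city", "suburb"):
        raise ValueError("label must be 'city' or 'suburb'")
    return [(f"{city}-{suburb}", label)]


def training_example(road, house_number, postcode, city, suburb, label):
    """Build a (token, label) sequence for a German-style address line."""
    # Road names are split on whitespace only; hyphens are left untouched.
    tagged = [(token, "road") for token in road.split()]
    tagged.append((house_number, "house_number"))
    tagged.append((postcode, "postcode"))
    tagged.extend(label_locality(city, suburb, label))
    return tagged


example = training_example(
    "Hemmerder Hellweg", "120", "59427", "Unna", "Hemmerde", label="city"
)
```

With `label="city"`, the last element of `example` is the single pair `("Unna-Hemmerde", "city")`; passing `label="suburb"` changes only that tag.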

There's more wiggle room on the "Unna Hemmerde" case. The standard address format for Germany comes from address-formatting and we can make some random insertions for less-common formats on the libpostal side. Typically, suburb comes before city unless otherwise specified, so that can simply be inverted some percentage of the time. How common is this and are there any other countries where it applies?
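The inversion idea can be sketched roughly as follows. This is only an illustration under assumptions: `order_locality` and `invert_prob` are made-up names, and the real training-data generation in libpostal may work quite differently.

```python
import random


def order_locality(suburb, city, invert_prob, rng):
    """Return (token, label) pairs for the locality part of an address.

    Default order is suburb before city; with probability `invert_prob`
    the order is flipped to city before suburb (the "Unna Hemmerde" form).
    """
    parts = [(suburb, "suburb"), (city, "city")]
    if rng.random() < invert_prob:
        parts.reverse()
    return parts


# A seeded generator keeps the augmentation reproducible between runs.
rng = random.Random(0)
samples = [order_locality("Hemmerde", "Unna", 0.3, rng) for _ in range(10)]
```

The value `0.3` is an arbitrary placeholder; the right percentage depends on how common the city-first form actually is.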

tobwen commented 7 years ago

I've got a well-kept index of place names and their parts and settlements. I would like to make this available to the project. How can I send it to you?

The "city-suburb" spelling is common in Germany, although the postcode now resolves many cases on its own. Deutsche Post AG has a very intelligent system that determines the correct destination from the street name, city, and suburb as well as the recipient's name. In large cities, the district is often omitted because the postcode is already unambiguous. However, a combination of city and district/suburb is still quite common.

My work so far has dealt only with the German state and its nomenclature, so unfortunately I cannot say how this is handled in other countries.