openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

Inverse component order in German addresses #264

Open my13901 opened 6 years ago

my13901 commented 6 years ago

The parsing function seems to return correct results only when the address string is entered in this order: house number + street + state + city + country + postcode. If the address is entered in a different order, the returned results may be incorrect.

Here are examples:

Input addresses: “München Maximilian str. 5”, “81929 DE München Landshamer Straße 33”

The returned results:

[
    {
        "label": "house",
        "value": "münchen"
    },
    {
        "label": "road",
        "value": "maximilian str."
    },
    {
        "label": "house_number",
        "value": "5"
    }
]

--> Not acceptable.

[
    {
        "label": "house_number",
        "value": "81929"
    },
    {
        "label": "road",
        "value": "de münchen landshamer straße"
    },
    {
        "label": "house_number",
        "value": "33"
    }
]

--> Not acceptable.

Input addresses: “Maximilian str. 5 München”, “33 Landshamer Straße München DE 81929”

The returned results:

[
    {
        "label": "road",
        "value": "maximilian str."
    },
    {
        "label": "house_number",
        "value": "5"
    },
    {
        "label": "city",
        "value": "münchen"
    }
]

--> Acceptable.

[
    {
        "label": "house_number",
        "value": "33"
    },
    {
        "label": "road",
        "value": "landshamer straße"
    },
    {
        "label": "city",
        "value": "münchen"
    },
    {
        "label": "country",
        "value": "de"
    },
    {
        "label": "postcode",
        "value": "81929"
    }
]

--> Acceptable.

Is there any way to solve this problem? Thanks.
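For anyone who wants to reproduce the outputs above, here is a minimal sketch. It assumes the Python bindings (pypostal) are installed and that `parse_address` returns (value, label) pairs; the helper names `to_records` and `parse_all` are made up for this example and are not part of libpostal.

```python
import json


def to_records(parsed):
    """Turn (value, label) pairs into the label/value records shown above."""
    return [{"label": label, "value": value} for value, label in parsed]


def parse_all(addresses, parse):
    """Parse each address with `parse`, any callable returning (value, label) pairs."""
    return {address: to_records(parse(address)) for address in addresses}


try:
    # Assumption: libpostal and its Python bindings (pypostal) are installed.
    from postal.parser import parse_address
except ImportError:
    parse_address = None  # bindings missing, so nothing is parsed below

if parse_address is not None:
    inputs = [
        "München Maximilian str. 5",
        "Maximilian str. 5 München",
    ]
    for address, records in parse_all(inputs, parse_address).items():
        print(address)
        print(json.dumps(records, ensure_ascii=False, indent=4))
```

Passing the parser in as an argument keeps the helpers usable with a stub when the bindings are not available.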

albarrentine commented 6 years ago

It does not need to be house_number + street (in your example above "Maximilianstr. 5" works fine and that is street + house_number) or any other American-esque format.

The formats libpostal trains with are specific to each country around the world and are defined in the address-formatting repo. These tend to be derived from sources like the UPU (format for Germany: http://www.upu.int/fileadmin/documentsFiles/activities/addressingUnit/deuEn.pdf). According to the UPU, Deutsche Post, etc., the most commonly used format is: "Landshamer Straße 33, 81929 München DE" (which libpostal also handles).

Strangely, the only other time I've seen a German address written with city before street (like "München Maximilian str. 5") is in another recent GitHub issue on this project (#258). Is it common?

In some of the former countries of the Soviet Union there's a format like this, which tends to be written more by older people who grew up in the USSR. For this case, in v1.1 there are two alternate formats that we use for 10% of the addresses in the training data in the 15 post-Soviet states:

                    {{{country}}}
                    {{{postcode}}}
                    {{{city}}}
                    {{{city_district}}}
                    {{{suburb}}}
                    {{{state_district}}}
                    {{{state}}}
                    {{{country_region}}}
                    {{{road}}} {{{house_number}}}
                    {{{house}}}

and

                    {{{country}}}
                    {{{postcode}}}
                    {{{city_district}}}
                    {{{suburb}}}
                    {{{city}}}
                    {{{state_district}}}
                    {{{state}}}
                    {{{country_region}}}
                    {{{road}}} {{{house_number}}}
                    {{{house}}}
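As a rough illustration of how such a template can turn labeled components into a training string, here is a small sketch. It is not libpostal's actual training code: the `render` and `format_for_training` helpers, the `PRIMARY_FORMAT` stand-in, and the choice of separator are all assumptions made for this example. Only the alternate template and the 10% figure come from the description above.

```python
import random
import re

# First alternate format quoted above, one template line per entry.
ALTERNATE_FORMAT = [
    "{{{country}}}",
    "{{{postcode}}}",
    "{{{city}}}",
    "{{{city_district}}}",
    "{{{suburb}}}",
    "{{{state_district}}}",
    "{{{state}}}",
    "{{{country_region}}}",
    "{{{road}}} {{{house_number}}}",
    "{{{house}}}",
]

# Hypothetical stand-in for a street-first primary format, modeled on
# "Landshamer Straße 33, 81929 München DE" from this thread.
PRIMARY_FORMAT = [
    "{{{house}}}",
    "{{{road}}} {{{house_number}}}",
    "{{{postcode}}} {{{city}}} {{{country}}}",
]

PLACEHOLDER = re.compile(r"\{\{\{(\w+)\}\}\}")


def render(template_lines, components, separator=", "):
    """Fill each template line, dropping lines whose components are all missing."""
    rendered = []
    for line in template_lines:
        text = PLACEHOLDER.sub(lambda m: components.get(m.group(1), ""), line)
        text = " ".join(text.split())  # collapse gaps left by missing components
        if text:
            rendered.append(text)
    return separator.join(rendered)


def format_for_training(components, alternate_probability=0.1, rng=random):
    """Use the alternate ordering for roughly `alternate_probability` of addresses."""
    use_alternate = rng.random() < alternate_probability
    template = ALTERNATE_FORMAT if use_alternate else PRIMARY_FORMAT
    return render(template, components)


address = {
    "road": "Landshamer Straße",
    "house_number": "33",
    "postcode": "81929",
    "city": "München",
    "country": "DE",
}
print(render(PRIMARY_FORMAT, address))
print(render(ALTERNATE_FORMAT, address, separator=" "))
```

With a space as separator, the alternate template yields a city-before-street string, which is the kind of input the parser would need to see during training to label it correctly.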

We can do something similar for Germany to accommodate the alternate format in 1.1. What would be the correct ordering of components (assuming every possible component like state, state_district, city_district, suburb etc. is present, no matter if they're commonly written or not) for the alternate format?

Note: parser changes only take effect after the training data has been rebuilt and the model has been retrained and pushed to S3, so this will have to wait for the 1.1 release. Switching branches does not change parser behavior or results because the parser will still be using the old model.

tobwen commented 6 years ago

> Strangely, the only other time I've seen a German address written with city before street (like "München Maximilian str. 5") is in another recent GitHub issue on this project (#258). Is it common?

I have seen this more often when typing into a geocoder. Some users enter the place name first to narrow down the search results.