openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

[German] Leading numbers on streets (roman & arabic) #258

Open tobwen opened 7 years ago

tobwen commented 7 years ago

In German addressing, street names can carry leading numbers, e.g. II. Bickestraße, 44263 Dortmund.

libpostal messes up some things:

> II. Bickestraße 12, 44263 Dortmund

Result:
{
  "road": "ii. bickestraße",
  "house_number": "12",
  "postcode": "44263",
  "city": "dortmund"
}

Shouldn't this be normalized to "road": "2. bickestraße" ?

> 2. Bickestraße 12, 44263 Dortmund

Result:
{
  "house_number": "2",
  "road": "bickestraße",
  "house_number": "12",
  "postcode": "44263",
  "city": "dortmund"
}

Whoops, this got messed up :(

> Dortmund, 2. Bickestraße 12

Result:
{
  "road": "dortmund 2 bickestraße",
  "house_number": "12"
}

Oh no, also messed up.

> 44263 Dortmund, II. Bickestraße 12

Result:
{
  "postcode": "44263",
  "city": "dortmund",
  "road": "ii. bickestraße",
  "house_number": "12"
}

Much better.

How can I help to fix this?

albarrentine commented 7 years ago

So the 2. Bickestraße form has to do with a few things. First there's tokenization, which does not treat ordinal numbers as written in parts of continental Europe as a single discrete token, so the input becomes:

[('2', NUMERIC), ('.', PERIOD), ('Bickestraße', WORD), ('12', NUMERIC)]

A single period token on its own is ignored by the parser, so effectively libpostal doesn't see a difference between "2. Bickestraße" and "2 Bickestraße" at present. I'd need to think through the edge cases in other languages, but it's possible to add that case to the tokenizer, which would likely help elsewhere as well.
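To illustrate the tokenizer change described above, here is a minimal Python sketch that post-processes a token list in the same `(text, type)` shape as the example. It is not libpostal's actual tokenizer (which is written in C); the `merge_ordinals` function and the `ORDINAL` token type are hypothetical names used only for illustration, and the rule "a number plus period counts as an ordinal only when a word follows" is an assumption that would need checking against other languages.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (text, token type), mirroring the tuples shown above


def merge_ordinals(tokens: List[Token]) -> List[Token]:
    """Fuse NUMERIC + PERIOD into one token when a WORD follows directly."""
    merged: List[Token] = []
    i = 0
    while i < len(tokens):
        text, kind = tokens[i]
        period_next = i + 1 < len(tokens) and tokens[i + 1][1] == "PERIOD"
        word_after = i + 2 < len(tokens) and tokens[i + 2][1] == "WORD"
        if kind == "NUMERIC" and period_next and word_after:
            # ORDINAL is a hypothetical token type, not an existing libpostal one.
            merged.append((text + ".", "ORDINAL"))
            i += 2  # consume both the number and the period
        else:
            merged.append((text, kind))
            i += 1
    return merged


tokens = [("2", "NUMERIC"), (".", "PERIOD"), ("Bickestraße", "WORD"), ("12", "NUMERIC")]
print(merge_ordinals(tokens))
# [('2.', 'ORDINAL'), ('Bickestraße', 'WORD'), ('12', 'NUMERIC')]
```

With the ordinal kept as one token, "2. Bickestraße" and "2 Bickestraße" would no longer look identical to whatever consumes the tokens.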

Second, something that was brought to my attention a while back is that English speakers who live in Germany, tourists, etc. will often write their addresses the English way as "12 Foostraße" instead of "Foostraße 12", so with some small probability in the training data we'll invert the order of road and house_number in most of the continental European examples. So seeing numbers on either side of the street in training can potentially create some confusion for the parser, especially if e.g. OSM typically uses Roman numerals for streets with ordinals.

Indeed it looks like it's II. Bickestraße in OSM, and that may be common throughout Germany. One thing we could potentially do to correct for that is to convert Roman numerals to Arabic numerals automatically some portion of the time in the training data so it sees both forms.
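A rough Python sketch of the Roman-to-Arabic conversion mentioned above, applied to a leading street ordinal. All names here (`roman_to_int`, `arabic_ordinal_form`) are hypothetical and not part of libpostal; the sketch assumes only numerals built from I, V and X (1 to 39) are worth handling, and it would still misread a leading initial such as "V." as a numeral, so a real pipeline would need extra guards.

```python
import re

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10}
# Well-formed Roman numerals from 1 to 39 (assumed to be enough for street ordinals).
VALID_ROMAN = re.compile(r"^X{0,3}(IX|IV|V?I{0,3})$")
# A Roman numeral plus period at the very start, followed by whitespace.
LEADING_ORDINAL = re.compile(r"^([IVX]+)\.(?=\s)", re.IGNORECASE)


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral (I..XXXIX) to an integer, rejecting malformed input."""
    numeral = numeral.upper()
    if not numeral or not VALID_ROMAN.match(numeral):
        raise ValueError(f"not a supported Roman numeral: {numeral!r}")
    total = 0
    # Pair each symbol with its successor; a smaller symbol before a larger one subtracts.
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN_VALUES[ch]
        total += -value if ROMAN_VALUES.get(nxt, 0) > value else value
    return total


def arabic_ordinal_form(street: str) -> str:
    """Rewrite 'II. Bickestraße' as '2. Bickestraße'; leave other names untouched."""
    match = LEADING_ORDINAL.match(street)
    if not match:
        return street
    try:
        number = roman_to_int(match.group(1))
    except ValueError:
        return street
    return f"{number}." + street[match.end():]


print(arabic_ordinal_form("II. Bickestraße"))  # 2. Bickestraße
```

Matching case-insensitively also covers mixed-case forms such as "Ii." that show up in some data sets.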

albarrentine commented 7 years ago

Separately, is writing city before street like Dortmund, 2. Bickestraße 12 common? We have an alternative format like that, but it mostly applies to post-Soviet states at present.

tobwen commented 7 years ago

I'm gonna sit down tonight and investigate these special cases. There are even more examples like the above-mentioned street. Incidentally, there are historical reasons for this spelling, which can be explained both by the amalgamation of municipalities and by the splitting of large roads.

The spelling 2. Bickestraße isn't common, but it can happen if, to be on the safe side, someone picks "2." instead of the Roman variant "II.". Just think of people whose languages don't use the Latin script and who have never heard of those funny vertical bars. Often you can also find "Ii." (uppercase, lowercase) - we all know where that comes from, haha.

albarrentine commented 7 years ago

The NW government data set from OpenAddresses also uses forms like "II. Rote-Haag-Weg" (there are also some forms like Johannes-Paul-II.-Straße where using "2." would be even more rare, but for the purposes of the training data it's fine to replace all of these cases with an Arabic number a small fraction of the time, like 1-5% of cases, just enough that the parser has to deal with it).
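The low-rate replacement described above could look roughly like the following Python sketch. It is not libpostal's training code: `augment_street`, the 3% default rate and the small lookup table are assumptions for illustration, and the regex treats any I/V/X run followed by a period at the start of a name or after a space or hyphen as an ordinal, which can misfire on initials.

```python
import random
import re

# Small lookup table for illustration only; a real pipeline would cover more values.
ROMAN_TO_ARABIC = {"I": "1", "II": "2", "III": "3", "IV": "4", "V": "5",
                   "VI": "6", "VII": "7", "VIII": "8", "IX": "9", "X": "10"}

# A Roman ordinal ("II.") at the start of the name or right after a space or hyphen.
ROMAN_ORDINAL = re.compile(r"(?<![^\s-])([IVX]+)\.")


def augment_street(street: str, rng: random.Random, rate: float = 0.03) -> str:
    """With probability `rate`, replace Roman ordinals in `street` with Arabic ones."""
    if rng.random() >= rate:
        return street  # most training examples keep their original form

    def swap(match: re.Match) -> str:
        arabic = ROMAN_TO_ARABIC.get(match.group(1))
        return f"{arabic}." if arabic else match.group(0)

    return ROMAN_ORDINAL.sub(swap, street)


rng = random.Random(0)  # fixed seed so the augmentation is reproducible
streets = ["II. Rote-Haag-Weg", "Johannes-Paul-II.-Straße", "Straße des 17. Juni"]
for name in streets:
    # rate=1.0 forces the replacement so the effect is visible in a tiny demo
    print(name, "->", augment_street(name, rng, rate=1.0))
```

Passing the random generator in explicitly keeps the sampling reproducible, and the rate can be tuned anywhere in the 1-5% range mentioned above.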

tobwen commented 7 years ago

A different case, but since we're talking about arabic numbers: there's also a Straße des 17. Juni in Frechen (and in Berlin of course).