Open tobwen opened 7 years ago
So the "2. Bickestraße" form has to do with a few things. First there's tokenization, which does not treat ordinal numbers as written in parts of continental Europe (digits followed by a period) as a single discrete token, so "2. Bickestraße 12" becomes:
[('2', NUMERIC), ('.', PERIOD), ('Bickestraße', WORD), ('12', NUMERIC)]
A single period token on its own is ignored by the parser, so effectively libpostal doesn't see a difference between "2. Bickestraße" and "2 Bickestraße" at present. I'd need to think through the edge cases in other languages, but it's possible to add that case to the tokenizer, which would likely help elsewhere as well.
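To make the tokenizer idea concrete, here is a rough sketch in Python. It is not libpostal's actual tokenizer; the `ORDINAL` token class, the regex, and the `tokenize` helper are all assumptions made up for illustration. The only point is that "digits + period, followed by whitespace and a letter" can be emitted as one token instead of `NUMERIC` plus `PERIOD`.

```python
import re

# Toy tokenizer, NOT libpostal's real one. ORDINAL is a hypothetical
# token class; NUMERIC, PERIOD and WORD mirror the names used above.
TOKEN_RE = re.compile(
    r"""
    (?P<ORDINAL>\d+\.(?=\s+[^\W\d_]))      # digits + period, then whitespace and a letter
  | (?P<NUMERIC>\d+)                        # plain run of digits
  | (?P<WORD>[^\W\d_]+(?:-[^\W\d_]+)*)      # letters, optionally hyphenated
  | (?P<PERIOD>\.)                          # stray period
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return (token, token_class) pairs; whitespace and other characters are skipped."""
    return [(m.group(), m.lastgroup) for m in TOKEN_RE.finditer(text)]


print(tokenize("2. Bickestraße 12"))
print(tokenize("2 Bickestraße 12"))
```

With this rule the two inputs no longer tokenize the same way, so a downstream parser could tell them apart. One edge case this sketch gets wrong: a house number at the end of a sentence followed by another word (e.g. "Foostraße 12. Dortmund") would also be read as an ordinal, which is the kind of thing that would need per-language thought.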
Second, something that was brought to my attention a while back is that English speakers who live in Germany, tourists, etc. will often write their addresses the English way as "12 Foostraße" instead of "Foostraße 12", so with some small probability in the training data we'll invert the order of road and house_number in most of the continental European examples. So seeing numbers on either side of the street in training can potentially create some confusion for the parser, especially if e.g. OSM typically uses Roman numerals for streets with ordinals.
Indeed it looks like it's II. Bickestraße in OSM, and that may be common throughout Germany. One thing we could potentially do to correct for that is to convert Roman numerals to Arabic numerals automatically some portion of the time in the training data so it sees both forms.
Separately, is writing city before street like "Dortmund, 2. Bickestraße 12" common? We have an alternative format like that, but it mostly applies to post-Soviet states at present.
I'm gonna sit down tonight and investigate these special cases. There are even more examples like the above-mentioned street. Incidentally, there are historical reasons for this spelling: it can come from the amalgamation of municipalities as well as from the splitting of large roads.
The spelling "2. Bickestraße" isn't common, but it can happen if, to be on the safe side, someone chose "2." instead of the Roman variant "II.". Just think of people whose languages don't use the Latin script and who have never heard of those funny vertical bars. You'll also often find "Ii." (uppercase, lowercase); we all know where that comes from, haha.
The NW government data set from OpenAddresses also uses forms like "II. Rote-Haag-Weg" (there are also some forms like Johannes-Paul-II.-Straße where using "2." would be even more rare, but for the purposes of the training data it's fine to replace all of these cases with an Arabic number a small fraction of the time, like 1-5% of cases, just enough that the parser has to deal with it).
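A rough sketch of what that augmentation step could look like, in Python. This is not code from libpostal's training pipeline; `roman_to_int`, `arabicize_ordinals`, the regex, and the default probability of 3% (picked from the 1-5% range above) are all assumptions for illustration. It only rewrites uppercase Roman numerals that are directly followed by a period.

```python
import random
import re

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

# A well-formed, non-empty, uppercase Roman numeral directly followed by a period.
ROMAN_ORDINAL_RE = re.compile(
    r"\b(?=[IVXLCDM])"
    r"(M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\."
)


def roman_to_int(numeral):
    """Convert a well-formed Roman numeral such as 'XIV' to an integer."""
    total = 0
    prev = 0
    for ch in reversed(numeral):
        value = ROMAN_VALUES[ch]
        if value < prev:
            total -= value  # subtractive notation, e.g. the I in IV
        else:
            total += value
            prev = value
    return total


def arabicize_ordinals(text, probability=0.03, rng=random):
    """Replace each Roman ordinal ('II.') with its Arabic form ('2.') with the given probability."""
    def replace(match):
        if rng.random() < probability:
            return f"{roman_to_int(match.group(1))}."
        return match.group(0)

    return ROMAN_ORDINAL_RE.sub(replace, text)


# probability=1.0 forces the replacement so the effect is visible.
print(arabicize_ordinals("II. Bickestraße", probability=1.0))
print(arabicize_ordinals("Johannes-Paul-II.-Straße", probability=1.0))
```

Caveats of this sketch: single-letter initials that happen to be Roman digits ("C. Schmidt", "V. Straße") would also be converted, and the mixed-case "Ii." variant is not matched, so a real version would need extra guards.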
A different case, but since we're talking about Arabic numbers: there's also a "Straße des 17. Juni" in Frechen (and in Berlin, of course).
In the German street naming scheme, street names can have leading numbers, e.g. "II. Bickestraße, 44263 Dortmund". libpostal messes up some things:
Shouldn't this be normalized to "road": "2. bickestraße"?
Whoops, this got messed up :(
Oh no, also messed up.
Much better.
How can I help to fix this?