openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.

strange parsing behavior on Russian addresses #138

Open linuzer opened 7 years ago

linuzer commented 7 years ago

Hi, I get some very strange results from libpostal on Russian addresses. My goal is to normalize OSM addresses with libpostal, as well as my own addresses, so that I can geocode them with a simple string match.

Consider for example this OSM-address:

street_name:              Центральная улица
housenumber:              24
city:                            
county:                   Заводоуковский городской округ
village:
state_region:             Тюменская область
state:                    Уральский федеральный округ
postcode:

According to the address template for Russia, I pull it together into the following address string: Центральная улица 24 Заводоуковский городской округ Тюменская область Уральский федеральный округ

When I apply postal_parse(postal_normalize('...')) to that string, I would expect to get back essentially the same tags as above, just with "libpostal-normalized" strings.

But in fact I get back this:

house:             заводоуковскии
house_number:      городскои
road:              центральная улица
suburb:
city_district:
city:              округ
state_district:    тюменская
state:             область
postalcode:
country:

So "house", "house_number", "city" and "state" are certainly wrong. To me it looks like some entries in the qualifiers.txt for the Russian dictionary are missing. I added "округ" for example and recompiled, but it didn't change anything. How do I "apply" the changes?

The attached file contains some 100 more examples with lots of similar errors (although not all of them fail): sql.txt

Did I do something essentially wrong? Did I misunderstand something here? Could you please help me improve the results?

Thanks a lot!

albarrentine commented 7 years ago

There's been a lot of work done on place names in a branch called parser-data, which includes updates to both the model code itself and the training data it uses. That branch is still a work in progress and can't be merged into master yet, but there is an intermediate version of the new model that is backward compatible with master. It can be found at: https://libpostal.s3.amazonaws.com/mapzen_sample/parser_full.tar.gz. To use it (this doesn't require switching branches or anything; it's the same model as in master, trained on more/better data), just unpack the contents of the tarball into $DATA_DIR/libpostal/address_parser, where $DATA_DIR is whatever you passed in during configure (the default is /usr/local/share).
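
For reference, a minimal sketch of that download-and-unpack step using Python's standard library (assumptions: DATA_DIR is the configure-time default, and the tarball's files sit at its top level):

import os
import tarfile
import urllib.request

MODEL_URL = "https://libpostal.s3.amazonaws.com/mapzen_sample/parser_full.tar.gz"
DATA_DIR = "/usr/local/share"  # assumption: whatever was passed to ./configure

# Download the tarball and extract the model files into the parser directory.
dest = os.path.join(DATA_DIR, "libpostal", "address_parser")
os.makedirs(dest, exist_ok=True)
archive, _ = urllib.request.urlretrieve(MODEL_URL)
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(dest)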

Running the above address through the new model produces:

Result:

{
  "road": "центральная улица",
  "house_number": "24",
  "city_district": "заводоуковский городской округ",
  "state_district": "тюменская область",
  "state": "уральский федеральный округ"
}

Which I believe is correct for how we classify things. Urban okrugs are considered "city_district" in libpostal nomenclature.

The qualifiers.txt dictionary isn't that important anymore, apart from being used to randomly abbreviate things in OSM training data and to expand those abbreviations in something like postal_normalize. This is mostly because libpostal relies on memorization of place names out of OSM.

There are more improvements coming in the next full release, at which point this will all be in master, but I'd suggest just plugging in the new model for the moment.

Also I'd be careful with postal_parse(postal_normalize(address)[1]) because the results of normalization are unordered, so if there's an abbreviation that's ambiguous (maybe not as common in Russian?) it might be in position number 2. The results of expansion/normalization are meant to be treated like a set rather than an ordered list. But that may not matter as much for your use case.
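
To illustrate with the Python bindings (the English "Main St" example is ours, not from this thread):

from postal.expand import expand_address

# "St" is ambiguous in English ("Street" or "Saint"), so libpostal
# returns both expansions, in no guaranteed order.
print(expand_address("123 Main St"))  # e.g. ['123 main street', '123 main saint']

# Treat the results as a set: two strings match if any expansion agrees.
match = bool(set(expand_address("123 Main St")) &
             set(expand_address("123 Main Street")))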

linuzer commented 7 years ago

Thank you very much for the information!

I tried to use the new parser data, but maybe I'm doing something wrong, because I still get the very same results. Obviously the new parser is not active. I use libpostal with the pgsql-postal module for PostgreSQL (9.5). I'm quite sure that my $DATA_DIR is under ~/libpostal/data (at least the Makefile says so). I recompiled libpostal and pgsql-postal and restarted the postgresql service, but it does not seem to work.

When I try the command-line parser, I still get:

Result:

{ "road": "центральная улица", "house_number": "24", "house": "заводоуковскии городскои", "city": "округ", "state_district": "тюменская область", "city": "уральскии", "house": "федеральныи", "state_district": "округ" }

Any idea what could be wrong?

kind regards

albarrentine commented 7 years ago

Hm, so you downloaded https://libpostal.s3.amazonaws.com/mapzen_sample/parser_full.tar.gz, and unzipped it into ~/libpostal/data/libpostal/address_parser?

linuzer commented 7 years ago

Yes, here's the content:

~/libpostal/data/libpostal/address_parser$ ls -lah
total 759M
drwxr-xr-x 2 2002 1001 4,0K Dec 13 15:41 .
drwxr-xr-x 8 root root 4,0K Nov 29 16:43 ..
-rw-r--r-- 1 2002 1001 727M Jun  2  2016 address_parser.dat
-rw-r--r-- 1 2002 1001  11M Jun  2  2016 address_parser_phrases.trie
-rw-r--r-- 1 2002 1001  22M Jun  2  2016 address_parser_vocab.trie

linuzer commented 7 years ago

...sorry, wrong button :-((

albarrentine commented 7 years ago

Hm, that sounds like it's not pointing to the right data dir. Can you try rerunning ./configure --datadir=~/libpostal/data, then running make and sudo make install again and retrying in the client?

albarrentine commented 7 years ago

If I remember correctly, configure might need an absolute path rather than a "~".

linuzer commented 7 years ago

I did the whole thing: ./configure with the full path, make clean, make, make install. But unfortunately I get the same result.

linuzer commented 7 years ago

To make sure we're talking about the same string: I use this address string: центральная улица 24 заводоуковскии городскои округ тюменская область уральскии федеральныи округ and paste it directly into ./address_parser. It gives me this result:

Result:

{
  "road": "центральная улица",
  "house_number": "24",
  "house": "заводоуковскии городскои",
  "city": "округ",
  "state_district": "тюменская область",
  "city": "уральскии",
  "house": "федеральныи",
  "state_district": "округ"
}

albarrentine commented 7 years ago

Oh, that's a different string than what you pasted above.

I copied and pasted the first string, with accents: Центральная улица 24 Заводоуковский городской округ Тюменская область Уральский федеральный округ.

There is currently an issue with parsing place names without proper diacritics in languages that have them, mostly because OSM place names are almost always properly accented. The next version of the parser will train on both normalized and unnormalized versions of strings, but until then it will need the accents. For this reason, I would first parse the input, then normalize it. That isn't as convenient in Postgres as it would be in a general-purpose language, but it's what needs to be done currently.

linuzer commented 7 years ago

OK, but the result is almost the same:

Result:

{
  "road": "центральная улица",
  "house_number": "24",
  "house": "заводоуковский городской",
  "city": "округ",
  "state_district": "тюменская область",
  "city": "уральский",
  "state_district": "федеральный",
  "state": "округ"
}

Do you get a different one?

And when you say I should parse first, do you mean afterwards putting the different parts back together in the order given by the address template for Russia and passing that to the normalize function? But in that case libpostal would not fill in the missing tags, would it? So the normalized string would not be "complete". How could I then "geocode" a custom address when it contains more (or fewer) parts than the OSM address?

albarrentine commented 7 years ago

Ok. I'm actually getting that result on my Mac but not on my Linux machine, which may have used a slightly more updated codebase. Yesterday I merged a few commits into master that were definitely used when training the intermediate model, which seemed to eliminate any discrepancies between the models, but there may have been slightly more changes in there for the last training.

In particular, the intermediate model may have been trained using NFC unicode normalization, whereas master is still using NFD. That's a change that I can't merge into master right now because it would break the standard model (and while the intermediate model improves many things, it also breaks some of the tests).

I trained a quick Russia-only parser using a clean master checkout with the new training data, and the unnormalized address works fine with that build. That can be found at: https://libpostal.s3.amazonaws.com/parser_samples/parser_ru.tar.gz. It's smaller and is only trained on Russia, so might even work better for your use case, though I wouldn't rely on that link as the S3 directory structure may change with the new release (the plan is to publish a few smaller single-country or single-language models in addition to the global one).

For the normalized version, it will probably have to wait until parser-data is ready to merge into master and become the default model, which will take at least a few weeks.

albarrentine commented 7 years ago

For normalization, the preferred way is to first parse the user input into components, which returns basically a dictionary or some JSON, then normalize each component in that dictionary (potentially using different options for road vs. house_number vs. city). Again, that might not be very easy in Postgres, so you may want to use libpostal from a general-purpose language like Python and format simple SQL queries to send to Postgres rather than trying to do all of this in SQL.

For geocoding you'll always have to work with missing data. A clause like normalized is null or db_field is null or normalized = db_field should suffice to make sure that none of the given fields differ, even though there might be some missing information.
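
The same idea in plain Python, as a minimal sketch (the function and field names here are hypothetical):

def fields_compatible(normalized, db_field):
    # A field matches if either side is missing or both sides agree.
    return normalized is None or db_field is None or normalized == db_field

def addresses_compatible(query, candidate, fields=("road", "city", "state")):
    # Two parsed addresses match if no shared field actively disagrees.
    return all(fields_compatible(query.get(f), candidate.get(f)) for f in fields)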

linuzer commented 7 years ago

Thanks a lot for your great help! I will try this special parser tomorrow.

My problem is that I regularly have to geocode several hundred thousand customer addresses in different countries. So my idea was to get the OSM data for each country, extract the addresses into a database (I have already done this), and normalize the OSM addresses using libpostal, then normalize the customer addresses in the same way. After that I was hoping to be able to simply join the customer addresses with the OSM geocodes. That's why I'm using a database.

So, if I understand you correctly, you're suggesting I should first libpostal-parse the OSM addresses, filling the results into several connected tables which I can then libpostal-normalize. But what I don't understand is: doesn't libpostal.normalize need the full address again? So how could I normalize, let's say, only the city, or the street?

Sorry, if this is getting too off-topic, you can also privately email me: nominatim@tscholz.net

albarrentine commented 7 years ago

There's a new global training set building now (the one I used for the Russia parser had a few issues with US and UK place names among others, so is being rebuilt). When that's done I'll train a new global model using the master branch, which should be better for accented languages.

postal_normalize, or in C expand_address, returns an array of values (which can be thought of as an unordered set), and every item in that set is potentially important. The query above uses only the first value in the array, which is fine in some cases (the abbreviation "ул." always means one thing: "улица") but not in others ("St" in English can mean "Street" or "Saint" and libpostal does not rank them, it just returns both permutations). Basically the query needs to be a set intersection rather than an equality test.

There are a few ways to handle things like this in a database. One is an "alternate_names" table a la GeoNames that has a one-to-many relationship with the base table. The other, in Postgres anyway, is to store everything as arrays of strings rather than single strings and use something like unnest to do the join (that wouldn't be terribly efficient because it can't use indexes). Or a combination of both. If the input addresses are temporary and smaller, it might be easier to just store arrays whereas the OSM table needs to be indexed and so might benefit from the alternate names approach.
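
To make the alternate-names idea concrete, a rough sketch in plain Python before committing to a schema (the data and names here are illustrative):

from postal.expand import expand_address

# One canonical OSM row plus a one-to-many set of normalized
# alternate names, a la GeoNames.
osm_roads = {
    1: {"canonical": "улица антонова",
        "alternates": set(expand_address("улица антонова"))},
}

def matching_road_ids(query_road):
    # Set intersection, not equality: any shared expansion is a match.
    query_alternates = set(expand_address(query_road))
    return [rid for rid, rec in osm_roads.items()
            if query_alternates & rec["alternates"]]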

linuzer commented 7 years ago

Thank you very much for your support! I'll try it when it is online.

linuzer commented 7 years ago

There's still a detail which I don't quite understand. You wrote:

For normalization, the preferred way to do it is to first parse the user input into components, which returns basically a dictionary or some JSON, then normalize each component in that dictionary

I did the first part and parsed the OSM addresses into a table with the individual tags as columns. But what do you mean by "normalize each component in that dictionary"?

I could imagine (and please correct me if I'm wrong) that I have to apply the address template from https://github.com/OpenCageData/address-formatting/blob/master/conf/countries/worldwide.yaml, which, if I looked it up correctly, is the following for Russia:

generic10: &generic10 |
        {{{attention}}}
        {{{house}}}
        {{{road}}} {{{house_number}}}
        {{{suburb}}}
        {{#first}} {{{city}}} || {{{town}}} || {{{village}}} {{/first}}
        {{{state}}}
        {{{country}}}
        {{{postcode}}}

Then make a single string out of the components and pass it to the normalization function, saving the result as the basis used for the geocoding.

The full set of returned libpostal tags is: house, house_number, road, suburb, city_district, city, state_district, state, postalcode, country

In order to match the two together, I'm not sure what to do with "city_district" and "state_district", since they don't appear in the template. If I throw them away, I might lose valuable information which I could need later in the geocoding process. But how should I fit them into the template? Or am I interpreting your explanation completely wrong?

Thanks for any help!

albarrentine commented 7 years ago

No need to worry about formatting at all. The normalize function (expand_address) can handle any UTF-8 string; it doesn't need to be a complete address and could be just the street name or any other single component.

  1. Parse the address
  2. For each component, pass the string value unmodified into expand_address/postal_normalize (i.e. there should be N calls to expand_address, one for each component)
  3. Each call to expand_address will return a list of normalized strings for that component

Here's a quick implementation in Python:


from postal.parser import parse_address
from postal.expand import expand_address
from collections import defaultdict

# Collect the normalized variants of each component, keyed by its tag.
normalized_components = defaultdict(set)

input = "123A Main Street New York NY"

# Parse once, then expand/normalize each parsed component separately.
for value, tag in parse_address(input):
    norm_values = expand_address(value)
    normalized_components[tag].update(norm_values)

This would produce (printing the value of normalized_components above as JSON):

{
    "house_number": [
        "123a"
    ], 
    "city": [
        "new york"
    ], 
    "state": [
        "ny", 
        "new york"
    ], 
    "road": [
        "main saint", 
        "main street"
    ]
}

linuzer commented 7 years ago

Thanks a lot for that detailed explanation and the outline; that cleared up a lot! I have now implemented this logic in PostgreSQL and can parse and normalize both the OSM addresses and my query addresses (the ones to be geocoded) into different tables. A (not so) simple query brings the matches together, sorted by the best matches.

The problem is that almost nothing can actually be matched, and in most cases the reason is that the libpostal parser still does not seem to recognize the correct address parts very well, although I do see improvements in the version you checked in last week.

My question now is: should I rather wait for the big upcoming release, or is it worth putting together some detailed examples that you could analyze?

I also wanted to extend the tests to another country, probably Germany, in order to see whether it is just down to some Russian "specialties".

albarrentine commented 7 years ago

So I'm clear:

  1. Are the OSM addresses also getting normalized? How are the OSM fields classified? It may be that there's some kind of mismatch between an OSM field and what it's called in libpostal, or you may have to check, say, a libpostal suburb against both suburb and city in OSM.
  2. Does the query allow some components to be NULL, or does it require that every field in OSM also be present in the input address? Sparsity can easily cause lots of non-matches, so using something like the clause described above will help: in this case it would be a left join on a = b, and then in the where clause you can specify combinations of joins that need to be not null for a valid match (house_number, street, and city, for example).

Try to rule out whether the query can be improved first, QA a few hundred addresses and see if you can find some combination of fields that match. If the only factor is a parser mistake, then feel free to post some examples of the input address and the OSM address.

IIRC, the parser got very high (> 99%) held-out accuracy on Russian-language addresses. That's a measure of the parser's generalization, i.e. how well it should do on addresses it hasn't seen before. If it's doing significantly worse than that, there might be something else wrong with encoding (easy to fix), or your addresses might look very different from the ones libpostal is trained on (we might be able to improve the training data if it's not representative of real addresses), or there might be some obscure Russian place names that are simply missing from OSM (in that case you'll need to add a bunch of places to OSM for it to work well).

I'd wait for an official release of the global parser for testing on Germany as well.

linuzer commented 7 years ago

Yes, I parse both the OSM addresses and my query addresses and then normalize each component separately, storing the variations in separate tables that stay linked to the original address. And yes, of course, I neither require nor expect that an address matches on all its components, so parts of it are allowed to be NULL. I just match what I can.

I have now put together a set of about 100 addresses, attached as a CSV file. It shows the 4 stages of an address. For clarification, here's the order of the involved parts:

  1. osm_import: the address as it comes from OSM (house_number; street_name; postcode; local_admin; city; state; country)
  2. osm_address_libpostal: the OSM address parsed, but not yet normalized (house_number; house; road; postcode; city; city_district; state; state_district; country)
  3. query_address_libpostal: the query addresses parsed, but not yet normalized (house_number; house; road; postcode; city; city_district; state; state_district; country)
  4. query_address: the query address as imported (street; postcode; locality; city; district; region; country)

The rows are limited to the 5 best-matching rows per address.

The first columns show how the geocoder was able to match the normalized variations of each component, indicating on which part it could find a matching address. If the part is in parentheses, it matched ONLY that part (e.g. the postcode) and nothing else. If it is without parentheses, it also matched the higher levels (e.g. for the postcode, also the city or city_district, the state or state_district, and the country).

So, looking at the results at a high level, you can see that in the vast majority of cases I can only match a single attribute, although on the other hand there are addresses where I can match the full address including the house_number (and everything above it: state, city, road).

Looking a little more closely, it turns out that the problem does not seem to be on the query-address side - and that's really surprising, because these addresses come directly from our customer's database, so neither OSM nor any other address database on the Internet could ever have seen them! Rather, the problem seems to be on the OSM side. You can pick almost any address where only one attribute could be matched and you will find strange parsing results on the OSM side (comparing the first two columns).

So to me this clearly seems to be a parsing problem inside libpostal. I would already be happy if I could match something over 50% at the road level (for comparison: Nominatim can match some 40% of those addresses to the road level) - not to mention 99%, which I think is simply impossible for real-world addresses. With real addresses you have a whole bunch of other problems: how do you verify the response from any geocoder if you don't know for sure where the address really is? And any address you find on the Internet that already has a known geolocation assigned to it has already passed through some kind of normalization process.

So I'm not really sure where the problem is, but I know libpostal is a very smart approach to geocoding and I would love to get better results. Do you have any clue?

Regards,

Tom

result.txt

albarrentine commented 7 years ago

Ok, I'm still not clear on what it is you're doing to the OSM addresses. It sounds like you're formatting the OSM addresses with OpenCage, then parsing them with libpostal, but this is not necessary since the OSM addresses come pre-parsed. So all that's needed is to map the OSM fields like addr:housenumber, addr:street, etc. into libpostal fields (house_number, road, etc.) and then only normalize them with no parsing.
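
As a minimal sketch of that map-then-normalize path, assuming the extract provides standard addr:* fields (the mapping below is an illustrative subset, not libpostal's full schema):

from postal.expand import expand_address

# Illustrative subset of an OSM -> libpostal field mapping.
OSM_TO_LIBPOSTAL = {
    "addr:housenumber": "house_number",
    "addr:street": "road",
    "addr:city": "city",
    "addr:postcode": "postcode",
}

def normalize_osm_record(tags):
    # The OSM record is already parsed, so skip the parser entirely:
    # map each field to a libpostal label and normalize it on its own.
    out = {}
    for osm_key, label in OSM_TO_LIBPOSTAL.items():
        if osm_key in tags:
            out[label] = set(expand_address(tags[osm_key]))
    return out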

Also I can't replicate your setup, and have no idea what the semi-colon delimited fields are supposed to be in that CSV. Can you send examples from the command-line client of addresses that libpostal gets wrong and what their expected parse should be so I can reproduce it?

If it's definitely something wrong with libpostal, I'll look into it, but 50% accuracy on OSM addresses in a well-represented language like Russian for a model that gets close to 99% held-out accuracy is just not very likely.

linuzer commented 7 years ago

Well, you can reproduce it; maybe I wasn't clear enough, and I just realized that I also used the semicolon inside the columns, so yes, sorry, that's really confusing! The first row gives you the column headings, so there are 7 columns. The first 2 are just internal columns; ignore them. The third column gives the geocoder matching level, and the last 4 columns show the addresses: first the raw OSM address, then the parsed OSM address, and at the end the same for the query addresses.

To reproduce: take the string of the 4th column (it is enclosed in "") and pass it on the command line to ./address_parser, and you get the components that make up the string in the 5th column (also enclosed in ""). If you do this for rows where the 3rd column says "(postcode)", you will see that the result gets parsed wrongly.

linuzer commented 7 years ago

I thought the general procedure was already clear, but here it is:

I extract the addresses from russia-latest.osm.pbf using this https://github.com/kiselev-dv/gazetteer tool, which brings together all hierarchical parts of an OSM address (street, postcode, city, admin levels, state, etc.). The result is a huge JSON file, which I import into PostgreSQL using https://github.com/lukasmartinelli/pgfutter. That is what I put as the "raw OSM address" in the attached file above. From there I parse the addresses with libpostal, then normalize them.

Which tools are you using to extract OSM addresses from PBF files?

linuzer commented 7 years ago

Here's the same file again, but simpler: just 3 columns, semicolon-separated, with the addresses comma-separated. So it's the matching level, the raw OSM address and the parser result. result.txt

albarrentine commented 7 years ago

The OSM addresses seem to work perfectly on the command-line (using the master branch and the default model):

> Краснооктябрьская улица 40 городской округ Майкоп Майкоп Адыгея Южный федеральный округ 385000

Result:

{
  "road": "краснооктябрьская улица",
  "house_number": "40",
  "city_district": "городской округ майкоп",
  "city": "майкоп",
  "state_district": "адыгея",
  "state": "южный федеральный округ",
  "postcode": "385000"
}

> улица Гоголя 112 городской округ Майкоп Майкоп Адыгея Южный федеральный округ 385000

Result:

{
  "road": "улица гоголя",
  "house_number": "112",
  "city_district": "городской округ майкоп",
  "city": "майкоп",
  "state_district": "адыгея",
  "state": "южный федеральный округ",
  "postcode": "385000"
}

So it's just a question of what's different between the above and what's being passed in Postgres (and that part you'll have to figure out).

The larger point is also that parsing OSM addresses with libpostal shouldn't be necessary at all because it's already pre-parsed. I don't know what kind of formatting https://github.com/kiselev-dv/gazetteer (which was never mentioned before) is doing, so maybe their format does need to be parsed, but generally if the CSV you ingest has separate fields for street, house_number, city, etc. it should be easy enough to just map those fields to libpostal's fields and not use the parser at all.

In libpostal's ingestion we just take the raw OSM fields and map them to our schema using these mappings in master and these in parser-data. In parser-data there's a ton of additional preprocessing that goes on to try to mimic real-world addresses including abbreviations, adding sub-building information like unit, floor, staircase, etc. but that doesn't change the field mappings.

linuzer commented 7 years ago

OK, maybe I still have a deeper misconception here.

I'll take this example (not because I don't like the result above, but because in this example my query address is of really poor quality, so there is really no better match possible. Btw, this is one reason why 99% is just impossible: lots of real-life addresses are of such poor quality that finding the correct match is like playing roulette... And of course, that's not libpostal's problem!):

OSM-Address: улица Антонова;185033;Петрозаводский городской округ;Республика Карелия;Северо-Западный федеральный округ

{
  "road": "улица антонова",
  "postcode": "185033",
  "city_district": "петрозаводский городской округ",
  "state_district": "республика карелия",
  "state": "северо-западный федеральный округ"
}

Normalizing the Road:

улица антонова
ulitsa antonova

Normalizing the City_district:

петрозаводскии городскои округ
petrozavodskii gorodskoi okrug
petrozavodskij gorodskoj okrug

Now the query-address: Антонова;185033;;Петрозаводск;;Карелия;Российская Федерация

{
  "road": "антонова",
  "postcode": "185033",
  "city": "петрозаводск",
  "state": "карелия",
  "postcode": "российская федерация"
}

Normalizing the Road:

антонова
antonova

Normalizing the City:

петрозаводск
petrozavodsk

If I now try to match the road and the city together, it fails because they get normalized in different ways. I can accept the differences in the city: OSM provides only the city district and I have only the city, so that cannot match. But why does the road get parsed and/or normalized differently? Is this supposed to be correct? I thought (and maybe I'm wrong here) that libpostal takes care to normalize every part to a standardized string, so that I can match these standard strings.

Sorry if I sound annoying; that is really not my intention. I just want to understand it and use it in the best way possible.

Thank you for your help!

albarrentine commented 7 years ago

Ah, so the road in the query is "антонова" whereas in OSM it's "улица антонова".

In a full-text search engine that would be a strong match because a word like "улица" is very frequent and would have a low IDF score, but in terms of direct string comparison it doesn't.

IIRC, in Russia and many Eastern European countries almost every street name begins with "улица", so it's possible to omit it. Spanish has a similar convention with "Calle". Libpostal normalization doesn't directly handle stripping very frequent prefixes and the like; it just makes sure that abbreviations and such normalize to the same thing. It might be possible to strip thoroughfare types in certain languages for the next release.

Two more immediate options could be:

  1. Convert the fuzzier fields like road or venue to Postgres full-text fields
  2. Strip words like "улица" and its variants off of the OSM addresses when doing equality comparisons so they'll match the query addresses (see the sketch below).
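
A rough sketch of option 2 in Python (the word list is illustrative, not exhaustive, and real handling would need to be language-aware):

# Hypothetical helper: strip common Russian thoroughfare words before
# comparing normalized road names.
THOROUGHFARE_WORDS = {"улица", "ул", "ulitsa", "проезд", "переулок"}

def strip_thoroughfare(road):
    tokens = road.lower().replace(".", " ").split()
    return " ".join(t for t in tokens if t not in THOROUGHFARE_WORDS)

# "улица антонова" and "антонова" now compare equal.
assert strip_thoroughfare("улица антонова") == strip_thoroughfare("антонова")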

I'm soon building an address/venue deduper that will, given two parsed addresses, try to return whether they're dupes or not and that will probably do a bit of spelling correction, dictionary word stripping, IDF scoring, etc. but it's not implemented yet.

linuzer commented 7 years ago

In a full-text search engine that would be a strong match because a word like "улица" is very frequent and would have a low IDF score, but in terms of direct string comparison it doesn't.

Yes, if I know I have to do it, I can handle it, but I just thought that was the purpose of libpostal...

Also, leaving out the parser completely is definitely an option, but even in OSM it is not always clear which part of an address is stored in which attribute... not to mention my customer's addresses, where I do need the parsing.

I'll give you a couple of other examples where there might be a problem with parsing and/or normalization:

Query-Address:

Петрозаводск Карелия Российская Федерация

{
  "city": "петрозаводск",
  "state": "карелия",
  "postcode": "российская федерация"
}

The "postcode" is actually the country.

linuzer commented 7 years ago

Автолюбителейроезд Петрозаводск Карелия Российская Федерация 185013

{
  "house": "автолюбителейроезд",
  "city": "петрозаводск",
  "state": "карелия",
  "country": "российская федерация",
  "postcode": "185013"
}

The "house" is the road.

linuzer commented 7 years ago

That's kind of a special thing:

мост Белинского;Дворцовый округ;Санкт-Петербург;Северо-Западный федеральный округ

Result:

{
  "house": "мост",
  "road": "белинского",
  "suburb": "дворцовый округ",
  "city": "санкт-петербург",
  "state": "северо-западный федеральный округ"
}

The "house": "мост" means "bridge" and is therefore part of the street-name. ...not a huge problem!

albarrentine commented 7 years ago

Libpostal's goal is to help build "smarter, more international geocoders using the vast amounts of local knowledge in open geographic data sets." Geocoders usually do use a full-text search engine, usually set their own matching thresholds, etc. Libpostal makes it much easier for traditional search engines designed for text documents to handle addresses, which have many abbreviations, etc. but it's not a search engine and it shouldn't be expected to entirely solve every problem in geocoding. As I said, there's a new project starting soon for deduping addresses/venues which would handle the direct equality comparison case better.

The country case has to do with an error in libpostal's trie search that's been fixed for the new release; it can affect certain multi-word tokens, like country names, that almost always occur at the end of the string.

For the other two, house/road is the most common confusion in libpostal's "confusion matrix", which details the types of errors the parser makes: on held-out training data it's never seen before (a way to approximate how well it will perform on real-world addresses it will encounter in the wild), the most common error is predicting house when road was correct and vice versa. That's been true even in the new, improved models, and it's probably the hardest part of address parsing, as there are many venue names and roads not captured by OSM. The word "автолюбителейроезд" never occurs in OSM, so the model is basically taking an educated guess from its surrounding words, and "house" is not unreasonable in those cases. "мост белинского" only occurs in OSM for the bridge itself, and we currently don't train on simple road names with no addresses, though that is planned. Any address on the other side of the bridge is going to use "улица белинского", not "мост белинского", so that's probably why it would initially predict house for the first token (from other bridges it's seen where мост comes first, which as far as I understand is uncommon in Russian; usually it's e.g. "Дворцо́вый мост"), then see "белинского" and predict road.

If OSM doesn't capture the real world in some way, it will probably be reflected in libpostal. Hopefully that's an incentive for people to continually work on improving OSM :-). The "10 minute rule" for libpostal should be: if you get a bad parse, try looking up the address in OSM. If the address does not exist, and there are no other addresses like it on that street, or that street doesn't exist, or that city doesn't exist, then the best recourse is to edit OSM. Libpostal's only ever going to be as good as its training data.

linuzer commented 7 years ago

OK, so now I know that I'm using libpostal the way it was intended and that right now it is as good as it gets. The remaining problems I will have to take care of in a different way, which is absolutely OK - and of course I will also contribute to OSM if I find problems there. Anyway, I'll stick with the libpostal project, because you have done a great job so far and I'm very confident that the next release(s) will further improve the overall performance and quality. By the way, do you already have an idea when the next big release will appear?

Thank you very much for your great help so far; it pushed me forward quite a big step and I appreciate it a lot! If I get my geocoder to a reasonable quality, I might even publish it eventually.

Kind Regards, Tom

albarrentine commented 7 years ago

Hey Tom - libpostal 1.0 was released earlier today. The new parser has higher overall accuracy around the world, a more powerful machine learning model, and new tags for things like apartment numbers. It gets most if not all of the above examples correct:

> центральная улица 24 заводоуковскии городскои округ тюменская область уральскии федеральныи округ

Result:

{
  "road": "центральная улица",
  "house_number": "24",
  "state_district": "заводоуковскии городскои округ",
  "state": "тюменская область",
  "country_region": "уральскии федеральныи округ"
}

> улица Антонова;185033;Петрозаводский городской округ;Республика Карелия;Северо-Западный федеральный округ

Result:

{
  "road": "улица антонова",
  "postcode": "185033",
  "state_district": "петрозаводский городской округ",
  "state": "республика карелия",
  "country_region": "северо-западный федеральный округ"
}

> Петрозаводск Карелия Российская Федерация

Result:

{
  "city": "петрозаводск",
  "state": "карелия",
  "country": "российская федерация"
}

> Автолюбителейроезд Петрозаводск Карелия Российская Федерация 185013

Result:

{
  "road": "автолюбителейроезд",
  "city": "петрозаводск",
  "state": "карелия",
  "country": "российская федерация",
  "postcode": "185013"
}

> мост Белинского;Дворцовый округ;Санкт-Петербург;Северо-Западный федеральный округ

Result:

{
  "road": "мост белинского",
  "suburb": "дворцовый округ",
  "city": "санкт-петербург",
  "country_region": "северо-западный федеральный округ"
}

There are also a few Russian-specific things we add to the training data to help with parsing (probably doesn't affect OSM addresses as much, but does affect addresses from the wild):

pavel-avilov commented 2 years ago

@linuzer hi, can you tell me how you downloaded the russian model from https://libpostal.s3.amazonaws.com/parser_samples/parser_ru.tar.gz ? I don't have permission to download from this link. Or if you have it downloaded and you could share this model, I would be very grateful!