openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

Czech address parsing error #202

Open coachwei opened 7 years ago

coachwei commented 7 years ago

When using libpostal master to parse a Czech address:

Na Pankráci 1690/125, Prague 4, 14021, Czech Republic

The results are:

{ road: na pankráci, house_number: 1690/125, road: prague, house_number: 4, postcode: 14021, country: czech republic }

The result is incorrect. However, if I try the libpostal demo from mapzen.com (see the URL below):

http://libpostal.mapzen.com/parse?address=Na%20Pankr%C3%A1ci%201690/125,%20Prague%204,%2014021,%20Czech%20Republic

The results are correct:

[{"label":"road","value":"na pankráci"},{"label":"house_number","value":"1690 / 125"},{"label":"city","value":"prague"},{"label":"house_number","value":"4"},{"label":"postcode","value":"14021"},{"label":"country","value":"czech republic"}]
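To compare outputs like these programmatically, the demo's JSON can be grouped by label. This is a minimal Python sketch that works only on the response text quoted above; the helper names `group_labels` and `repeated_labels` are hypothetical and not part of libpostal.

```python
import json
from collections import defaultdict

# Response body copied from the demo output quoted above.
demo_response = (
    '[{"label":"road","value":"na pankráci"},'
    '{"label":"house_number","value":"1690 / 125"},'
    '{"label":"city","value":"prague"},'
    '{"label":"house_number","value":"4"},'
    '{"label":"postcode","value":"14021"},'
    '{"label":"country","value":"czech republic"}]'
)


def group_labels(components):
    """Group parsed values by label, keeping the order of first appearance."""
    grouped = defaultdict(list)
    for item in components:
        grouped[item["label"]].append(item["value"])
    return dict(grouped)


def repeated_labels(grouped):
    """Return the labels that were assigned to more than one token span."""
    return [label for label, values in grouped.items() if len(values) > 1]


grouped = group_labels(json.loads(demo_response))
print(repeated_labels(grouped))  # ['house_number']
```

A repeated label such as `house_number` here is a quick signal that one of the spans ("4" in this case) may have been mislabeled.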

Is the Mapzen demo running a different version? How can we use the version deployed at Mapzen? Thank you.

albarrentine commented 7 years ago

libpostal is an open-source project independent of Mapzen, though they use it for some of their services. I'm not involved with the Mapzen API, so I'm not sure which version it's using (feel free to ask on that repo). Last I heard, it was being upgraded to 1.0.

In any case, neither of those parses is correct. "Prague 4" should actually be a city district.

This seems to be because we have the native Czech name "Praha 4" in our data but not the English name "Prague 4". This version works perfectly, for instance:

> Na Pankráci 1690/125, Praha 4 14021 Czech Republic

Result:

{
  "road": "na pankráci",
  "house_number": "1690/125",
  "city_district": "praha 4",
  "postcode": "14021",
  "country": "czech republic"
}
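Until the data is fixed, a caller could rewrite the English district name to the Czech one before handing the string to the parser. The following Python sketch is only a stopgap under that assumption; `use_czech_district_name` is a hypothetical helper, not a libpostal function, and the rewritten address would still need to be checked against the parser.

```python
import re

# Hypothetical stopgap: match "Prague <number>" (a numbered city district)
# and swap in the Czech name, which is the form present in the data.
_PRAGUE_DISTRICT = re.compile(r"\bPrague\s+(\d{1,2})\b", re.IGNORECASE)


def use_czech_district_name(address):
    """Rewrite 'Prague N' to 'Praha N'; leave everything else untouched."""
    return _PRAGUE_DISTRICT.sub(r"Praha \1", address)


print(use_czech_district_name(
    "Na Pankráci 1690/125, Prague 4, 14021, Czech Republic"
))
```

A bare "Prague" without a district number is deliberately left alone, since the issue here concerns only the numbered districts.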

The Czech city districts are not in OSM, which has the best multilingual names. We get them from Quattroshapes, which only has the Czech "Praha" versions.

We do have multilingual names in the GeoPlanet data, but the Prague city districts were incorrectly classified as suburb, not city_district. There are different rules for suburbs/neighborhoods than for city_districts. For instance, a city_district can be listed on its own in most countries, whereas a suburb/neighborhood must be listed together with a city. That means the training data (at least using the English name) would always generate "Prague 4, Prague", so the parser would probably not get "Prague 4" on its own correct at runtime.
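The effect of that classification can be sketched in Python. This is an illustration of the rule as described above, not libpostal's actual training-data code; the function name, the type sets, and the choice to emit both forms for a city_district are assumptions.

```python
# Hypothetical illustration only; not libpostal's training-data generator.
STANDALONE_OK = {"city_district"}        # may appear without the city
NEEDS_CITY = {"suburb", "neighborhood"}  # must be listed together with the city


def locality_variants(name, place_type, city):
    """Return the locality strings a generator could emit for one place."""
    with_city = f"{name}, {city}"
    if place_type in STANDALONE_OK:
        return [name, with_city]
    if place_type in NEEDS_CITY:
        return [with_city]
    raise ValueError(f"unexpected place type: {place_type}")


# Misclassified as a suburb: the bare district name never shows up.
print(locality_variants("Prague 4", "suburb", "Prague"))
# Classified as a city_district: the bare name is a valid variant.
print(locality_variants("Praha 4", "city_district", "Praha"))
```

Under these assumptions, the suburb classification yields only "Prague 4, Prague", so a model trained on it would rarely see "Prague 4" standing alone.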

This can be easily fixed for 1.1.

coachwei commented 7 years ago

Thanks. It would be great if 1.1 fixes this. I think generating the city name as well as the city_district is the perfect answer (a city_district without a city name can be difficult to handle in many applications, I think).