openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

Iranian address template is incorrect #243

Open mohsen3 opened 6 years ago

mohsen3 commented 6 years ago

The Iranian (Farsi) address template in https://github.com/OpenCageData/address-formatting/blob/master/conf/countries/worldwide.yaml is incorrect. Typically, addresses in Farsi are written in the format City/Neighborhood/Street/House, while the OpenCage format is the reverse (i.e., House/Street/.../City). I guess this is the reason for the faulty parser, since the generated training samples are incorrect.

How can I fix the problem? Is the OpenCage template pulled from its repo during the build process or is it copied to libpostal repository? I can probably make a pull request to OpenCage to fix the problem, but does it really fix the issue in libpostal as well?

Here are a few examples:

[Image: parser output for inputs 25 to 28]

Inputs 26 and 28 are parsed correctly, but the addresses are in the reverse format. Inputs 25 and 27 are in the correct format, but are parsed incorrectly.

albarrentine commented 6 years ago

Hi @mohsen3, thanks for pointing out the formatting issue. Always interested in hearing from people with local knowledge of the countries/languages we (attempt to) cover.

There are a couple of ways to fix that, although since libpostal uses a machine learning model that takes some time to train, the changes would not show up immediately. However, we do automatically use the latest version of address-formatting during training, so a pull request to that repo will definitely affect the addresses libpostal generates. If you send them a pull request in the next few days, the changes will make it into the next libpostal release, which I'm working on now.

One of the things I've added to address-formatting that might be useful is the ability to have multiple configs for each country depending on which language is being used. Examples where this is currently implemented are China, Japan, and South Korea, where the English format is the reverse of the format in the respective local languages. Something similar could be done for Farsi by adding an IR_fa and specifying an English-only format in a separate IR_en config if there are any formatting differences between the two languages.

Also, if there are multiple formats, libpostal has an internal formatting config for dealing with that case. In this config it's possible to specify rules for generating the training data like "move house before city 10% of the time for language x in country y".
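A rule of that kind can be illustrated with a small sketch. This is not libpostal's actual config format or training code; the function name `reorder_components`, the component names, and the 10% probability are assumptions made up for the example.

```python
import random

def reorder_components(components, move, before, probability, rng=random):
    """Return a copy of `components` in which `move` is placed directly
    before `before` with the given probability; otherwise keep the order."""
    order = list(components)
    if move in order and before in order and rng.random() < probability:
        order.remove(move)
        order.insert(order.index(before), move)
    return order

# Hypothetical rule: "move house before city 10% of the time".
default_order = ["city", "suburb", "road", "house_number", "house"]
rng = random.Random(0)  # seeded so the sampling is repeatable
samples = [
    reorder_components(default_order, "house", "city", 0.10, rng)
    for _ in range(1000)
]
moved = sum(1 for order in samples if order[0] == "house")
print(f"{moved} of {len(samples)} generated orders start with the house")
```

Applying the rule per training sample, rather than once per country, is what lets the generated data contain both orderings in roughly the configured proportion.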

mohsen3 commented 6 years ago

Thank you @thatdatabaseguy. I am currently reading through the second part of your Statistical NLP on OSM article and I noticed the extra config right after I posted the issue. By the way, the articles are very instructive. The extra config file is a great idea. But I am not sure how to adjust it to handle this case.

OpenCage's format for Iran is apparently wrong, but I am not sure what the right format would be. The official format suggested by the Iranian post office is neighborhood/main street/secondary street/house. I guess the easiest/fastest fix is to change Iran's template from generic17 to generic11 in the OpenCage config file.

People, however, use their own format that typically starts with an important street or square/roundabout in the city and follows the path to the destination. It's just like telling you how to get to the destination starting from a well known place. So the path might be something like:

Tehran, Enghelab square, N Karger st., Forsat-e-Shirazi st., Parvin alley, No. 33

Squares/roundabouts play an important role in the Iranian city structures. You can find them even in the middle of addresses:

Main st., Second roundabout, 4th w. st., No. 55

  1. I am not sure if it is easy to generate such samples from OSM files.
  2. Ideally, we should get a list of streets/squares as the parsed output, but I am not sure if libpostal has such a capability at all.

albarrentine commented 6 years ago

Hm, so for OpenCage I think I'd recommend using the UPU format, and then whatever alternatives need to be added we can do on the libpostal side (building names, etc.).

That would look like:

IR: 
    address_template: |
        {{{attention}}}
        {{{house}}}
        {{#first}} {{{city}}} || {{{town}}} || {{{village}}} {{/first}}
        {{#first}} {{{suburb}}} || {{{city_district}}} || {{{neighbourhood}}} {{/first}}
        {{{road}}}
        {{{house_number}}}
        {{{state_district}}}
        {{#first}} {{{province}}} || {{{state}}} {{/first}}
        {{{postcode}}}
        {{{country}}}

Or if there's a more common format that's fine too. Would need to modify the Iran test case as well.
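To see what the proposed IR template produces, here is a minimal sketch of a renderer. It is not the real address-formatting implementation: it only handles the `{{{key}}}` placeholders and `{{#first}} ... || ... {{/first}}` blocks that appear in the template above, and the component values are invented for illustration.

```python
import re

TEMPLATE = """\
{{{attention}}}
{{{house}}}
{{#first}} {{{city}}} || {{{town}}} || {{{village}}} {{/first}}
{{#first}} {{{suburb}}} || {{{city_district}}} || {{{neighbourhood}}} {{/first}}
{{{road}}}
{{{house_number}}}
{{{state_district}}}
{{#first}} {{{province}}} || {{{state}}} {{/first}}
{{{postcode}}}
{{{country}}}
"""

KEY = re.compile(r"\{\{\{(\w+)\}\}\}")
FIRST = re.compile(r"\{\{#first\}\}(.*?)\{\{/first\}\}")

def render(template, components):
    """Fill the template line by line and drop lines that end up empty."""
    def fill(text):
        # Replace each {{{key}}} with its value, or "" when it is missing.
        return KEY.sub(lambda m: components.get(m.group(1), ""), text).strip()

    def first(match):
        # Pick the first alternative that renders to a non-empty string.
        for alternative in match.group(1).split("||"):
            value = fill(alternative)
            if value:
                return value
        return ""

    lines = []
    for line in template.splitlines():
        rendered = fill(FIRST.sub(first, line))
        if rendered:
            lines.append(rendered)
    return "\n".join(lines)

# Invented example components, not taken from real data.
example = {
    "city": "Tehran",
    "neighbourhood": "Enghelab",
    "road": "N Karger st.",
    "house_number": "33",
    "country": "Iran",
}
print(render(TEMPLATE, example))
```

With these components the city comes first, followed by the neighbourhood, road, and house number, which is the ordering discussed in this thread.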

For lengthier street names, OSM does have the notion of a "parent street" (using the addr:parentstreet tag), which libpostal respects and will join the street names if that tag is present. However, that tag only appears to be used in the UK. It might be worth bringing up as a guideline in the Iran OSM community and adding that tag to some addresses. However, the parser should generally be able to handle multiple street names just fine. Even in the examples above, the parser can correctly identify the entire street name block, though it wouldn't split up each street, just identify that "enghelab square n karger st. forsat-e-shirazi st. parvin alley" is one continuous block of road tokens.

In the training data, each token is tagged with a simple class like road. In some NLP domains it's advantageous to label the tokens with something like IOB tags (e.g. Enghelab/B-road square/I-road N/B-road Karger/I-road st./I-road Forsat-e-Shirazi/B-road st./I-road Parvin/B-road alley/I-road) so the parser could identify the beginning of each distinct road name. However, since we don't have too many examples with multiple road names, that would almost certainly degrade performance in these cases because the transition I-road (continuation of a road name) followed by B-road (beginning of a new road name) would be very unlikely, so it would probably cause the parser to want to transition to a new label like B-house, and it also might break up road names in the wrong places (like predicting that "N" is a post-directional e.g. Enghelab square N).
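The difference between the two labeling schemes can be shown with a short sketch. It is illustrative only: the helper names `flat_tags` and `iob_tags` are hypothetical, and tokens are split on whitespace, which is simpler than libpostal's real tokenizer.

```python
def flat_tags(phrases, label):
    """Tag every token with the same simple class, e.g. 'road'."""
    return [(token, label) for phrase in phrases for token in phrase.split()]

def iob_tags(phrases, label):
    """Tag the first token of each phrase B-<label> and the rest I-<label>."""
    tagged = []
    for phrase in phrases:
        for i, token in enumerate(phrase.split()):
            prefix = "B" if i == 0 else "I"
            tagged.append((token, f"{prefix}-{label}"))
    return tagged

roads = ["Enghelab square", "N Karger st.", "Forsat-e-Shirazi st.", "Parvin alley"]

print(" ".join(f"{tok}/{tag}" for tok, tag in flat_tags(roads, "road")))
print(" ".join(f"{tok}/{tag}" for tok, tag in iob_tags(roads, "road")))
```

In the IOB output every `B-road` after the first one follows an `I-road`, which is the transition described above as very unlikely in the training data.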

freyfogle commented 6 years ago

Hi, Ed from OpenCage here. Happy to make whatever changes to address-formatting you guys think are correct for IR, please just submit a PR with relevant tests

albarrentine commented 6 years ago

@freyfogle sent a PR your way!

mohsen3 commented 6 years ago

Thank you guys for your response.

@thatdatabaseguy your PR was too fast :-D I was consulting with my colleagues about the issue. I also opened an issue in the OpenCage project this morning. As I mentioned in the other issue, Iranian Farsi addresses are similar to the Korean ones (the postal code comes last).

# South Korea - Korean
KR_ko:
    address_template: |
        {{{country}}}
        {{#first}} {{{state}}} {{/first}}
        {{#first}} {{{city}}} || {{{town}}} || {{{village}}} {{/first}}
        {{#first}} {{{suburb}}} || {{{city_district}}} || {{{neighbourhood}}} {{/first}}
        {{{road}}}
        {{{house_number}}}
        {{{house}}}
        {{{attention}}}
        {{{postcode}}}

The UPU format for English is probably our best bet. I didn't know such a thing exists.

albarrentine commented 6 years ago

@mohsen3 ok no worries, updated PR with https://github.com/OpenCageData/address-formatting/pull/40/commits/297134a6f8182983d754cd09912552e3160a8c4d

mohsen3 commented 6 years ago

Great! Thank you! I suppose this will greatly improve the parsing accuracy. Let's wait for the next version to see how it works.

About splitting multiple streets into pieces: I think having the names separated is important in some applications, such as geocoding. I'll try to spend some time on Iran's OSM to see if I can add the parent-street tag to at least a few streets (I am not actually familiar with OSM, so it may already have some such tags?).

albarrentine commented 6 years ago

@mohsen3 it should improve parsing of Iranian addresses in standard formats.

Adding addr:parentstreet wouldn't hurt, but even so, splitting street names is not really feasible in this project for four reasons:

  1. for the parser to be any good at splitting, we would need thousands and thousands of high-quality examples in many different cities
  2. it would make the model larger as now it has to distinguish between two different tags for street
  3. it would make the model less accurate in the rest of the world, where the vast majority of addresses that have street names only have one street name
  4. there's no guarantee it would even work well on cases like the above due to the ambiguities I mentioned before ("Enghelab square N").

As such I can't justify adding this to the model. What we can do for that case is retain commas when they occur between phrases with the same tag, so you'd get road="Enghelab square, N Karger st., Forsat-e-Shirazi st., Parvin alley" and could split the result on commas if they were present in the input. If they weren't, you'd have to rely on your search engine, build a gazetteer of valid street/square names yourself and do the lookups, or, if there's a user interface, provide some UI guidance when the address returns no results.
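On the caller's side, the comma split could look like the sketch below. It assumes the parser has already returned a road value with the commas retained, as proposed above; the helper name `split_road` is hypothetical and is not part of libpostal.

```python
def split_road(road):
    """Split a parsed road value into individual names on commas,
    trimming whitespace and skipping empty pieces."""
    return [part.strip() for part in road.split(",") if part.strip()]

road = "Enghelab square, N Karger st., Forsat-e-Shirazi st., Parvin alley"
for name in split_road(road):
    print(name)
```

When the input had no commas, the function returns the whole block as a single item, which is the case where a gazetteer lookup or UI guidance would be needed instead.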