libpostal for Indian Address

satyam-saxena commented 6 years ago

Hi,

I was trying to use the libpostal Python API for Indian addresses. Based on my observations, libpostal performs poorly on Indian addresses. Here are some examples.

In [11]: parse_address('Flat 1207, Hill Ridge Villas, ISB Road, Gachibowli')
Out[11]: [(u'flat 1207', u'unit'), (u'hill', u'house_number'), (u'ridge villas isb road gachibowli', u'road')]

In [14]: parse_address('E2017, Nandan Kanan Villa ISB Road Gachibowli')
Out[14]: [(u'e2017 nandan kanan villa', u'house'), (u'isb road gachibowli', u'road')]

Are there any changes I can make in the settings, or can I add training data that could improve performance specifically for Indian addresses?

A detailed description would be really helpful, as I am new to this library.

albarrentine commented 6 years ago

The above errors may have more to do with structure than anything. Gachibowli is a suburb, and the current version of the parser was not trained with that many examples of using suburbs without a city, so adding "Hyderabad" to the end of the string improves both of those parses. Classifying "hill" as house_number also seems related to structure, like the parser really wants to see a house_number before it transitions to road in this case. There may be something structural we're not including in the training data (like maybe we throw out a class of examples that have addr:housename without addr:housenumber and that's what throws it off).
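A minimal sketch of the workaround mentioned above, assuming the caller already knows which city the address belongs to. The helper `with_city` is hypothetical and not part of libpostal; `parse_address` is the same function used in the question, and its import is guarded so the snippet still loads where the Python bindings are not installed.

```python
def with_city(address, city):
    """Append `city` to `address` unless it is already mentioned (hypothetical helper)."""
    if city.lower() in address.lower():
        return address
    # Drop any trailing commas/spaces before appending the city.
    return address.rstrip(" ,") + ", " + city


if __name__ == "__main__":
    query = with_city("Flat 1207, Hill Ridge Villas, ISB Road, Gachibowli", "Hyderabad")
    try:
        from postal.parser import parse_address  # libpostal Python bindings
    except ImportError:
        print("postal not installed; would parse:", query)
    else:
        print(parse_address(query))
```

This only changes the input string; whether the resulting parse is correct still depends on the model.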

There are no country-specific settings; it's one global model.

The parser is trained on about 120k Indian addresses, which does not even come close to covering all of India's population. Unfortunately, there's not much open data in India (if you know any sources of structured address data for India, I'm happy to add them to our corpus), and the OpenStreetMap data is of variable quality. As such, libpostal has relatively good coverage of toponyms like city names, etc. but lacks coverage of roads, addresses, venues, businesses, etc. Because addresses in India can be quite complex (we don't really have a system for generating landmark-based addresses from OSM at this time, nor is the landmark data even there most of the time), I would expect that libpostal does not do as well in India.

In general, a supervised machine learning model is only as good as its training data, so we probably just need more data for India to see the same level of performance there that we get in much of the rest of the world. As the data improves the parser will too. I'd recommend adding addresses that are not working for you to OSM, and we'll pick them up in the next build of the training data.

albarrentine commented 6 years ago

Also, since India is a big place, I shouldn't over-generalize. There are some cities and regions that have better mapping coverage than others.

Here are the maps of addr:housenumber (proxy for how much address data there is) in some of India's largest cities:

You can see the house numbers are much less dense in Hyderabad than in other Indian cities with comparable population, so libpostal might perform better in regions that have more data (there are probably still a few toponyms here and there that we're not mapping correctly to our nomenclature though, and the labels aren't even consistent within OSM).

As we start to look at smaller cities, the data drops off very quickly:

And smaller than that, there's often no data at all:

Karaikudi (zero addresses)
Tadipatri (zero addresses, few roads, not much building data)
Buxar (no data for most of the cities in the region)

Clicking around you can also see how inconsistently labeled the data can be in OSM in India. Sometimes addr:housenumber is e.g. "First Floor", which should have been labeled addr:floor instead, sometimes the labels for street and house name are switched, etc. For the upcoming v1.1 release, there's been some work around cleaning up/relabeling the training data a bit more. With a billion addresses it's ok to have some wrongly-labeled data in there, but we don't want libpostal to have to learn from too many inconsistent examples/patterns.
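To make the "First Floor" case concrete, here is a rough sketch of one possible relabeling rule. It is not libpostal's actual cleanup code; the function name `relabel_floor` and the regular expression are assumptions made for illustration, and a real cleanup pass would need many more patterns.

```python
import re

# Assumed pattern for values that describe a floor rather than a house number.
FLOOR_RE = re.compile(
    r"^\s*((ground|first|second|third|\d+(st|nd|rd|th))\s+floor|floor\s+\d+)\s*$",
    re.IGNORECASE,
)


def relabel_floor(tags):
    """Return a copy of `tags` with a floor-like addr:housenumber moved to addr:floor."""
    fixed = dict(tags)
    value = fixed.get("addr:housenumber", "")
    # Only move the value when no addr:floor is present, to avoid overwriting data.
    if FLOOR_RE.match(value) and "addr:floor" not in fixed:
        fixed["addr:floor"] = fixed.pop("addr:housenumber")
    return fixed


print(relabel_floor({"addr:housenumber": "First Floor", "addr:street": "ISB Road"}))
```

Ordinary house numbers such as "1207" do not match the pattern and are left untouched.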

satyam-saxena commented 6 years ago

Hi,

Thanks for your detailed discussion on this topic. As you mentioned, libpostal is trained on ~120k Indian addresses. Is there any way I can download that data for exploration?

-Thanks & Regards, Satyam Saxena

albarrentine commented 6 years ago

Yes. All the instructions for downloading the training data are in the README: https://github.com/openvenues/libpostal#training-data. Note: we're moving the training data to Internet Archive soon, so the URLs may change.

In the OSM address training set there are about 476k Indian examples, but that's because we create roughly 4 examples per OSM address.
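For exploring the download, a small sketch of counting examples per country might look like the following. It assumes each line of the training file is `language<TAB>country<TAB>tagged address`; that layout is an assumption here and should be checked against the README. The function name `count_by_country` and the sample rows are made up for illustration.

```python
from collections import Counter


def count_by_country(lines):
    """Count training examples per country code.

    Assumes each line is `language<TAB>country<TAB>tagged address`;
    verify this against the README before relying on it.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:  # skip lines that do not fit the assumed layout
            counts[fields[1].lower()] += 1
    return counts


# Placeholder rows for illustration only, not real training data.
sample = [
    "en\tin\t<tagged address 1>\n",
    "en\tin\t<tagged address 2>\n",
    "en\tus\t<tagged address 3>\n",
]
print(count_by_country(sample)["in"])

# Rough sanity check on the figures quoted above: examples / examples per address.
print(round(476_000 / 4))
```

Dividing 476k examples by roughly 4 examples per address gives about 119k source addresses, which lines up with the ~120k figure mentioned earlier in the thread. For the real file, passing an open file object as `lines` avoids loading everything into memory.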

If you know of any other data sets for India that are free/downloadable/have a relatively open license, please let me know, happy to add them.

dcai5000 commented 6 years ago

Is there any way to get the model training program so that we can try it on some non-public data sets and see how the model works?

albarrentine commented 6 years ago

@dcai5000 everything's open source, including the training program address_parser_train, which gets built on make, and the data set is available on the Internet Archive. Note that I do not provide any support or help in the case of custom parsers, as it's near impossible to debug an NLP model without knowing something about the input text. As such, I'd only recommend training custom models if someone on your team is familiar with NLP and sequence models like the Conditional Random Field we use, and knows C well enough to look over the feature extraction pieces. With those caveats, #132 toward the end has a few details about how to do it.

openvenues / libpostal

libpostal for Indian Address #304