openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

duplicate postcodes in Indian address #275

Closed aakashkag closed 6 years ago

aakashkag commented 6 years ago

parse_address("""Date of Birth: 22/07/1981 Current Address: C/O Mr. ds Singh , C3 / 364,sd No. 3 , 2ndPusta, Sonia Vihar Nr: Ramnath Model School, Delhi 110094""")

(u'2ndpusta', u'postcode')

albarrentine commented 6 years ago

So in general, it's best not to feed libpostal non-address text (like "Date of Birth", etc.), as the model is not trained on that kind of data. It would be possible to train a model to handle this type of text, but that's not the choice we made in training this one. If all of the records share the same structure, you could extract the address first with a regex like "^.*?Current Address: (.*)".
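If every record really does contain a literal "Current Address:" label as in the example above, a minimal extraction sketch (the field name is assumed from that one example) might look like:

```python
import re

record = ("Date of Birth: 22/07/1981 Current Address: C/O Mr. ds Singh , "
          "C3 / 364,sd No. 3 , 2ndPusta, Sonia Vihar Nr: Ramnath Model School, "
          "Delhi 110094")

# Capture everything after the "Current Address:" label; fall back to the
# whole record if the label is missing.
match = re.search(r"Current Address:\s*(.*)", record)
address = match.group(1) if match else record
print(address)
```

The captured string can then be passed to `parse_address` on its own.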

Also note that here the parser does get the correct postcode as well:

  "road": "c/o mr. ds singh c3 / 364 sd no.",
  "house_number": "3",
  "postcode": "2ndpusta",
  "house": "sonia vihar nr ramnath model school",
  "city": "delhi",
  "postcode": "110094"

I'd just say it doesn't know what to do with "2ndPusta", which is really two words. This library does not include a spelling-correction model (users can apply one as a preprocessing step if they wish), so it does the best it can with misspellings, but they may result in errors.
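As a rough illustration of that kind of preprocessing, here is a narrow, hypothetical heuristic (not part of libpostal) that re-inserts the missing space when an ordinal like "2nd" runs directly into a capitalized word:

```python
import re

def split_ordinal_runs(text):
    # Insert a space when an ordinal suffix (1st, 2nd, 3rd, 4th, ...) is
    # immediately followed by a capital letter, e.g. "2ndPusta" -> "2nd Pusta".
    # A narrow heuristic for this one failure mode, not a general
    # spelling-correction model.
    return re.sub(r"(\d+(?:st|nd|rd|th))(?=[A-Z])", r"\1 ", text)

print(split_ordinal_runs("sd No. 3, 2ndPusta, Sonia Vihar"))
```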

As far as I can tell, "2nd Pusta" seems to be part of a block scheme in Sonia Vihar. Unfortunately that area of Delhi has not been mapped in much detail in OpenStreetMap, which is our only data source for India at present (if you know of others with open licenses, please share), so even with the corrected version, the parser would classify those tokens as part of the house:

  "road": "c/o mr. ds singh",
  "house_number": "c3 / 364",
  "house": "sd no. 3 2nd pusta sonia vihar nr ramnath model school",
  "city": "delhi",
  "postcode": "110094"

I think this parse is probably the best we can do given the input, the complexity of the address, and the limited training data in the region.

aakashkag commented 6 years ago

Ok cool. What algorithm are you using to parse addresses?

albarrentine commented 6 years ago

The model is a linear-chain Conditional Random Field (my own implementation), which incorporates word features, tag transitions, and joint word-tag features to label a sequence of tokens X with a sequence of labels Y. A CRF chooses among every possible sequence of labels globally, whereas something like a Hidden Markov Model or Maximum Entropy Markov Model makes the best local decision at each timestep. The linear-chain structure keeps inference tractable, as it's only quadratic in the number of labels, which is usually quite small. The training algorithm is the averaged perceptron, an error-driven procedure that massively speeds up training in practice and yields sparser parameters without much loss in accuracy (though it loses the ability to predict normalized probabilities, producing only scores/ranks).
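As a rough sketch of what linear-chain inference looks like (this is an illustrative NumPy toy, not libpostal's actual C implementation), Viterbi decoding over per-token emission scores and a label-transition matrix runs in O(n · L²), quadratic in the label set and linear in sequence length:

```python
import numpy as np

def viterbi(emissions, transitions):
    """Decode the highest-scoring label sequence for a linear-chain model.

    emissions:   (n_tokens, n_labels) per-token label scores
    transitions: (n_labels, n_labels) score for moving from label i to label j
    """
    n, L = emissions.shape
    score = np.empty((n, L))      # best score of any path ending in label j at step t
    back = np.zeros((n, L), int)  # backpointers for path recovery
    score[0] = emissions[0]
    for t in range(1, n):
        # All L x L candidates: prior path score + transition + emission.
        cand = score[t - 1][:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0)
    # Follow backpointers from the best final label.
    path = [int(score[-1].argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: 3 tokens, 2 labels, no transition preference.
emissions = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
transitions = np.zeros((2, 2))
print(viterbi(emissions, transitions))  # -> [0, 1, 0]
```

The averaged perceptron then adjusts the feature weights behind `emissions` and `transitions` whenever this decoded sequence disagrees with the gold labels.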

The training data includes over 1 billion tagged token sequences for addresses around the world. This data set is publicly available for research, as detailed in the documentation.