openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

telephone number support #99

Open ghost opened 8 years ago

ghost commented 8 years ago

Hello, I don't see telephone number support. I'm trying this with Indian addresses and Indian phone numbers. How do we add a new section? Also, Indian addresses include an area within each city. Can we train our model for such areas, and if so, how?

albarrentine commented 8 years ago

Phone numbers have a much simpler structure than addresses and can be extracted or removed with a regex before parsing. Our tokenizer/lexer already includes such a regex, so a phone number (it might need the "+91" country code in this version) should be treated as a single token when it's encountered. If it is part of the "address" input, though, it will be classified as one of the existing parser labels. I suppose we could create a phone_number label and assign it to any token that matches that regex, independently of the rest of the parser.
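As a rough illustration of the "extract with a regex before parsing" approach, here is a minimal sketch using POSIX regex.h. The pattern and the split_phone helper are hypothetical and are not taken from libpostal or scanner.re; the pattern assumes an optional "+91" prefix followed by a 10-digit number starting with 6-9, and would need adjusting for landlines or other formats.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical pattern (NOT the one in libpostal's scanner.re):
 * optional "+91" country code, optional space or hyphen, then 10 digits
 * whose first digit is 6-9. */
static const char *PHONE_PATTERN = "(\\+91[- ]?)?[6-9][0-9]{9}";

/* Copies the first phone-number match in `input` into `phone`, and copies
 * `input` with that match removed into `address`.
 * Returns 1 if a number was found, 0 if not, -1 on a regex error. */
int split_phone(const char *input, char *address, size_t address_size,
                char *phone, size_t phone_size)
{
    regex_t re;
    regmatch_t m;

    if (regcomp(&re, PHONE_PATTERN, REG_EXTENDED) != 0) {
        return -1;
    }
    int rc = regexec(&re, input, 1, &m, 0);
    regfree(&re);

    if (rc == REG_NOMATCH) {
        /* Nothing to strip: pass the input through unchanged. */
        snprintf(address, address_size, "%s", input);
        if (phone_size > 0) {
            phone[0] = '\0';
        }
        return 0;
    }
    if (rc != 0) {
        return -1;
    }

    /* The matched span becomes the phone number. */
    snprintf(phone, phone_size, "%.*s",
             (int)(m.rm_eo - m.rm_so), input + m.rm_so);
    /* The address is the text before the match plus the text after it. */
    snprintf(address, address_size, "%.*s%s",
             (int)m.rm_so, input, input + m.rm_eo);
    return 1;
}
```

The cleaned address string can then be passed to the parser, with the phone number kept separately. This sketch removes only the first match and does not trim the whitespace left behind.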

For city districts, neighborhoods, etc., a new version of the parser is being trained (that work is going on in the parser-data branch). It has significantly better coverage around the world, including recognition of most of the districts and neighborhoods in Indian cities like New Delhi or Chennai. If you want, you can look up the specific places you're considering in OpenStreetMap to see whether they'll be covered in the next release. If not, it's easy to add places to OSM, and those changes will get picked up automatically on the next retraining.

ghost commented 8 years ago

Can I customise the international number regex in scanner.re, and will that take effect?

Also, can you please explain how to create labels for when the number is mentioned in the "address" input?

Thank you

albarrentine commented 8 years ago

Yes, you can customize it. After modifying scanner.re, run make lexer, which currently takes a few hours of CPU time. That generates scanner.c; then re-run make for the changes to take effect.

Creating a new parser label, at least the way it's being discussed here, requires modifying address_parser.c and might be a little more involved. I can look into it for the next release.