openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

UK locality + postal town handling #165

Open anshulsolanki opened 7 years ago

anshulsolanki commented 7 years ago

As I understand it, you parse addresses into the following components (using OpenCageData-address).

The question is: how do you suggest handling addresses like the following?

ARDENT WORLDWIDE LIMITED 13TH FLOOR ALDGATE TOWER 2 LEMAN STREET LONDON LONDON E1 8FA GBR

LOTUS WORKFORCE SOLUTIONS LTD HILL VIEW COTTAGE HOLMES LANE HANBURY BROMSGROVE WORCESTERSHIRE B60 4HH GBR

BARE ELECTRICAL LTD C/O ASCENDIS 2ND FLOOR 683-693 WILMSLOW ROAD MANCHESTER MANCHESTER M20 6RE GBR

These can contain components that do not fit the definitions above.

albarrentine commented 7 years ago

That's the subject of the parser-data branch, which will be ready to release shortly (along with the new models). The training set generated from parser-data is augmented with new tags like "unit", "level", "staircase", "entrance", "po_box" etc. which are rarely included in OSM addresses.

There is also an intermediate version of the new model that is backward compatible with master. It can be found at: https://libpostal.s3.amazonaws.com/mapzen_sample/parser_full.tar.gz. To use it, just unpack the contents of the tarball into $DATA_DIR/libpostal/address_parser, where $DATA_DIR is whatever you passed in during configure (the default is /usr/local/share). This doesn't require switching branches or anything; it's the same model as in master, trained on more/better data.
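For anyone scripting that step, here is a minimal sketch in Python of unpacking an already-downloaded tarball into the parser directory. The function name `install_parser_model` and its parameters are hypothetical, not part of libpostal. It also assumes the archive's contents can be extracted straight into the target directory, so it is worth checking the resulting layout afterwards.

```python
import os
import tarfile


def install_parser_model(tarball_path, data_dir="/usr/local/share"):
    """Unpack a parser tarball into <data_dir>/libpostal/address_parser.

    Hypothetical helper: data_dir should be whatever was passed as the
    data directory during configure (the default is /usr/local/share).
    """
    target = os.path.join(data_dir, "libpostal", "address_parser")
    os.makedirs(target, exist_ok=True)
    with tarfile.open(tarball_path, "r:gz") as archive:
        # Prefer the safer "data" extraction filter where this Python has it.
        if hasattr(tarfile, "data_filter"):
            archive.extractall(target, filter="data")
        else:
            archive.extractall(target)
    return target
```

Calling `install_parser_model("parser_full.tar.gz")` with the default data directory will usually need write permission to /usr/local/share.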

albarrentine commented 7 years ago

@anshulsolanki libpostal 1.0 has been released into master. See the new README for all the tags, but it handles things like unit, floor/level, etc. There are still a few issues with UK addresses, mainly when there's both a locality (e.g. Hanbury) and a postal town (e.g. Bromsgrove; note that Bromsgrove does not contain Hanbury, though that shouldn't matter). Both map to "city" in libpostal, but UK addresses very commonly have two cities, whereas our templates only allow one. We solve this in other cases, like "state_district" in Russia, by occasionally concatenating all the state_districts that an address could belong to. We don't currently do that for city, but something like it is probably needed for small towns in the UK.

With something like London London or Manchester Manchester, libpostal might also get confused because those names are not the correct or commonly-used names for the surrounding districts. It's London, Greater London and Manchester, Greater Manchester. You might be able to regex that out before sending the string to libpostal i.e. just replace "London London" with "London" and "Manchester Manchester" with "Manchester".
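A minimal sketch of that preprocessing step in Python, assuming the doubled names are known in advance. The `DOUBLED_NAMES` list and the `collapse_doubled_names` helper are hypothetical and not part of libpostal. The pattern is deliberately limited to listed names, since collapsing every repeated word would break genuine names such as "Walla Walla".

```python
import re

# Hypothetical list of names known to show up doubled in the input data.
DOUBLED_NAMES = ("London", "Manchester")

# Match a listed name followed by whitespace and the same name again.
_DOUBLED = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in DOUBLED_NAMES) + r")\s+\1\b",
    re.IGNORECASE,
)


def collapse_doubled_names(address):
    """Replace e.g. 'LONDON LONDON' with 'LONDON'; leave everything else alone."""
    return _DOUBLED.sub(r"\1", address)


cleaned = collapse_doubled_names(
    "ARDENT WORLDWIDE LIMITED 13TH FLOOR ALDGATE TOWER 2 LEMAN STREET "
    "LONDON LONDON E1 8FA GBR"
)
```

The cleaned string can then be sent to libpostal as usual. Addresses such as the Hanbury/Bromsgrove example pass through unchanged, since the two names differ.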