openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

Feature request: train the model with city names #114

Closed by steveha-ziprecruiter 7 years ago

steveha-ziprecruiter commented 7 years ago

When my company parses location strings, the strings are usually city names. "San Francisco" might be a common example, or "Columbus, Ohio".

I've experimented with libpostal and found I get better results by prepending "1234 Main Street" to location strings. It doesn't seem to cause any harm in the rare cases where we get a location string that already includes a street address (the parse just returns multiple address parts).
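The workaround above could be sketched roughly as follows. This is only an illustration: `parse_place` and `fake_parse` are hypothetical names, and the parser is passed in as a callable that returns `(value, label)` pairs. That is the shape pypostal's `parse_address` returns, so with pypostal installed it could presumably be passed in directly.

```python
from typing import Callable, List, Tuple

Component = Tuple[str, str]  # (value, label) pairs

DUMMY_STREET = "1234 Main Street"


def parse_place(query: str, parse: Callable[[str], List[Component]]) -> List[Component]:
    """Prepend a dummy street address, parse, then drop the dummy parts."""
    parts = parse(f"{DUMMY_STREET} {query}")
    # Dummy components we expect back, keyed by label (assumes lowercased output).
    remaining = {"house_number": "1234", "road": "main street"}
    cleaned = []
    for value, label in parts:
        if remaining.get(label) == value.lower():
            del remaining[label]  # drop only the first match for each label
            continue
        cleaned.append((value, label))
    return cleaned


def fake_parse(text: str) -> List[Component]:
    """Stand-in parser with canned output, NOT libpostal; only here so the sketch runs."""
    return [
        ("1234", "house_number"),
        ("main street", "road"),
        ("columbus", "city"),
        ("ohio", "state"),
    ]


if __name__ == "__main__":
    print(parse_place("Columbus, Ohio", fake_parse))
```

If the location string already contains a real street address, its components carry different values than the dummy ones, so they are left in the result.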

I request that the training data include city names like "Columbus, Ohio" or "Paris, France".

albarrentine commented 7 years ago

In the parser-data branch of libpostal, there's been a fair amount of work done on creating simple place name queries, breaking up imperfect city name fields (e.g. when someone enters addr:city="Columbus OH", breaking off the "OH" from the city name), and ensuring that each place can have only one label that is used consistently. Those improvements will be in the next release.
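To make the "Columbus OH" example concrete, here is a minimal sketch of splitting a trailing state abbreviation off a city field. It is not the code in the parser-data branch; `split_city_state` and the small abbreviation set are assumptions made up for illustration.

```python
from typing import Optional, Tuple

# Illustrative subset only; a real list would cover every state and territory.
US_STATE_ABBREVIATIONS = {"OH", "CA", "NY", "TX"}


def split_city_state(city_field: str) -> Tuple[str, Optional[str]]:
    """Split a trailing state abbreviation off an addr:city-style value."""
    tokens = city_field.replace(",", " ").split()
    # Require at least two tokens so a bare "OH" is not treated as city + state.
    if len(tokens) >= 2 and tokens[-1].upper() in US_STATE_ABBREVIATIONS:
        return " ".join(tokens[:-1]), tokens[-1].upper()
    return city_field.strip(), None


if __name__ == "__main__":
    print(split_city_state("Columbus OH"))
    print(split_city_state("San Francisco"))
```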

There is an early version of the new model available at https://libpostal.s3.amazonaws.com/mapzen_sample/parser_full.tar.gz if you want to try it out. To use it (this doesn't require switching branches or anything; it's the same model as in master, trained on new data), just unpack the contents of the tarball into $DATA_DIR/libpostal/address_parser, where $DATA_DIR is whatever you passed in during configure (the default is /usr/local/share, I believe).
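The unpacking step could look something like the sketch below. `install_parser_model` is a hypothetical helper, and it assumes the tarball has already been downloaded and that its members sit at the top level of the archive; if the archive wraps them in a directory, the files would need to be moved up afterwards.

```python
import tarfile
from pathlib import Path
from typing import Union


def install_parser_model(tarball: Union[str, Path],
                         data_dir: Union[str, Path] = "/usr/local/share") -> Path:
    """Unpack a parser model tarball into <data_dir>/libpostal/address_parser."""
    dest = Path(data_dir) / "libpostal" / "address_parser"
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:*" lets tarfile detect the compression (gzip here) automatically.
    # Only unpack archives from a source you trust.
    with tarfile.open(tarball, "r:*") as archive:
        archive.extractall(dest)
    return dest
```

For example, `install_parser_model("parser_full.tar.gz")` would target the default data directory mentioned above, which usually needs write permission (often root).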

albarrentine commented 7 years ago

Hey @steveha-ziprecruiter, the libpostal 1.0 release is trained with all the city names in OSM and their parent admins (I've also made the training data public, feel free to download/grep through the place names training set if something's not working as you expect). The new parser should perform quite well on simple city queries. There might be one or two minor issues with certain multiword place names getting broken up. Will try to work that out as well for the next training batch.

steveha-ziprecruiter commented 7 years ago

Thank you! We have this new version in production now! ^_^

albarrentine commented 7 years ago

That's awesome! So for the place search box I presume?

steveha-ziprecruiter commented 7 years ago

Yes. When a job seeker wants to find a job, libpostal is part of the pipeline that converts what the job seeker typed into a location. So anyone who looks for a job on our web site is being helped by libpostal.