openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

"Grd Floor" abbreviation #213

Open Jeeva-Ganesan opened 7 years ago

Jeeva-Ganesan commented 7 years ago

Hi, first of all, thanks for such a great contribution. This is an amazing project. We tried to use it for some addresses in New Zealand, but unfortunately it didn't perform as we expected. There is this dataset - https://data.linz.govt.nz/layer/3353-nz-street-address/ which provides all the NZ addresses. We are thinking of training on this dataset. You also mentioned in some of the previous issues that you are working on automating the preprocessing and training pipeline. Is that available now? Or can you train on this dataset?

albarrentine commented 7 years ago

Hey @Jeeva-Ganesan, I usually prefer to just make the global model better for everyone rather than have people train custom models. Usually if there are parsing problems at this point, it's either a "bug" in our training data which can be fixed or a pattern found in real-world data that's not in the training data, which can often be generated or otherwise remedied. Just tried a handful of NZ addresses and they seemed fine to me. Can you be more specific as to what didn't work well?

AFAICT that data set is part of OpenAddresses and so we train on the entire country of New Zealand, although the data set used by OpenAddresses appears to be a now-deprecated version, so it needs to be updated. Also, we use OSM boundaries/cities/suburbs instead of those provided in the original data set because the "locality_suburb" field contains a mix of cities and suburbs, which are separate fields in libpostal. As such, it's more convenient to use OSM's classifications throughout so the parser doesn't get confused by different labels for the same object when mixing datasets. So it's possible that it gets a few city names wrong if they're not in OSM. Since New Zealand is made up of islands, it might also be a place that suffers from the "orphan coastal cities" issue (#203 and https://github.com/openvenues/libpostal/issues/27#issuecomment-292740044), which is being fixed for v1.1.

Jeeva-Ganesan commented 7 years ago

Thanks for your reply. Here are some examples:

>>> from postal.expand import expand_address
>>> from postal.parser import parse_address
>>> expand_address('2A Carlton Street Auckland')
[u'2a carlton street auckland']
>>> expand_address('53 cook street Auckland')
[u'53 cook street auckland']
>>> parse_address('53 cook street, New Zeland')
[(u'53', u'house_number'), (u'cook street new zeland', u'road')]
>>> parse_address('1104,53 cook street, New Zealand')
[(u'1104,53', u'house_number'), (u'cook street', u'road'), (u'new zealand', u'country')]
>>> expand_address('Grd floor 53 cook street auckland')
[u'garden floor 53 cook street auckland']

  1. In the first example, neither address was expanded at all. It's returning exactly what I typed. I was expecting at least the country name to be added.
  2. In parse_address, if the country name is misspelt, it's not parsed as a country.
  3. In the third one, the flat number is returned as part of the house number.
  4. In the fourth example, "Grd floor" (ground floor) is expanded to "garden floor".

albarrentine commented 7 years ago

  1. That's not how expand_address works. It only does things like expanding synonyms. It does not know anything about geography or have the ability to infer components. Adding country, other admin components, etc. should really be the job of a geocoder or something backed by an actual place database. Libpostal's parser can help with searching said database, and can deal with some of the ambiguities around "which Auckland does the user mean?" but only if, say, one Auckland is a city and the other's a suburb (which I don't think is the case for Auckland).
  2. We don't do automatic spelling correction. For some things like road names libpostal can still get the correct parse even if a word is misspelled, but for place names like "New Zealand" they have to be spelled correctly and in our dictionaries. Users can implement their own spelling correction for out-of-vocabulary words as a preprocessing step if needed.
  3. "1104,53" is considered a single token when written without a space, so we can't break it up further. That's not such a common way of writing a flat number anyway (and even if it were, we couldn't possibly hope to know whether the right number or the left number should be house_number vs. unit in all countries, see #197). If you need to break it up further you can do it with a regex as a postprocessing step on libpostal's house_number field.
  4. Will add that abbreviation for ground floor. You can see in the dictionaries which abbreviations we currently have/use. If there's something missing, feel free to edit the text files and submit a PR (although the changes won't take effect immediately in the parser, needs to be retrained).
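
A minimal sketch of the two user-side steps suggested in points 2 and 3: spelling correction as a preprocessing step, and a regex split of the house_number field as a postprocessing step. None of this is part of libpostal. The names `KNOWN_PLACES`, `correct_place_names`, and `split_house_number` are hypothetical, the vocabulary is a toy stand-in for a real place list, and the `unit_first` default is an assumption that callers would need to set per country (see #197).

```python
import difflib
import re

# Hypothetical vocabulary; a real one would come from a place database.
KNOWN_PLACES = ["new zealand", "auckland", "wellington", "christchurch"]


def correct_place_names(address, vocabulary=KNOWN_PLACES, cutoff=0.85):
    """Preprocessing: replace near-miss place names, one comma-separated part at a time."""
    corrected = []
    for part in address.split(","):
        candidate = part.strip()
        # Fuzzy match against the vocabulary; keep the part unchanged if nothing is close.
        matches = difflib.get_close_matches(
            candidate.lower(), vocabulary, n=1, cutoff=cutoff
        )
        corrected.append(matches[0] if matches else candidate)
    return ", ".join(corrected)


# Matches exactly two alphanumeric tokens joined by a comma, e.g. "1104,53".
UNIT_HOUSE_RE = re.compile(r"^\s*(\w+)\s*,\s*(\w+)\s*$")


def split_house_number(value, unit_first=True):
    """Postprocessing: split a combined house_number value into unit and house_number.

    unit_first is an assumption about the local convention, not something
    that can be inferred from the string itself.
    """
    match = UNIT_HOUSE_RE.match(value)
    if match is None:
        return {"house_number": value}
    first, second = match.groups()
    if unit_first:
        return {"unit": first, "house_number": second}
    return {"house_number": first, "unit": second}


print(correct_place_names("53 cook street, New Zeland"))
print(split_house_number("1104,53"))
```

The corrected string would then be passed to parse_address, and split_house_number would be applied to the value that parse_address labels as house_number.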

albarrentine commented 7 years ago

Adding the new street addresses layer in OpenAddresses: https://github.com/openaddresses/openaddresses/pull/3010, so that should make it into 1.1. The suburb/city delineations are a little cleaner, which makes the source names usable with some simple logic. That would add more city/suburb names to supplement what's in OSM. It usually helps overall to get a more exhaustive list from a government source.