openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

Jamaican Address #113

Open javidalkaruzi opened 8 years ago

javidalkaruzi commented 8 years ago

I am trying to use this library to parse addresses in Jamaica. However, it does not parse these addresses well; in fact, it never detects Jamaica as a country. It detects it as a suburb or a house, or joins it with a parish and labels the result as a city. I am new to NLP but I don't mind contributing to the training for Jamaican addresses. A lot of data about Jamaica is already on OpenStreetMap, and I will add more as I see a need.

Sample addresses:

237 Old Hope Road, Kingston 6, Jamaica W.I
237 Old Hope Road, Kingston 6, Jamaica
Arthur Wint Drive, Kingston 5, Jamaica
16 1/2 Windward Road, Kingston 2, Jamaica
54 Lyssons Road, Morant Bay, St. Thomas, Jamaica, West Indies
Mt Salem, Montego Bay, St James, Jamaica
Unit 1 Spanish Town Commercial Centre, Spanish Town, St. Catherine, Jamaica
Unit 7, 1 Brumalia Rd., Mandeville, Manchester, Jamaica
Lot 213 6 East, Greater Portmore, Portmore, St. Catherine, Jamaica

Also:

West Indies (W.I or WI) is optional
Jamaica does not use zip codes

Kingston is a parish in Jamaica
Kingston is a city and the capital of Jamaica
Kingston (the city) is in the parishes of Saint Andrew and Kingston

Kingston 2, Kingston 5 and Kingston 6 above are postal codes. Currently only the city of Kingston has postal codes. There was a project to assign postal codes to the rest of the island, but it was suspended indefinitely in 2007. Codes such as JMAAW10, JMDCN24 and JMBMY13 were among the codes proposed by that project and are not in use.

Please let me know how best I can contribute.

albarrentine commented 8 years ago

Hi Javid,

Thanks for writing in, and for these examples.

When the original version of libpostal was trained (the version currently in master), we didn't have address formats for a number of countries, including Jamaica, so none of the Jamaican addresses in OpenStreetMap made it into the first training set, which is why it doesn't currently even know what to do with the country name. The formats have been added in the address-formatting repo that libpostal uses, so Jamaican addresses are definitely being incorporated into the next training.

There's a massive new update coming soon from the parser-data branch, which has more comprehensive dictionaries to help with place names. Jamaican parishes in this case are mapped to "state," which is the common term libpostal uses for all first-level administrative divisions, but in any case, these should be parsed correctly when the new model is available.

It looks like OSM does not have distinct parishes for Kingston and St Andrew, just the Kingston and St. Andrew Corporation. If it's important to be able to parse those names separately, the parish boundaries will have to be added to OSM, though from the addresses you posted, it doesn't seem like the parish is included anyway for Kingston.

Looking through the few addresses currently in OSM around Kingston, it looks like postal codes are inconsistently labeled. All of these variations occur in OSM:

The first version seems incorrect, and we can normalize that on the libpostal side without editing everything in OSM; we just need to decide on a convention for what the correct labels should be. I think I'd prefer the 2nd and 4th versions (it seems like no one would write "Kingston, Kingston 5" so the city should probably be separated). There was only one reference to the "JMAAW10"-style postcodes in OSM, so no need to worry about that unless/until that system gets implemented.

Not every postcode in Kingston was listed on an OSM address (including Trench Town's "Kingston 12", which I'm now realizing might be the first and only postal code used in popular music), so I added the "postal_code" tag to most of the neighborhoods in Kingston in OSM. As I understand it, "addr:postcode" is for buildings/individual addresses and "postal_code" is for postcodes that apply to areas, neighborhoods, etc. In the new model, we're able to use all of the named places in OSM as training examples for the parser, and if they are tagged with postal codes, those are used as well. This will help the parser recognize most place names even if there's not great address coverage (i.e. outside of Kingston and Spanish Town).
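To make the distinction between the two OSM postcode tags concrete, here is a minimal Python sketch of reading a postcode from an element's tags. It is not libpostal's actual training code: the function name `postcode_from_tags` and both example tag dictionaries are made up for illustration, and the preference order (address tag first, area tag second) is an assumption.

```python
def postcode_from_tags(tags):
    """Return (postcode, source_tag) for an OSM element, or (None, None).

    Assumption: "addr:postcode" (individual addresses) is checked before
    "postal_code" (areas such as neighborhoods).
    """
    for key in ("addr:postcode", "postal_code"):
        value = tags.get(key, "").strip()
        if value:
            return value, key
    return None, None


# Illustrative tag sets, not copied from OSM.
trench_town = {"place": "neighbourhood", "name": "Trench Town",
               "postal_code": "Kingston 12"}
building = {"addr:housenumber": "237", "addr:street": "Old Hope Road",
            "addr:postcode": "Kingston 6"}

print(postcode_from_tags(trench_town))  # ('Kingston 12', 'postal_code')
print(postcode_from_tags(building))     # ('Kingston 6', 'addr:postcode')
```

With something like this, a named place with no street addresses can still contribute a (place name, postcode) pair as a training example.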

Not sure how libpostal should handle something like "Jamaica, West Indies". One option is to create a new address component that's larger than country (world_region or something like that) and add that component randomly in the training data for certain countries. This form is also used in other English-speaking Caribbean countries like St. Vincent and the Grenadines. The other option is to randomly append West Indies to the country name, so "Jamaica, West Indies" would be treated as a single country tag. Thoughts?

I will revisit the addresses in this ticket when the new model is available in a few weeks. I'll be pulling down a fresh copy of OSM this weekend, so feel free to add some places/addresses or make edits to OSM Jamaica and they'll be incorporated into the next release.

./al

javidalkaruzi commented 8 years ago

Hello,

I am in the process of mapping out distinct parishes for Kingston and St. Andrew. I should update OSM over the weekend.

You are correct in saying that no one writes Kingston, Kingston 5. The city and parish are usually omitted for those addresses.

Concerning Jamaica, West Indies, I think an entity larger than country should be created as the West Indies is indeed a region.

albarrentine commented 8 years ago

Right on. I'll pick up the next dump of OSM when it finishes and get the new parishes in there. For West Indies, I'm adding it randomly to 10% of the addresses in all Caribbean nations where English is the primary language. That should give the parser plenty of examples to work with. Note: these changes will only take effect once the new model is trained/deployed.
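As a rough illustration of that kind of random augmentation, here is a Python sketch. It is not the actual libpostal training code: the country list, the `world_region` key (the name floated earlier in the thread), and the function name are assumptions made for the example. Only the 10% rate and the "West Indies" / "W.I" / "WI" variants come from this discussion.

```python
import random

# Hypothetical subset, for illustration only; the thread just says "all
# Caribbean nations where English is the primary language".
ENGLISH_CARIBBEAN = {"jm", "bb", "tt", "vc"}
WEST_INDIES_PROB = 0.10  # the 10% figure mentioned above
VARIANTS = ["West Indies", "W.I", "WI"]


def maybe_add_west_indies(components, country_code, rng):
    """Return a copy of components, sometimes with a world_region added."""
    result = dict(components)
    if country_code in ENGLISH_CARIBBEAN and rng.random() < WEST_INDIES_PROB:
        result["world_region"] = rng.choice(VARIANTS)
    return result


rng = random.Random(0)  # seeded so the sketch is repeatable
example = {"house_number": "54", "road": "Lyssons Road", "city": "Morant Bay",
           "state": "St. Thomas", "country": "Jamaica"}

augmented = [maybe_add_west_indies(example, "jm", rng) for _ in range(10000)]
share = sum("world_region" in a for a in augmented) / len(augmented)
print(f"{share:.1%} of generated examples include a world region")
```

Because the extra component is sampled per training example, most Jamaican addresses still end at the country name, which matches how "West Indies" is optional in practice.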

javidalkaruzi commented 8 years ago

On Saturday I added the parishes of Kingston and St. Andrew.

I noticed that you updated the parish and specified that Kingston Parish is the official name. We do not use the word Parish in the name. Unfortunately I think that convention is coming from Wikipedia in order to distinguish articles.

Also, I noticed that the administrative levels on OpenStreetMap tag the Jamaican parishes as counties and the counties as states. It should be the other way around. However, unlike in the USA, for example, where states contain counties, in Jamaica the counties contain the parishes. As such, I have left those unedited for the time being.

albarrentine commented 8 years ago

Ah, my mistake. The Wikipedia names should still probably be included in at least one of the OSM name fields though ("alt_name" instead of "official_name" maybe). For libpostal it's helpful to list several alternate names, so that many different names will match the gazetteer. Even if people never write "Kingston Parish", some address data sets may come from a database that automatically assigns place names, or may be results returned from a reverse geocoder, etc. The default name displayed on maps, etc. will always be whatever's in the "name" field, and that's what will be used the majority of the time in libpostal's training data as well.
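A small Python sketch of why extra name fields help: every name collected from a place's tags becomes another string that can match. This is only an illustration, not libpostal's gazetteer code; the function name, the choice of name keys, and the example tag values are assumptions.

```python
# OSM name keys considered in this sketch (an assumed selection).
NAME_KEYS = ("name", "alt_name", "official_name", "short_name")


def gazetteer_names(tags):
    """Collect distinct names from an OSM tag dict, default name first."""
    names = []
    for key in NAME_KEYS:
        value = tags.get(key)
        if not value:
            continue
        # OSM tags may hold several values separated by semicolons.
        for part in value.split(";"):
            part = part.strip()
            if part and part not in names:
                names.append(part)
    return names


# Illustrative tags for the parish boundary, not copied from OSM.
kingston = {"name": "Kingston", "alt_name": "Kingston Parish",
            "boundary": "administrative"}

print(gazetteer_names(kingston))  # ['Kingston', 'Kingston Parish']
```

Keeping "Kingston" in the "name" field and "Kingston Parish" in a secondary field leaves the map label unchanged while still letting the second form match.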

I think OSM just has admin_level 1-10, and each country's OSM community chooses how to assign those levels. The labelling in their search interface comes from Nominatim, which seems to use some US conventions.

In libpostal parlance, "state" just means whatever the first-level administrative division is and it's individually mapped per country. For Jamaica, parishes are states and for the counties I was planning on using one of our new tags, "country_region", which is a tag used for geographic regions or historical divisions of a country that don't have a current political or administrative function and are seldom used.
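The per-country mapping can be pictured as a lookup table. The Python sketch below is purely illustrative: the table layout, the keys, and the `label_for` helper are assumptions, and only the Jamaican assignments (parishes as "state", counties as "country_region") come from the comment above.

```python
# Hypothetical lookup: (country code, local division type) -> libpostal label.
ADMIN_LABELS = {
    ("jm", "parish"): "state",           # first-level division in Jamaica
    ("jm", "county"): "country_region",  # historical, seldom used
    ("us", "state"): "state",            # first-level division in the USA
}


def label_for(country_code, division_type):
    """Return the parser label for a division type, or None if unmapped."""
    return ADMIN_LABELS.get((country_code.lower(), division_type.lower()))


print(label_for("JM", "Parish"))  # state
print(label_for("JM", "County"))  # country_region
```

Keying on the country keeps the label independent of whatever the local word for the division is, which is why a Jamaican parish and a US state end up with the same label.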

albarrentine commented 7 years ago

Hey @javidalkaruzi, libpostal 1.0 has been released into master and does much better on Jamaican addresses. Performs flawlessly on most of the above. Your Windward Road example above is used in the new demo and test cases as well.

The only problem I still see above is with some of the parish names: it does better with "Saint" than with the "St" abbreviation. This is probably because we don't have too many OSM addresses for Jamaica, and the OSM names are "Saint Catherine", "Saint James", etc. We abbreviate toponyms some proportion of the time to avoid this, but I think in the parishes outside Kingston there are so few addresses that it never happens. It might be a good idea to add "short_name=St. Catherine", etc. to every parish boundary that frequently uses abbreviations so that we're guaranteed to get those names by hook or by crook.
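Here is a hedged Python sketch of the random abbreviation idea and why it can miss when there are few addresses. It is not libpostal's implementation: the function name and the 0.3 probability are made up for illustration, since the comment above only says "some proportion of the time".

```python
import random


def maybe_abbreviate(name, rng, prob=0.3):
    """Rewrite a leading "Saint " as "St. " with probability prob.

    prob is a hypothetical value chosen for this sketch.
    """
    prefix = "Saint "
    if name.startswith(prefix) and rng.random() < prob:
        return "St. " + name[len(prefix):]
    return name


rng = random.Random(42)  # seeded so the sketch is repeatable
parishes = ["Saint Catherine", "Saint James", "Saint Thomas"]
for parish in parishes:
    print(maybe_abbreviate(parish, rng))
```

Under these assumed numbers, a parish that appears in only three training addresses has a 0.7 ** 3 ≈ 34% chance of never being abbreviated at all, which is the gap an explicit "short_name" tag would close.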

javidalkaruzi commented 7 years ago

@thatdatabaseguy That is good news. I have tested it on some addresses here and I am happy with the results so far. I have added the short names for parishes and will soon start adding them for other areas that use them.

albarrentine commented 7 years ago

Sounds good sir! You might also be interested to know we've started importing a countrywide data set in OpenAddresses, another project I contribute to that gets imported by libpostal. https://github.com/openaddresses/openaddresses/pull/2301. It's not complete but has ~20k addresses around the country. Might be useful as an address point db for geocoding, etc.