openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

KZN abbreviation for KwaZulu-Natal, South Africa #234

Open coachwei opened 7 years ago

coachwei commented 7 years ago

Input: 11 Livingstone Road, Pinetown, Durban, KZN, 3610, South Africa

Parsing result: house_number: 11, road: livingstone road pinetown durban kzn, postcode: 3610, country: south africa

It should be: house_number: 11, road: livingstone road, town: pinetown, city: durban, state: kzn, postcode: 3610, country: south africa

albarrentine commented 7 years ago

Looks like the only issue there is the "KZN" abbreviation.

11 Livingstone Road, Pinetown, Durban, KwaZulu-Natal, 3610, South Africa works perfectly.

That abbreviation does appear to exist in OSM, but it uses a rare tag, "name:short", which wasn't on the list of possible name tags. Some of the training data for v1.1 has already been generated, but since KwaZulu-Natal is a state and the fix doesn't require OSM edits, that name should make it into v1.1.
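Until then, one possible workaround is to expand the abbreviation before the address reaches the parser, since the full name already parses correctly. A minimal sketch, assuming a hand-maintained mapping (`STATE_ABBREVIATIONS` and `expand_state_abbreviations` are hypothetical names, not part of libpostal):

```python
import re

# Hypothetical, hand-maintained mapping; not part of libpostal.
STATE_ABBREVIATIONS = {"KZN": "KwaZulu-Natal"}


def expand_state_abbreviations(address, mapping=STATE_ABBREVIATIONS):
    """Replace whole alphabetic tokens found in `mapping` with their full names."""

    def replace(match):
        token = match.group(0)
        # Look up case-insensitively; leave unknown tokens untouched.
        return mapping.get(token.upper(), token)

    return re.sub(r"[A-Za-z]+", replace, address)


raw = "11 Livingstone Road, Pinetown, Durban, KZN, 3610, South Africa"
expanded = expand_state_abbreviations(raw)
print(expanded)
```

The expanded string can then be passed to the parser as usual. Matching whole tokens only avoids rewriting letters that happen to appear inside longer words.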

There is no label for "town", only "city", which is used for towns, villages, and metropolises alike. Adding labels increases the model size, so the preference is always to fit place names into the existing categories, which cover most cases adequately. Since toponyms are based on dictionaries of known phrases, it might be possible to split contiguous phrases that share a label (here "pinetown" and "durban", both "city") into separate strings in the response. The catch is that many users of libpostal make the invalid assumption that there can be only one of each component, and try to format the output as a hash table rather than a list of tuples.
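To illustrate why the hash-table assumption breaks, here is a minimal sketch. The `parsed` list is hypothetical output in the list-of-tuples shape described above, with "city" appearing twice; it is not actual output from the parser, and `group_by_label` is an illustrative helper.

```python
from collections import defaultdict

# Hypothetical parser output: (value, label) tuples, with "city" repeated.
parsed = [
    ("11", "house_number"),
    ("livingstone road", "road"),
    ("pinetown", "city"),
    ("durban", "city"),
    ("kwazulu-natal", "state"),
    ("3610", "postcode"),
    ("south africa", "country"),
]


def group_by_label(components):
    """Collect every value under its label, preserving input order."""
    grouped = defaultdict(list)
    for value, label in components:
        grouped[label].append(value)
    return dict(grouped)


# Naive conversion: a later "city" silently overwrites an earlier one.
lossy = {label: value for value, label in parsed}

# Safer conversion: each label maps to a list of values.
grouped = group_by_label(parsed)

print(lossy["city"])
print(grouped["city"])
```

Code that consumes the list of tuples directly, or groups values into lists as above, keeps working whether a label appears once or several times.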