openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

Netherlands - separating postcode into postcode & road #195

Open mkaranta opened 7 years ago

mkaranta commented 7 years ago

I've encountered an issue with parsing addresses from the Netherlands. Their postal codes are usually of the form "DDDD XX", where D is a digit and X is a letter (source).

Libpostal seems to classify the 2 suffix characters as a road.

 Easterboarn 11, 8495 NB Aldeboarn, Netherlands

Result:

{
  "road": "easterboarn",
  "house_number": "11",
  "postcode": "8495",
  "road": "nb",
  "city": "aldeboarn",
  "country": "netherlands"
}

This is consistent across my (small) test suite of 8,000 NL addresses.

albarrentine commented 7 years ago

We train on virtually every address in the Netherlands but I'm pretty sure the format used both in OSM and our Netherlands countrywide data set from OpenAddresses is "8495NB" with no space, so that's essentially the only form libpostal knows.

> Easterboarn 11, 8495NB Aldeboarn, Netherlands

Result:

{
  "road": "easterboarn",
  "house_number": "11",
  "postcode": "8495nb",
  "city": "aldeboarn",
  "country": "netherlands"
}

I can separate them out randomly during training data generation for the next batch though.
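As a rough illustration of what "separating them out randomly" could look like, here is a minimal Python sketch. It is not libpostal's training code; the helper name `vary_nl_postcode`, the `space_prob` parameter, and the regex are all assumptions made for this example.

```python
import random
import re

# Hypothetical helper for illustration; not part of libpostal.
# Assumes a Dutch postcode is four digits plus two letters, optionally spaced.
NL_POSTCODE = re.compile(r"^(\d{4})\s?([A-Za-z]{2})$")


def vary_nl_postcode(postcode, space_prob=0.5, rng=random):
    """Return the postcode with a space inserted with probability space_prob."""
    match = NL_POSTCODE.match(postcode.strip())
    if match is None:
        # Not a Dutch-style postcode: leave it untouched.
        return postcode
    digits, letters = match.groups()
    separator = " " if rng.random() < space_prob else ""
    return f"{digits}{separator}{letters}"
```

Applied to each postcode while generating training examples, this would expose the model to both "8495NB" and "8495 NB".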

mkaranta commented 7 years ago

That explains that.

I made most of the NL test cases using https://github.com/openvenues/address-formatting which introduced the space into the addresses (it was infrequently present before):

# Netherlands
NL:
    address_template: *generic1
    postformat_replace:
        # fix the postcode to make it \d\d\d\d \w\w
        - ["\n(\\d{4})(\\w{2}) ","\n$1 $2 "]
        - ["\nKoninkrijk der Nederlanden$","\nNederland"]

I was under the impression that libpostal was trained on a subset of addresses formatted using this formatter, but perhaps I am wrong.
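To make the effect of the quoted NL `postformat_replace` rules concrete, here is a small Python sketch that applies the same two regexes to a multi-line address. It only mirrors the rules as quoted above (with `$1` written as `\1`); the function name `apply_postformat` is made up for this example and is not the formatter's actual code.

```python
import re

# The two NL rules from the quoted config, translated to Python regex syntax.
# Illustration only; the real formatter is a separate project.
NL_POSTFORMAT_REPLACE = [
    (r"\n(\d{4})(\w{2}) ", r"\n\1 \2 "),
    (r"\nKoninkrijk der Nederlanden$", r"\nNederland"),
]


def apply_postformat(text):
    """Apply each (pattern, replacement) rule in order."""
    for pattern, replacement in NL_POSTFORMAT_REPLACE:
        text = re.sub(pattern, replacement, text)
    return text
```

Note that the first rule only fires when the postcode starts a line and is written without a space, which is why the formatter introduces the space into addresses that did not have it.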

albarrentine commented 7 years ago

We do use address-formatting, but only the format templates themselves. The postformat_replace directives are more relevant to OpenCage's goals, i.e. displaying a single representation of geocoder results in the format users would expect in each country. For libpostal we're more interested in reconstructing multiple potential forms of user input so the parser can handle whatever it may encounter at runtime.

As such, we end up using a number of different formats for each country plus adding some components that don't exist in that repo such as unit/level, etc. but the formatting regexes are not quite as relevant for libpostal's case. We can add our own postprocessing configs as needed.

antimirov commented 6 years ago

Hm, I also noticed that the only country with postal code problems in my tests was the Netherlands. It's about that space: '2331 SK'. About 30% of the cases that we observe contain the space. Does it take a long time to re-train the model if I do it myself? Sorry, I'm new to libpostal (but it's awesome!)

albarrentine commented 6 years ago

@antimirov yes it does take a fairly long time and a lot of computational resources to train the model (it's more about generating new training data, which is an involved process). Netherlands postcodes will be fixed in the 1.1 release, but I don't have a timeline on that currently as I'm involved with other projects.

Since Netherlands postcodes can be identified with a regex (see https://chromium-i18n.appspot.com/ssl-address/data/NL), I would suggest preprocessing the input for the Netherlands, removing the space, and then feeding the input to libpostal.
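A minimal Python sketch of that preprocessing step follows. The pattern is an assumption for this example (four digits, optional whitespace, two letters, on word boundaries) rather than the exact regex from the linked data, and the function name `strip_nl_postcode_space` is hypothetical.

```python
import re

# Assumed pattern: four digits, one optional whitespace character, two letters.
# Word boundaries avoid matching inside longer numbers or words.
NL_POSTCODE_IN_TEXT = re.compile(r"\b(\d{4})\s?([A-Za-z]{2})\b")


def strip_nl_postcode_space(address):
    """Rewrite '8495 NB' as '8495NB' wherever the pattern occurs."""
    return NL_POSTCODE_IN_TEXT.sub(r"\1\2", address)


if __name__ == "__main__":
    print(strip_nl_postcode_space("Easterboarn 11, 8495 NB Aldeboarn, Netherlands"))
```

The cleaned string can then be handed to libpostal. This should only be applied to input already known to be Dutch, since the pattern would also join a four-digit house number to a following two-letter word.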

cottton commented 4 years ago

Still a problem.

Just for those who need a quick fix (as said by @albarrentine) (PHP)

// tools like the DHL validation API expect "1234 AB"
$postcode = preg_replace('/\b(\d{4})\s?([A-Z]{2})\b/', '${1} ${2}', $postcode);

// before you call libpostal, which only knows "1234AB", remove the space
$addressString = preg_replace('/\b(\d{4})\s?([A-Z]{2})\b/', '${1}${2}', $addressString);

enqueue commented 4 years ago

The Universal Postal Union publication for NL states:

6 alphanumeric characters (4 digits and 2 letters, with a space between the digits and the letters)

rochlefebvre commented 2 years ago

Hi @albarrentine!

I'd like to bring some attention to this old issue, in case there is something libpostal can do to support NL postal codes having spaces.

I have access to historical shipping & billing addresses coming from several e-commerce platforms: a quick survey of 2000 NL addresses shows a 50/50 split between "1234 AB" and "1234AB".

Interestingly, OSM only handles postal codes without a space: 1078GA works, but 1078 GA fails.
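For anyone who wants to run a similar survey on their own data, here is a rough Python sketch that tallies how many addresses use each postcode form. The regexes and the function name `postcode_format_counts` are assumptions for this example, and the sketch does not reproduce the survey described above.

```python
import re
from collections import Counter

# Assumed patterns for the two forms discussed in this thread.
SPACED = re.compile(r"\b\d{4}\s[A-Za-z]{2}\b")   # e.g. "1234 AB"
JOINED = re.compile(r"\b\d{4}[A-Za-z]{2}\b")     # e.g. "1234AB"


def postcode_format_counts(addresses):
    """Count addresses by postcode form: 'spaced', 'joined', or 'none'."""
    counts = Counter()
    for address in addresses:
        if SPACED.search(address):
            counts["spaced"] += 1
        elif JOINED.search(address):
            counts["joined"] += 1
        else:
            counts["none"] += 1
    return counts
```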

choeflake commented 1 year ago

The Universal Postal Union publication for NL states:

6 alphanumeric characters (4 digits and 2 letters, with a space between the digits and the letters)

The major Dutch postal provider (PostNL) also requires a space in the postal code.

See: https://www.postnl.nl/versturen/brief-of-kaart-versturen/hoe-verstuur-ik-een-brief-of-kaart/brief-adresseren/

[Image with translated text]