Open mkaranta opened 7 years ago
We train on virtually every address in the Netherlands but I'm pretty sure the format used both in OSM and our Netherlands countrywide data set from OpenAddresses is "8495NB" with no space, so that's essentially the only form libpostal knows.
> Easterboarn 11, 8495NB Aldeboarn, Netherlands
Result:
{
"road": "easterboarn",
"house_number": "11",
"postcode": "8495nb",
"city": "aldeboarn",
"country": "netherlands"
}
I can separate them out randomly during training data generation for the next batch though.
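A minimal sketch of what that kind of augmentation could look like, in Python. This is not libpostal's actual training-data code: the helper name `augment_nl_postcode` and the `space_probability` parameter are hypothetical and only illustrate the idea of emitting both forms.

```python
import random
import re

# Compact Dutch postcode: four digits followed directly by two letters.
NL_POSTCODE_COMPACT = re.compile(r"^(\d{4})([A-Za-z]{2})$")


def augment_nl_postcode(postcode, rng, space_probability=0.5):
    """Randomly insert a space into a compact Dutch postcode.

    Hypothetical helper for illustration only. Values that do not look
    like a compact NL postcode are returned unchanged.
    """
    match = NL_POSTCODE_COMPACT.match(postcode)
    if match is None:
        return postcode
    if rng.random() < space_probability:
        return f"{match.group(1)} {match.group(2)}"
    return postcode


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Collect the distinct forms produced over many draws.
variants = {augment_nl_postcode("8495NB", rng) for _ in range(100)}
```

With enough draws, `variants` contains both "8495NB" and "8495 NB", so a parser trained on the output would see each form.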
That explains that.
I made most of the NL test cases using https://github.com/openvenues/address-formatting which introduced the space into the addresses (it was infrequently present before):
# Netherlands
NL:
    address_template: *generic1
    postformat_replace:
        # fix the postcode to make it \d\d\d\d \w\w
        - ["\n(\\d{4})(\\w{2}) ","\n$1 $2 "]
        - ["\nKoninkrijk der Nederlanden$","\nNederland"]
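To illustrate the effect of the first rule, here is a rough Python equivalent of that substitution. The sample formatted address is an assumption made up for the example, and this is not how address-formatting itself applies the rule.

```python
import re

# Assumed formatter output: one address component per line,
# with the postcode at the start of the second line.
formatted = "Easterboarn 11\n8495NB Aldeboarn\nNederland"

# Python equivalent of ["\n(\\d{4})(\\w{2}) ", "\n$1 $2 "]:
# split a 4+2 character postcode at the start of a line with a space.
with_space = re.sub(r"\n(\d{4})(\w{2}) ", r"\n\1 \2 ", formatted)
```

Here `with_space` has "8495 NB Aldeboarn" on its second line, which is how the space ended up in the test addresses.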
I was under the impression that libpostal was trained on a subset of addresses formatted using this formatter, but perhaps I am wrong.
We do use address-formatting, but only the format templates themselves. The postformat_replace directives are more relevant to OpenCage's goals, i.e. displaying a single representation of geocoder results in the format users would expect in each country. For libpostal we're more interested in reconstructing multiple potential forms of user input so the parser can handle whatever it may encounter at runtime.
As such, we end up using a number of different formats for each country, plus some components that don't exist in that repo, such as unit/level. The formatting regexes are not quite as relevant for libpostal's case, and we can add our own postprocessing configs as needed.
Hm, I also noticed here that the only country with postal code problems in my tests was the Netherlands. It's about that space: '2331 SK'. About 30% of the cases that we observe contain the space. Does it take a long time to re-train the model if I do it myself? Sorry, I'm new to libpostal (but it's awesome!)
@antimirov yes it does take a fairly long time and a lot of computational resources to train the model (it's more about generating new training data, which is an involved process). Netherlands postcodes will be fixed in the 1.1 release, but I don't have a timeline on that currently as I'm involved with other projects.
Since Netherlands postcodes can be identified with a regex (see https://chromium-i18n.appspot.com/ssl-address/data/NL), I would suggest preprocessing the input for the Netherlands, removing the space, and then feeding the input to libpostal.
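A minimal Python sketch of that preprocessing step. The pattern below (four digits, optional whitespace, two letters) is a simplifying assumption and may be looser than the one in the linked data. The helper name `strip_nl_postcode_space` is hypothetical.

```python
import re

# Assumed pattern: four digits, one optional whitespace character,
# two letters, bounded by word boundaries on both sides.
NL_POSTCODE = re.compile(r"\b(\d{4})\s?([A-Za-z]{2})\b")


def strip_nl_postcode_space(address):
    """Collapse "1234 AB" to "1234AB" wherever the pattern matches.

    Hypothetical helper: only call it on input already known to be a
    Dutch address, since the pattern can match unrelated text such as
    a four-digit number followed by a two-letter word.
    """
    return NL_POSTCODE.sub(r"\1\2", address)


cleaned = strip_nl_postcode_space(
    "Easterboarn 11, 8495 NB Aldeboarn, Netherlands"
)
```

The resulting `cleaned` string carries the postcode as "8495NB" and can then be passed to the parser, for example `parse_address` in the pypostal bindings.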
Still a problem.
For those who need a quick fix, as suggested by @albarrentine (PHP):
// Tools like the DHL validation API expect "1234 AB".
$postcode = preg_replace('/\b(\d{4})\s?([A-Z]{2})\b/', '${1} ${2}', $postcode);
// libpostal expects "1234AB", so strip the space before calling it.
// The \b anchors keep the pattern from matching inside longer numbers or words.
$addressString = preg_replace('/\b(\d{4})\s?([A-Z]{2})\b/', '${1}${2}', $addressString);
Hi @albarrentine!
I'd like to bring some attention to this old issue, in case there is something libpostal can do to support NL postal codes having spaces.
I have access to historical shipping & billing addresses coming from several e-commerce platforms: a quick survey of 2000 NL addresses shows a 50/50 split between "1234 AB" and "1234AB".
Interestingly, OSM only handles postal codes without a space: 1078GA works, but 1078 GA fails.
The Universal Postal Union publication for NL states:
6 alphanumeric characters (4 digits and 2 letters, with a space between the digits and the letters)
The major Dutch postal provider (PostNL) also requires a space in the postal code.
See: https://www.postnl.nl/versturen/brief-of-kaart-versturen/hoe-verstuur-ik-een-brief-of-kaart/brief-adresseren/
I've encountered an issue with parsing addresses from the Netherlands. Their postal codes are usually of the form "DDDD XX", where D is a digit and X is a letter (source). Libpostal seems to classify the two suffix characters as a road.
This is consistent across my (small) test suite of 8,000 NL addresses.