Open lamquiem opened 1 year ago
https://fr.wikipedia.org/wiki/Poggio-di-Venaco says the postcode is 20250. 2B238 seems to be the INSEE code?
Yes, guessing that postcode format doesn't exist in the training data (you can type .print_features in the address_parser CLI and then try an address to see what the model is doing and where it might get stuck). Libpostal is not based on regexes, other than to split strings into words. Using 20250 works, for instance, because it is a common postcode format, and we also have geographic context dictionaries that help identify postal codes from known geographic contexts (these probably include the 20250 version as well).
1 rue saint roch 20250 poggio-di-venaco
{
"house_number": "1",
"road": "rue saint roch",
"postcode": "20250",
"city": "poggio-di-venaco"
}
You can use a regex to extract/remove postcodes following that pattern and reparse the remainder, e.g. something like this will usually work. If you're sending the result to Elasticsearch, you can add the extracted postcode back in if needed (postcode may be more selective than city, etc.).
1 rue saint roch poggio-di-venaco
{
"house_number": "1",
"road": "rue saint roch",
"city": "poggio-di-venaco"
}
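The extract-and-reparse approach described above could be sketched roughly as follows. This is a minimal sketch, not libpostal functionality: `POSTCODE_RE`, `split_postcode` and `parse_with_postcode` are hypothetical names, the pattern is the one suggested in the issue with word boundaries added, and the parser is passed in as a callable so the snippet runs without libpostal installed (with the Python bindings you would pass `postal.parser.parse_address`).

```python
import re

# Pattern from the issue, wrapped in word boundaries (an assumption) so it
# only matches standalone five-character tokens such as 20250 or 2B238.
POSTCODE_RE = re.compile(r"\b\d[0-9aAbB]\d{3}\b")


def split_postcode(address):
    """Return (postcode or None, address with the postcode removed)."""
    matches = list(POSTCODE_RE.finditer(address))
    if not matches:
        return None, address
    # Assumption: the postcode is the last matching token, since a
    # five-digit house number would appear earlier in the string.
    m = matches[-1]
    remainder = address[:m.start()] + address[m.end():]
    # Collapse the double space left behind by the removal.
    return m.group(0), " ".join(remainder.split())


def parse_with_postcode(address, parse):
    """Parse `address` with `parse`, handling the postcode separately.

    `parse` is any callable returning a list of (value, label) pairs,
    e.g. postal.parser.parse_address.
    """
    postcode, remainder = split_postcode(address)
    components = list(parse(remainder))
    if postcode is not None:
        # Add the extracted postcode back, lowercased like the parser output.
        components.append((postcode.lower(), "postcode"))
    return components
```

With this sketch, `split_postcode('1 rue saint roch 2B238 poggio-di-venaco')` returns `('2B238', '1 rue saint roch poggio-di-venaco')`, and the second string is what gets handed to the parser.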
Hi!
I was checking out libpostal, and saw something that could be improved.
My country is France
Here's how I'm using libpostal
We use libpostal to parse addresses before searching with Elasticsearch.
Here's what I did
parse_address('1 rue saint roch 2B238 poggio-di-venaco', language='fr', country='fr')
Here's what I got
[('1', 'house_number'), ('rue saint roch 2b238', 'road'), ('poggio-di-venaco', 'city')]
Here's what I was expecting
[('1', 'house_number'), ('rue saint roch', 'road'), ('2b238','postcode'), ('poggio-di-venaco', 'city')]
For parsing issues, please answer "yes" or "no" to all that apply.
If the address uses a rare/uncommon format, does changing the order of the fields yield the correct result? no
Here's what I think could be improved
Is it possible to specify that French postcodes are of the form (\d[0-9aAbB]\d{3}) when parsing? The codes '2A' and '2B' correspond to the two Corsican departments of France. OpenStreetMap treats them as '20', but this does not reflect reality. Is it possible to make libpostal recognise this pattern?
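For reference, here is a quick check of what the proposed pattern accepts, using Python's `re` module. It only exercises the regular expression itself and says nothing about libpostal's behaviour; the name `french_code` and the sample strings are illustrative.

```python
import re

# The pattern proposed above; fullmatch requires the whole token to have
# the five-character form.
french_code = re.compile(r"\d[0-9aAbB]\d{3}")

samples = ["20250", "2B238", "2a004", "2C238", "2B23"]
accepted = [s for s in samples if french_code.fullmatch(s)]
print(accepted)
```

The pattern is fairly loose: because A/B is allowed after any leading digit, it would also accept a string such as 9B123, so a stricter variant may be preferable if false positives matter.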