openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

Postcode recognition in France #638

Open lamquiem opened 1 year ago

lamquiem commented 1 year ago

Hi!

I was checking out libpostal, and saw something that could be improved.


My country is France


Here's how I'm using libpostal

We use libpostal to parse addresses before searching with elasticsearch.

Here's what I did

from postal.parser import parse_address
parse_address('1 rue saint roch 2B238 poggio-di-venaco', language='fr', country='fr')


Here's what I got

[('1', 'house_number'), ('rue saint roch 2b238', 'road'), ('poggio-di-venaco', 'city')]


Here's what I was expecting

[('1', 'house_number'), ('rue saint roch', 'road'), ('2b238', 'postcode'), ('poggio-di-venaco', 'city')]



prigaux commented 10 months ago

https://fr.wikipedia.org/wiki/Poggio-di-Venaco says the postcode is 20250. 2B238 seems to be the INSEE code?

albarrentine commented 9 months ago

Yes, I'm guessing that postcode format doesn't exist in the training data. (You can type .print_features in the address_parser CLI and then try an address to see what the model is doing and where it might get stuck.) Libpostal is not based on regexes, other than to split strings into words. Using 20250 works, for instance, because it is a common postcode format, and we also have some geographic context dictionaries which help identify postal codes from known geographic contexts (which probably include the 20250 version as well).

1 rue saint roch 20250 poggio-di-venaco

{
  "house_number": "1",
  "road": "rue saint roch",
  "postcode": "20250",
  "city": "poggio-di-venaco"
}

You can use a regex to extract/remove codes following that pattern and reparse the remainder; the address without the code will usually parse correctly, as in the example that follows. If you're sending the result to Elasticsearch, you can add the extracted postcode back in if needed (a postcode may be more selective than a city, etc.).

1 rue saint roch poggio-di-venaco

{
  "house_number": "1",
  "road": "rue saint roch",
  "city": "poggio-di-venaco"
}
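
The extract-and-reparse approach described above could be sketched in Python roughly as follows. This is only an illustration, not part of libpostal: the pattern (2A or 2B followed by three digits), the helper names split_code and parse_with_code, and the choice to label the extracted code as 'postcode' are all assumptions made for this example. The parser is passed in as an argument so the helpers do not depend on libpostal being installed.

```python
import re

# Assumption: the codes to pull out look like "2A" or "2B" followed by three
# digits, the form seen in this issue (e.g. 2B238). Adjust as needed.
CORSICA_CODE = re.compile(r"\b2[AB]\d{3}\b", re.IGNORECASE)


def split_code(address):
    """Return (code, remainder); code is None when nothing matches."""
    match = CORSICA_CODE.search(address)
    if match is None:
        return None, address
    remainder = address[:match.start()] + address[match.end():]
    # Collapse the double space left behind by the removal.
    return match.group(0).upper(), " ".join(remainder.split())


def parse_with_code(address, parse):
    """Parse the remainder with `parse`, then add the extracted code back in.

    `parse` is any callable returning (value, label) pairs, such as
    postal.parser.parse_address. The 'postcode' label is an assumption that
    mirrors the output the reporter expected.
    """
    code, remainder = split_code(address)
    components = list(parse(remainder))
    if code is not None:
        components.append((code, "postcode"))
    return components
```

With the Python bindings installed, this might be called as parse_with_code('1 rue saint roch 2B238 poggio-di-venaco', parse_address) after from postal.parser import parse_address. Addresses that contain no such code are passed to the parser unchanged.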