openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

PO Boxes France #172

Closed sragneau closed 7 years ago

sragneau commented 7 years ago

Dear all,

Is there a way to get PO boxes parsed in French-language addresses? Below are two cases with my expected results.

At the moment, PO boxes are detected as a road with a house number.

Thanks in advance for your response.


Address: BP 287 86007 POITIERS CEDEX

Expected result:
{
po_boxes: BP 287
postcode: 86007
city: POITIERS CEDEX
}

Address: BP 374 39 RUE DE BEAULIEU 86009 POITIERS CEDEX

Expected result:
{
po_boxes: BP 374
house_number: 39
road: RUE DE BEAULIEU
postcode: 86009
city: POITIERS CEDEX
}

albarrentine commented 7 years ago

Hey @sragneau, have you upgraded to 1.0? It was just released last night, and it's trained on several French variants (BP, boîte postale, B.P., plus generated "case postale" for Switzerland).

> BP 374 39 RUE DE BEAULIEU 86009 POITIERS

Result:

{
  "po_box": "bp 374",
  "house_number": "39",
  "road": "rue de beaulieu",
  "postcode": "86009",
  "city": "poitiers"
}

It's not trained on very many examples with CEDEX though. Feel free to download the training data and grep for it, but as far as I can tell, this is how CEDEX is generally used in OSM: https://www.openstreetmap.org/way/37022786

In general, for PO boxes and other fields (apartment numbers, etc.) that are rare/non-existent in OSM, we have to randomly generate the phrases we want libpostal to be able to parse, and append those components to the actual addresses using our new address configs (French example).
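To make the idea of generating rare components concrete, here is a minimal sketch in Python. It is not libpostal's training code: the function name `add_po_box`, the phrase list, and the tagged-tuple format are all assumptions made for illustration. It only shows the general technique of randomly prepending a generated PO box phrase to an already tagged address.

```python
import random

# Hypothetical phrase list; these are the French variants mentioned in this thread.
PO_BOX_PHRASES = ["BP", "B.P.", "Boîte Postale"]


def add_po_box(tagged_address, rng, probability=0.3):
    """Return a copy of a tagged address, sometimes with a random po_box prepended.

    tagged_address is a list of (text, label) tuples (an assumed format).
    """
    if rng.random() >= probability:
        return list(tagged_address)
    phrase = rng.choice(PO_BOX_PHRASES)
    number = rng.randint(1, 9999)
    return [(f"{phrase} {number}", "po_box")] + list(tagged_address)


rng = random.Random(0)  # seeded so runs are repeatable
base = [
    ("39", "house_number"),
    ("rue de beaulieu", "road"),
    ("86009", "postcode"),
    ("poitiers", "city"),
]
augmented = add_po_box(base, rng, probability=1.0)
unchanged = add_po_box(base, rng, probability=0.0)
```

With `probability=1.0` the PO box is always added, and with `0.0` the address comes back unchanged; a real pipeline would presumably read phrases and probabilities from the per-language config.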

It wasn't clear to me how CEDEX should be parsed, so I wanted to wait until I could ask some French speakers. I suppose we could add "CEDEX" randomly to some of the training addresses in France. The main question is: should it be part of the city or split into a separate component, like a second postcode? Also, in OSM it seems that sometimes the arrondissement or another number for the post office is added as well, no?

mkaranta commented 7 years ago

Maybe I can help here.

In the raw web data we encounter, "CEDEX" shows up all over the place. Sometimes it's a suffix on the address, sometimes after the city, and sometimes directly after the PO box number.

It would definitely improve things to add some more "CEDEX" to the training sets. Classifying it as a PO Box would currently probably be the most correct option.

sragneau commented 7 years ago

Hi mkaranta and thatdatabaseguy. No, I haven't tested the upgrade yet. I'll do that right now. It would be great news for me if PO boxes are implemented.

I'm French, so I can give you some explanation.

On the 5th line of a postal address we can find several kinds of acronyms:

First, we have PO boxes, with the acronym BP followed by a number (BP 374). We also have special mail, with the acronym CS (Course Spéciale) followed by a number (CS 345). This tells the French post office that the mail must be sorted at specific hours. Finally, there is TSA (Tri Service Arrivée), also followed by a number, which indicates a specific sort for a company.

Only one of these acronyms should appear in a given address, because they all use the same line. So you can classify all of them in the po_boxes category.

CEDEX is different. CEDEX must be placed right after the city. It can be followed by a number, but that is optional. CEDEX lets a company use the name of a city even if the company is only located near that city. Most of them are written as PARIS CEDEX because companies prefer to indicate a famous city. In that case CEDEX is written and the postcode changes: normally the Paris postcode is 75000, but the postcode for PARIS CEDEX 05 is 75231.
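As a rough illustration of the rules described above, here is a small Python sketch using regular expressions. The patterns and the function names `find_special_delivery` and `split_cedex` are assumptions for illustration only; libpostal itself uses a trained statistical model rather than rules like these, and the sketch does not handle spelled-out or dotted variants such as "B.P.".

```python
import re

# Assumed patterns based on the description above; not libpostal's actual logic.
# BP, CS or TSA followed by a number.
SPECIAL_DELIVERY = re.compile(r"\b(BP|CS|TSA)\s+(\d+)\b", re.IGNORECASE)
# CEDEX at the end of the address, optionally followed by a number.
CEDEX = re.compile(r"\bCEDEX(?:\s+(\d+))?\s*$", re.IGNORECASE)


def find_special_delivery(address):
    """Return the first 'BP 374'-style phrase found, or None."""
    m = SPECIAL_DELIVERY.search(address)
    return f"{m.group(1).upper()} {m.group(2)}" if m else None


def split_cedex(address):
    """Return (address without the CEDEX suffix, CEDEX suffix or None)."""
    m = CEDEX.search(address)
    if not m:
        return address, None
    return address[:m.start()].rstrip(), m.group(0).strip()


po_box = find_special_delivery("BP 287 86007 POITIERS CEDEX")
rest, cedex = split_cedex("75231 PARIS CEDEX 05")
```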

I hope my explanation was clear :)

albarrentine commented 7 years ago

Thanks @sragneau.

The "CS" and "TSA" variants can be added to the configs pretty easily.

I think I'm clear on what CEDEX means. The question is: how should it be parsed? For the Paris example, would it be better to have this?

75231/postcode Paris/city CEDEX/postcode 05/postcode

or this?

75231/postcode Paris/city CEDEX/city 05/city

I think I prefer the first version.
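To compare what each labeling would give a caller, here is a short Python sketch that merges tagged tokens into one string per label. The helper `components` is hypothetical, and merging every token that shares a label (even when the tokens are not adjacent) is a simplifying assumption, not a description of how libpostal assembles its output.

```python
def components(tagged):
    """Join tokens that share a label, keeping first-seen label order."""
    grouped = {}
    for token, label in tagged:
        grouped.setdefault(label, []).append(token)
    return {label: " ".join(tokens) for label, tokens in grouped.items()}


# First version: CEDEX and its number are labeled as postcode.
option_1 = [("75231", "postcode"), ("Paris", "city"),
            ("CEDEX", "postcode"), ("05", "postcode")]
# Second version: CEDEX and its number are labeled as city.
option_2 = [("75231", "postcode"), ("Paris", "city"),
            ("CEDEX", "city"), ("05", "city")]

result_1 = components(option_1)
result_2 = components(option_2)
```

Under this simplification, the first version yields a postcode of "75231 CEDEX 05" and a city of "Paris", while the second yields a postcode of "75231" and a city of "Paris CEDEX 05".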

sragneau commented 7 years ago

The 2nd one works for me, because 75231 goes on the postcode line and PARIS CEDEX 05 goes on the city line.

sragneau commented 7 years ago

OK good, I finally managed to test the new version in my container. French po_boxes work. Thank you.

Now only CEDEX still has to be implemented. I'm closing this issue.