openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

Duplicate json fields in result #27

Open frodrigo opened 8 years ago

frodrigo commented 8 years ago
> 64000 Artix route de l'aéroport 64121

Result:

{
  "postcode": "64000",
  "house_number": "artix",
  "road": "route de l'aéroport",
  "postcode": "64121"
}

albarrentine commented 8 years ago

Thanks for reaching out. Duplicate fields are actually allowed in the address parser. The client output only looks roughly like JSON; I can change it to an array so that it is indeed valid.
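To illustrate why the repeated keys matter to anyone consuming this output, here is a minimal Python sketch using only the standard `json` module (it is not part of libpostal or its client). A plain `json.loads` silently keeps the last value for a repeated key, while `object_pairs_hook` preserves every label/value pair in order, which is roughly what an array-based output would provide.

```python
import json

# Client output from the report above, with "postcode" appearing twice.
raw = '''{
  "postcode": "64000",
  "house_number": "artix",
  "road": "route de l'aéroport",
  "postcode": "64121"
}'''

# A plain dict keeps only the last value seen for a repeated key.
as_dict = json.loads(raw)

# object_pairs_hook receives every (key, value) pair in document order,
# so returning the list unchanged preserves the duplicates.
as_pairs = json.loads(raw, object_pairs_hook=lambda pairs: pairs)

print(as_dict["postcode"])
print(as_pairs)
```

The first postcode is lost in `as_dict` but still present in `as_pairs`, so a list of (value, label) pairs is the safer shape for parser results.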

Incidentally, what is the correct parse for this address? Does Artix mean the French commune Artix or is it part of the street name? I couldn't find that address in either Google Maps or Nominatim. Is it real?

Note that the parser gets 98.9% accuracy, not 100%, so it may still get some cases wrong. I'm not a native speaker and am always curious to learn more, but from what I understand, the typical way to write a French address would be more like:

> 64000 route de l'aéroport 64121 Artix

Result:

{
  "house_number": "64000",
  "road": "route de l'aéroport",
  "postcode": "64121",
  "city": "artix"
}

Which libpostal parses correctly. In that instance, even though 64000 could be a postal code (and is even a valid postal code in France), its position in the address next to a word like "route" helps disambiguate that it is in fact the house number. Whereas when it is seen next to a city name like Artix, it will be more likely to be a postal code.

Libpostal is currently trained on the more standard French address formats. If listing the city before the street name is a reasonably common alternative form, we should add it to the address-formatting repo, which libpostal uses to generate training addresses from OSM.

Let me know what you think.

frodrigo commented 8 years ago

Yes, the sample address is ill-formed, but edge cases are still interesting.

I played a bit with libpostal. It works fine with simple, well-formatted addresses but fails on complicated cases. I will open other issues about this.

albarrentine commented 8 years ago

Ok, just wanted to make sure I understood what the correct answer should be.

We currently train on OpenStreetMap addresses, which are relatively simple (no apartment numbers, buildings, blocks, intersections, etc.), and the formats used to reconstruct the address strings use the standard structure per-country. We train on a few different permutations of each address with certain components dropped (so the parser has examples of just postcode/city or just city, etc.) but always using the standard order.
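As a rough illustration of the idea described above (not libpostal's actual training code), the following Python sketch builds several training strings from one address by dropping components while always keeping the standard order. The `FR_ORDER` list, the `training_variants` helper, and the chosen subsets are all hypothetical names introduced for this example.

```python
# Hypothetical standard component order for a French address.
FR_ORDER = ["house_number", "road", "postcode", "city"]

def training_variants(components, keep_sets, order=FR_ORDER):
    """Build one labeled sequence per subset of components,
    always emitting the kept components in the standard order."""
    variants = []
    for keep in keep_sets:
        labeled = [(components[k], k) for k in order
                   if k in keep and k in components]
        if labeled:
            variants.append(labeled)
    return variants

address = {
    "house_number": "26",
    "road": "rue verte",
    "postcode": "06160",
    "city": "juan les pins",
}
# Full address, postcode/city only, and city only.
keep_sets = [set(FR_ORDER), {"postcode", "city"}, {"city"}]

variants = training_variants(address, keep_sets)
for variant in variants:
    text = " ".join(value for value, _ in variant)
    labels = [label for _, label in variant]
    print(text, "->", labels)
```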

I suppose we could randomly scramble the order sometimes so the model doesn't rely too much on structural features (the classification of "Artix" as house_number in the first example is likely due to the fact that in all the standard French addresses, the word before "route" is most likely a house number).

I'll be making some updates to the training data soon for the Pelias geocoder. Will update this issue next time the model is pushed to see how it performs on this case.

albarrentine commented 7 years ago

@frodrigo the 1.0 release still gets this parse wrong (I think), but it's closer to correct than the previous version:

> 64000 Artix route de l'aéroport 64121

Result:

{
  "postcode": "64000",
  "city": "artix",
  "road": "route de l'aéroport",
  "postcode": "64121"
}

I added some formatting exceptions to the training data so when libpostal is building its training examples, it can switch certain components at random to create different formats. In the case of France, I allow city to come between house_number and road, but it still has trouble with this parse.
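A minimal sketch of what such a formatting exception could look like, assuming a simple "move component X before component Y with some probability" rule. The names `FR_EXCEPTIONS` and `format_order`, and the rule format itself, are hypothetical and only meant to make the idea concrete; they are not taken from libpostal's training code.

```python
import random

FR_ORDER = ["house_number", "road", "postcode", "city"]

# Hypothetical exception: move "city" so it sits right before "road".
FR_EXCEPTIONS = [("city", "road")]

def format_order(order, exceptions, prob, rng):
    """Return the component order, applying each exception with
    probability `prob`. An exception (component, before) moves
    `component` so that it sits immediately before `before`."""
    result = list(order)  # never mutate the standard order
    for component, before in exceptions:
        if component in result and before in result and rng.random() < prob:
            result.remove(component)
            result.insert(result.index(before), component)
    return result

rng = random.Random(0)
# prob=1.0 always applies the exception, prob=0.0 never does.
print(format_order(FR_ORDER, FR_EXCEPTIONS, prob=1.0, rng=rng))
print(format_order(FR_ORDER, FR_EXCEPTIONS, prob=0.0, rng=rng))
```

With the exception applied, the order becomes house_number, city, road, postcode, i.e. the city lands between the house number and the road.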

We now import the entire cadastral data set for France and its territories from OpenAddresses, so almost every address in the country is seen by libpostal during training. However, I didn't find any addresses with a 5-digit house number in France. So I wanted to ask: are you certain that "64000" is a house number in this case? I suppose it could be a typo or something. In any case, if it is legitimately a house number, we should add more edge cases like that one to OSM so libpostal can pick them up for training, and/or I can try adding some random noise to the actual house numbers, for instance appending a digit some random proportion of the time so that there are at least some in the training data.
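The "append a digit some random proportion of the time" idea could be sketched as follows. This is only an illustration: the function name `noisy_house_number` and the 10% rate are assumptions made for the example, not values from libpostal.

```python
import random

def noisy_house_number(house_number, prob, rng):
    """With probability `prob`, append one random digit to a purely
    numeric house number; anything else is returned unchanged."""
    if house_number.isdigit() and rng.random() < prob:
        return house_number + str(rng.randrange(10))
    return house_number

rng = random.Random(42)
# Assumed noise rate of 10%, chosen only for illustration.
samples = [noisy_house_number("6400", prob=0.1, rng=rng) for _ in range(1000)]
longer = sum(1 for s in samples if len(s) == 5)
print(f"{longer} of {len(samples)} samples gained a digit")
```

Restricting the noise to purely numeric values leaves forms such as "26 bis" untouched.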

In any case, it's worth giving libpostal 1.0 a try. It works significantly better on a variety of more complicated French addresses. Currently not the "BÂTIMENT F" case, because it's not yet trained on that style and I'm still trying to figure out a method for consistently handling buildings in larger complexes, but it does have several new sub-building components including things like "ap 12", "et. 2", "rez-de-chaussée"/"rdc", "BP 234", etc. There's a config file that details some of the new fields we use to augment OSM in places where it's lacking, and contributions are welcome.

frodrigo commented 7 years ago

Here, 64000 is the postcode; 64121 doesn't look like a French postcode. There is a very high probability that a French postcode ends with a 0.

Putting the city between house_number and road is completely unnatural in French.

I will give libpostal another try with the addresses rejected by my local addok geocoder (https://github.com/addok/addok)

frodrigo commented 7 years ago

With the latest version, the parse looks as I expected:

> 64000 route de l'aéroport 64121 Artix
{
  "house_number": "64000",
  "road": "route de l'aéroport",
  "postcode": "64121",
  "city": "artix"
}

Nevertheless, I encountered duplicate house_number and road fields on a very well-formed French address:

> 26 rue verte 06160 juan les pins
{
  "house_number": "26",
  "road": "rue verte",
  "house_number": "06160",
  "road": "juan les pins"
}

albarrentine commented 7 years ago

Ah ok, so as far as I can tell, 64121 and 64000 are both valid postal codes, and both are near Artix. So I think that example should have two postcodes; it's just ill-formatted. Will revert the change that allows the city to come between house number and road.

For Juan-les-Pins, I grepped for it in the training data (which are now public; your input is welcome if there are any issues with French addresses), and the only examples I could find were of a road in Quebec, Canada. Juan-les-Pins exists in OSM as a suburb, and so does that postcode, but weirdly that point does not match the France country polygon in our polygon index. Points without a country are ignored, so essentially that place name doesn't exist as far as libpostal knows.

We build the country polygons with Shapely and no buffering. When moving to unbuffered polygons, I'd spot-checked a bunch of points that were close to borders and it didn't seem to have any problems, but maybe it does with coastlines. Using buffering of about 0.01, the France polygon does contain the point for Juan-les-Pins. So I suppose for reverse-geocoding to country, since that's necessary for addresses to be considered, it might make sense to use a "kitchen sink" approach, i.e. do an initial point-in-polygon check on the unbuffered polygons (for accuracy along tight borders) and then a check on the buffered version if there are no exact matches. I will look into this for the next training batch.
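The two-pass lookup could be sketched roughly as below. To keep the example self-contained it does not use Shapely: exact containment is a plain ray-casting test, and the "buffered" pass is approximated by accepting points within a given distance of a polygon's boundary. The function names, the toy square "countries", and the tie-breaking by nearest boundary are all assumptions made for illustration.

```python
import math

def point_in_polygon(point, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def distance_to_segment(point, a, b):
    """Euclidean distance from point to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = point, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Clamp the projection onto the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def distance_to_boundary(point, polygon):
    n = len(polygon)
    return min(distance_to_segment(point, polygon[i], polygon[(i + 1) % n])
               for i in range(n))

def find_country(point, countries, buffer=0.01):
    """Pass 1: exact containment in the unbuffered polygons.
    Pass 2: nearest polygon whose boundary is within `buffer`."""
    for name, polygon in countries.items():
        if point_in_polygon(point, polygon):
            return name
    best = None
    for name, polygon in countries.items():
        d = distance_to_boundary(point, polygon)
        if d <= buffer and (best is None or d < best[1]):
            best = (name, d)
    return best[0] if best else None

# Toy squares standing in for country polygons (not real geometry).
countries = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(2, 0), (3, 0), (3, 1), (2, 1)],
}
print(find_country((0.5, 0.5), countries))    # inside A
print(find_country((1.005, 0.5), countries))  # just outside A's edge
print(find_country((1.5, 0.5), countries))    # far from both
```

Running the exact pass first means points near a tight land border are still assigned by true containment, and only points that match nothing (such as a coastal point just outside a simplified coastline) fall back to the tolerance check.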