openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

incorrect parsing of Irish addresses with Eircodes #656

Open freyfogle opened 7 months ago

freyfogle commented 7 months ago

Hi!

I was checking out libpostal, and saw something that could be improved.


My country is

Ireland


Here's how I'm using libpostal

Parsing addresses


Here's what I did

Tried to parse Irish addresses including Eircodes (relatively new Irish postcode format)

Example: Riverside House, Doneraile, P51 KT93, Ireland


Here's what I got

{
   "city" : "kt93",
   "country" : "ireland",
   "house" : "riverside house doneraile p51"
}

Here's what I was expecting

{
   "city" : "doneraile",
   "country" : "ireland",
   "house" : "riverside house",
   "postcode" : "p51 kt93"
}



Here's what I think could be improved

Eircodes are relatively new and only now coming into common use, especially for deliveries. They are not yet widely found in OpenStreetMap. Still, the format is easy to identify and the parser should be able to recognize them.

albarrentine commented 7 months ago

Eircodes were just starting to roll out when the parser was initially trained, and there were very few examples available because most people were still using the old system.

In a future version I've thought about adding UK/Irish/Canadian and any other similar postcodes directly to the tokenizer, since they follow regular patterns that are unambiguous with other token types. The model could then treat each one as a single token and handle it with a handful of type features instead of one feature for every normalized postcode-word. That saves space as well: those postcodes don't require geographic context, so they could be removed from the postcode index, which is stored efficiently as a trie but still clocks in at about 500MB.

That change would muck with the weights and require retraining the parser, which is not planned for the very near future, though there's some rearchitecting going on in the background.

This style of postcode only partially benefits from the classic NLP features that are used such as word shapes/digit masks because those would normalize to something like ["pDD" "ktDD"]. With enough training data that can work even without observing every possible postcode, but the data would need to capture every pattern sans digits (for the UK/Canada there were also training examples built off of a somewhat exhaustive list that then gets normalized to word/digit shapes).
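To make the digit-mask point concrete, here is a minimal sketch of that kind of normalization. It is only an illustration: the `digit_mask` helper is hypothetical and is not libpostal's actual feature code.

```python
import re


def digit_mask(token):
    """Lowercase a token and replace every digit with the placeholder 'D'."""
    return re.sub(r"[0-9]", "D", token.lower())


# The two halves of an Eircode collapse to shapes that still carry the letters,
# so the model only generalizes across postcodes that share those letters.
shapes = [digit_mask(t) for t in "P51 KT93".split()]
print(shapes)  # ['pDD', 'ktDD']
```

Because the letters survive the masking, a different Eircode such as one starting with another routing-key letter produces a different shape, which is why the training data would need to cover every letter pattern.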

One workaround is just to extract/remove with regex before parsing since they do follow regular patterns.
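A rough sketch of that workaround in Python follows. The pattern is an assumption based on the commonly cited Eircode format (a three-character routing key, either a letter plus two digits or the special key D6W, followed by a four-character unique identifier), and `extract_eircode` is a hypothetical helper, not part of libpostal.

```python
import re

# Assumed Eircode pattern: routing key (letter + two digits, or D6W), an optional
# space or hyphen, then a four-character unique identifier from the allowed set.
EIRCODE_RE = re.compile(
    r"\b((?:[AC-FHKNPRTV-Y][0-9]{2}|D6W))[ -]?([0-9AC-FHKNPRTV-Y]{4})\b",
    re.IGNORECASE,
)


def extract_eircode(address):
    """Return (address_without_eircode, normalized_eircode_or_None)."""
    match = EIRCODE_RE.search(address)
    if match is None:
        return address, None
    eircode = f"{match.group(1)} {match.group(2)}".upper()
    remainder = address[:match.start()] + address[match.end():]
    # Collapse the doubled comma and extra whitespace left behind by the removal.
    remainder = re.sub(r"\s*,\s*(?:,\s*)+", ", ", remainder)
    remainder = re.sub(r"\s{2,}", " ", remainder).strip(" ,")
    return remainder, eircode


remainder, eircode = extract_eircode("Riverside House, Doneraile, P51 KT93, Ireland")
print(remainder)  # Riverside House, Doneraile, Ireland
print(eircode)    # P51 KT93
```

The remainder can then be passed to the parser as usual, and the extracted Eircode attached to the result as the `postcode` component. The pattern may produce false positives on other seven-character codes, so it is safest to apply it only to addresses known to be Irish.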

freyfogle commented 7 months ago

Yes, we arrived at exactly the workaround you describe. I just wanted to make sure you are aware that libpostal does not deal well with Eircodes.

Feel free to close the issue if you like