incorrect parsing of Irish addresses with Eircodes

freyfogle commented 7 months ago

Hi!

I was checking out libpostal, and saw something that could be improved.

My country is

Ireland

Here's how I'm using libpostal

Parsing addresses

Here's what I did

Tried to parse Irish addresses including Eircodes (relatively new Irish postcode format)

Example: Riverside House, Doneraile, P51 KT93, Ireland

Here's what I got

{
   "city" : "kt93",
   "country" : "ireland",
   "house" : "riverside house doneraile p51"
}

Here's what I was expecting

{
   "city" : "doneraile",
   "country" : "ireland",
   "house" : "riverside house",
   "postcode" : "p51 kt93"
}

For parsing issues, please answer "yes" or "no" to all that apply.

Does the input address exist in OpenStreetMap? no
Do all the toponyms exist in OSM (city, state, region names, etc.)? yes
If the address uses a rare/uncommon format, does changing the order of the fields yield the correct result? NA
If the address does not contain city, region, etc., does adding those fields to the input improve the result? removing the postcode leads to correct parsing
If the address contains apartment/floor/sub-building information or uncommon formatting, does removing that help? Is there any minimum form of the address that gets the right parse? NA

Here's what I think could be improved

Eircodes are relatively new and only now coming into common use, especially for deliveries. They are not yet widely found in OpenStreetMap. Still, the format is easy to identify and the parser should be able to recognize them.

albarrentine commented 7 months ago

Eircodes were just starting to roll out when the parser was initially trained, and there were very few examples available because most people were still using the old system. For a future version I've thought about adding UK/Irish/Canadian and other similar postcodes directly to the tokenizer, since they follow regular patterns that are unambiguous with other token types. The model could then treat each one as a single token and handle it with a handful of type features instead of one feature for every normalized postcode-word. That would save space as well: those postcodes don't require geographic context, so they could be removed from the postcode index, which is stored efficiently as a trie but still clocks in at about 500MB. However, it would muck with the weights and require retraining the parser, which is not planned for the very near future, though there's some rearchitecting going on in the background.

This style of postcode only partially benefits from the classic NLP features the parser uses, such as word shapes/digit masks, because those would normalize to something like ["pDD", "ktDD"]. With enough training data that can work even without observing every possible postcode, but the data would need to capture every pattern sans digits (for the UK/Canada there were also training examples built off of a somewhat exhaustive list, which then gets normalized to word/digit shapes).
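
To illustrate the digit-mask normalization mentioned above, here is a minimal Python sketch. The `digit_mask` helper is hypothetical and is not libpostal's actual feature code; it only shows the idea of lowercasing a token and replacing each digit with `D`, so the letters survive and every distinct letter pattern still has to appear in the training data.

```python
import re


def digit_mask(token: str) -> str:
    """Lowercase a token and replace every digit with 'D' (hypothetical helper)."""
    return re.sub(r"\d", "D", token.lower())


# The two halves of the Eircode from the example address
shapes = [digit_mask(t) for t in "P51 KT93".split()]
print(shapes)
```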

One workaround is to extract or remove Eircodes with a regex before parsing, since they do follow regular patterns.
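
A rough Python sketch of that workaround follows. The pattern is an assumption and deliberately loose: a three-character routing key (a letter plus two digits, or the special key D6W), an optional space, then four alphanumeric characters. It does not restrict the letters to those Eircodes actually use, so it may accept some strings that are not valid Eircodes. `split_eircode` is a hypothetical helper name, not part of libpostal.

```python
import re

# Assumed, deliberately loose Eircode pattern (not an official validator):
# routing key (letter + 2 digits, or D6W), optional space, 4 alphanumerics.
EIRCODE_RE = re.compile(r"\b([A-Z][0-9]{2}|D6W)[ ]?([0-9A-Z]{4})\b", re.IGNORECASE)


def split_eircode(address: str):
    """Return (address_without_eircode, eircode_or_None)."""
    match = EIRCODE_RE.search(address)
    if match is None:
        return address, None
    # Normalize to upper case with a single space between the two parts
    eircode = f"{match.group(1)} {match.group(2)}".upper()
    remainder = address[:match.start()] + address[match.end():]
    # Collapse the doubled comma and extra spaces left behind by the removal
    remainder = re.sub(r"\s*,\s*(?:,\s*)+", ", ", remainder)
    remainder = re.sub(r"\s{2,}", " ", remainder).strip(" ,")
    return remainder, eircode


rest, eircode = split_eircode("Riverside House, Doneraile, P51 KT93, Ireland")
print(rest)
print(eircode)
```

The remainder can then be passed to the parser, and the extracted Eircode attached to the result as the postcode afterwards.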

freyfogle commented 7 months ago

Yes, we arrived at exactly the workaround you describe. I just wanted to make sure you are aware that libpostal does not deal well with Eircodes.

Feel free to close the issue if you like

openvenues / libpostal