openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

UK Postal address issue with postcodes #244

Open MajorChump opened 6 years ago

MajorChump commented 6 years ago

I'm running into an issue when a council name is used in place of the correct county. For example:

58 Gloucester Crescent Hindley Wigan WN2 4DQ

Wigan is the council; strictly it should be Greater Manchester, but in practice there's nothing wrong with using Wigan. With Wigan present, the postcode is incorrectly labelled as house, which makes a complete mess of the address. Result:

Array (
    [0] => Array ( [label] => house_number [value] => 58 )
    [1] => Array ( [label] => road [value] => gloucester crescent hindley )
    [2] => Array ( [label] => city [value] => wigan )
    [3] => Array ( [label] => house [value] => wn2 4dq )
)

However, the same address without Wigan:

58 Gloucester Crescent Hindley WN2 4DQ

correctly results in:

Array (
    [0] => Array ( [label] => house_number [value] => 58 )
    [1] => Array ( [label] => road [value] => gloucester crescent )
    [2] => Array ( [label] => city [value] => hindley )
    [3] => Array ( [label] => postcode [value] => wn2 4dq )
)

Will add more examples shortly.

MajorChump commented 6 years ago

Found a nice workaround: adding United Kingdom to the end of the string resolves the issue.

albarrentine commented 6 years ago

Hm, so Wigan exists in OSM, but would be mapped to city under libpostal's current mapping rules for the UK. The UK of course has more exceptions than any other country in our configs :-).

Under the current rules, admin_level=8 is mapped to city in the UK unless there is a "designation" tag that is one of {non_metropolitan_county, non_metropolitan_district, unitary_authority}, in which case it's mapped to state_district, and there are exceptions for individual cities. For Wigan, the designation is "metropolitan_district".

I wasn't really sure how to map metropolitan_district because, looking at the 30 instances of this tag in OSM, it is used for Birmingham, Manchester, Liverpool, Sheffield, Leeds, and Newcastle as well (maybe a few others), so mapping it would have the adverse effect of making those cities state_district along with the actual metropolitan boroughs.

What I can do is map designation=metropolitan_district to state_district by default and just make exceptions for the cases that are really city boundaries.
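A minimal sketch of the proposed rule in Python, for illustration only. The function name, the set names, and the exception list are assumptions made up for this example (the exceptions are simply the cities named above); libpostal's actual UK config is not structured this way.

```python
# Hypothetical sketch of the proposed mapping for UK admin_level=8 boundaries.
# None of these names come from libpostal's real configuration.

# Designations that map to state_district instead of city.
STATE_DISTRICT_DESIGNATIONS = {
    "non_metropolitan_county",
    "non_metropolitan_district",
    "unitary_authority",
    "metropolitan_district",  # the proposed addition
}

# Assumed exceptions: boundaries whose designation says "district"
# but which are really city boundaries.
CITY_EXCEPTIONS = {
    "Birmingham", "Manchester", "Liverpool",
    "Sheffield", "Leeds", "Newcastle",
}


def uk_admin_level_8_label(name, designation=None):
    """Return the label for a UK admin_level=8 boundary."""
    if name in CITY_EXCEPTIONS:
        return "city"
    if designation in STATE_DISTRICT_DESIGNATIONS:
        return "state_district"
    return "city"  # default for admin_level=8 in the UK
```

Under this sketch, Wigan (designation=metropolitan_district) becomes state_district, while Manchester stays city because of the exception list.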

There are a couple of other things I'm doing for v1.1 around allowing more verbose UK place names (locality + postal town combos being one of the big ones). Can add this in to the release as well.

albarrentine commented 6 years ago

There might be another issue here on the parser/feature extraction side. The reason the postcode is parsed correctly when you add country name is that in our machine learning model there's a different feature (variable) that fires if a postcode is seen in the context of a valid admin area (11216 could be a house number in California, but if it's seen next to a token like "Brooklyn" or "NY" it's definitely a postcode). If a known postcode is seen without supporting context, the model will be essentially operating without the most important variable in predicting the postcode label.

For most countries that use digit-only postcodes, this is really helpful in disambiguating between house numbers and postcodes, but for the UK and Canada, postcodes are so distinctive that they shouldn't need any other supporting context.

Reminder to self: it might also be helpful to add a feature that fires for any postcode whose format is unambiguous, whether or not it has context. That could only help in the UK/Canada without hurting performance anywhere else.
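To illustrate the idea of a context-free postcode feature, here is a rough Python sketch. The feature name, the function, and both regular expressions are assumptions for this example; the patterns are simplified approximations of the UK and Canadian formats, and this is not how libpostal's feature extraction is implemented.

```python
import re

# Simplified, assumed patterns; real postcode formats have more rules.
UK_POSTCODE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}$", re.IGNORECASE)
CA_POSTCODE = re.compile(r"^[A-Z]\d[A-Z]\s?\d[A-Z]\d$", re.IGNORECASE)


def format_feature(phrase):
    """Return a (hypothetical) feature name if the phrase looks like an
    unambiguous postcode on its own, otherwise None."""
    phrase = phrase.strip()
    if UK_POSTCODE.match(phrase) or CA_POSTCODE.match(phrase):
        return "postcode unambiguous format"
    return None
```

A digit-only token such as 11216 produces no feature here, so it would still depend on the existing context features, while "wn2 4dq" fires regardless of which place names surround it.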

MajorChump commented 6 years ago

Yeah, there's definitely an issue around this. Wigan is a city, and so is Hindley, but Hindley falls within the Wigan Borough, which is probably the best way to describe it. With them both being cities (towns), shouldn't they map identically?

I would have expected 58 Gloucester Crescent Wigan WN2 4DQ to parse the same as 58 Gloucester Crescent Hindley WN2 4DQ, but the Wigan version parses the postcode into house, while the Hindley one is correct.

albarrentine commented 6 years ago

Looking at the training data the only examples of that particular postcode seem to come from our GeoPlanet data. In GeoPlanet, that postal code is associated with Hindley (so Hindley is a valid context), and Hindley is parented by "Wigan Metropolitan Borough", as a state_district. "Metropolitan Borough" is not one of the boundary name suffixes we use to normalize names, which I can add to the list, but in general that means that "Wigan" by itself wouldn't be valid context for that postcode under the current model.

With the proposed changes to feature extraction, none of that would really be an issue in 1.1.

MajorChump commented 6 years ago

That makes sense. I thought that would be the case, but I tried the following and it parsed fine, which made me think otherwise.

58 Gloucester Crescent London WN2 4DQ

Anyhow agreed, sounds like you understand what needs to be done for 1.1 and I can work around for now. Let me know if you need anything UK related.

Thanks!

albarrentine commented 6 years ago

There are other features in the model that can help besides the postcode context, such as structural features. For example, previous word=London and previous tag=city may also be a reliable indicator that the next word is a postcode if there's no other strong evidence from the word itself, and a large city like London might have a higher weight for that than smaller cities, enough to overpower the intercept/bias feature.

You can see the features used by libpostal at each word using the .print_features command in the address_parser client. It's kind of like logging for the parser model and can be helpful in debugging postcode issues (particularly the "postcode have context" and "postcode no context" features).

May very well need some UK geography/address help in the future. Used to live there and have been to many of these places but the addresses are still among the more edge-case-y in the world.

MajorChump commented 6 years ago

Not sure if it helps, but I'll keep adding examples of other UK postcode-related issues to this ticket. I think 1.1 will fix them, and they'll give you some currently bad examples to run through when testing.

Seen this today: Unit 105, Screenworks, 22 Highbury Grove, London, N5 2EF, United Kingdom. Both London and 2EF are parsed into city, and 'N5' becomes the postcode.

albarrentine commented 6 years ago

Couldn't find that particular postcode in the training data. In GeoPlanet there's "N5 2EA", "N5 2EG", and "N5 2EH".

I should also say that in 1.1 there's another form of token normalization which, in addition to the normalizations we already do (normalizing all digits to "D"), will normalize all letters in a numeric token to "X" so tokens like "XD DXX" can share statistical strength without necessarily needing to see every postcode in training.
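As a rough illustration of that normalization (a Python sketch with a made-up function name, not the actual 1.1 code), digits become "D" and letters inside a numeric token become "X", while purely alphabetic tokens are left alone:

```python
def normalize_numeric_token(token):
    """Map digits to 'D' and letters to 'X', but only for tokens that
    contain at least one digit. Other characters are kept as-is."""
    if not any(ch.isdigit() for ch in token):
        return token  # e.g. "london" is not a numeric token
    return "".join(
        "D" if ch.isdigit() else "X" if ch.isalpha() else ch
        for ch in token
    )


def normalize_phrase(phrase):
    """Apply the token normalization to each whitespace-separated token."""
    return " ".join(normalize_numeric_token(t) for t in phrase.split())
```

With this sketch, normalize_phrase("N5 2EF") and normalize_phrase("N5 2EA") both give "XD DXX", so an unseen postcode shares a pattern with ones that were in training.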

MajorChump commented 6 years ago

@thatdatabaseguy how is 1.1 coming? I have an internal application I can use it on. What's its current state? Is it stable enough to use?

albarrentine commented 6 years ago

Not yet. OSM has grown large enough that the 1.1 training data can no longer be built on a single machine in-memory, so I'm in the process of moving it to a Spark implementation. I'm also working on multiple things other than libpostal at the moment, so no ETA. Libpostal 1.0 can parse just about anything in the UK if you can afford a few minor mistakes (GeoPlanet contains almost 2 million UK postcodes, so the 2 examples above simply happen to be postcodes that were neither in GeoPlanet nor OSM). If not, use a regex to extract/remove the postal code, and then use libpostal on the remainder of the address.
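A minimal sketch of that regex workaround in Python. The pattern is a simplified approximation of the UK postcode format and the function name is made up for this example, so treat it as a starting point rather than a complete validator. The remainder it returns is what you would then pass to libpostal (for instance via the Python bindings' parse_address, if you use them).

```python
import re

# Simplified, assumed pattern: outward code (e.g. "WN2") + inward code ("4DQ").
UK_POSTCODE_RE = re.compile(
    r"\b([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})\b", re.IGNORECASE
)


def split_uk_postcode(address):
    """Return (remainder, postcode). postcode is None if nothing matched.
    If several candidates match, the last one in the string is used."""
    match = None
    for match in UK_POSTCODE_RE.finditer(address):
        pass  # keep the last match
    if match is None:
        return address.strip(), None

    postcode = f"{match.group(1)} {match.group(2)}".upper()
    remainder = address[:match.start()] + " " + address[match.end():]
    # Tidy up the separators left behind by the removed postcode.
    remainder = re.sub(r"\s*,\s*,\s*", ", ", remainder)
    remainder = re.sub(r"\s+", " ", remainder).strip(" ,")
    return remainder, postcode


if __name__ == "__main__":
    for addr in (
        "58 Gloucester Crescent Hindley Wigan WN2 4DQ",
        "Unit 105, Screenworks, 22 Highbury Grove, London, N5 2EF, United Kingdom",
    ):
        print(split_uk_postcode(addr))
```

The postcode is set aside and can be re-attached with the postcode label after libpostal has parsed the rest of the address.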