openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

Incorrect parsing of Scottish address due to Road oddity #329

Open jlawrence-yellostudio opened 6 years ago

jlawrence-yellostudio commented 6 years ago

Hi,

Great library so far, thanks for the efforts. I've noticed an issue though:

Example:

County Road, Highland, John O'Groats, KW1 4YR

Results in incorrect parsing, however:

County Rd, Highland, John O'Groats, KW1 4YR

Results in correct parsing.

[Screenshot, 2018-03-15 11:46:21]

However, this is not the case for every instance of "Road"; so far this is the only one I've found.

I'm tempted to work around this by replacing "Road" with "Rd", but some addresses may contain "Road" in a component other than the street name, so the replacement could cause further issues.

albarrentine commented 6 years ago

It looks like most of the components in this address don't exist in OpenStreetMap, which is our primary source of training data for Scotland and the rest of the UK. In general the machine learning model doesn't need to see every single address in the world to parse them correctly i.e. it can learn patterns, but in this one case it's going to conflict a bit with US addresses, where "County Road" almost always occurs before another street name token e.g. "County Road 123A." As such, the resulting model's going to have a strong weight for the token following "County Road" to be part of the street name, and would need to be trained on specific patterns where that's not the case. There doesn't appear to be an OSM way named "County Road" in that part of Scotland either, so we're probably not able to get any examples in our training set for that case. One solution would be to add that address and maybe an alt_name tag on the road itself in OSM. Alternatively, if there's some convention where all roads of a certain type should occasionally be aliased to "County Road" instead, we can make a rule for that on the libpostal side for the next round of training.

The "County Rd" piece is something of an anomaly. We have many synonyms listed for "County Road", such as "CR" and "C.R", again mostly with the US in mind, but we don't have that particular abbreviation listed, so the parser treats it as two tokens, "county" and "rd", rather than one contiguous token, "county rd", and that affects the weights it uses. I wouldn't count on that always being the case, as we'll probably add that abbreviation at some point.
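
As a rough illustration of that tokenization difference, here is a toy sketch in Python. It is not libpostal's actual code: the one-entry phrase set and the function names `tokenize` and `merge_phrases` are made up for this example, and the real dictionaries are far larger.

```python
# Hypothetical phrase dictionary: "county road" is listed, "county rd" is not.
PHRASES = {("county", "road")}


def tokenize(text):
    """Lowercase the text, drop commas, and split on whitespace."""
    return text.lower().replace(",", " ").split()


def merge_phrases(tokens, phrases=PHRASES):
    """Greedy longest match: collapse known multi-word phrases into one token."""
    max_len = max(len(p) for p in phrases)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest possible phrase first, down to two words.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrases:
                merged.append(" ".join(candidate))
                i += n
                break
        else:
            # No phrase starts here, so keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


print(merge_phrases(tokenize("County Road, Highland")))  # ['county road', 'highland']
print(merge_phrases(tokenize("County Rd, Highland")))    # ['county', 'rd', 'highland']
```

In the first case the model sees one token with its own learned weights; in the second it sees two ordinary tokens, which is why the two spellings can parse differently.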

Also, can you explain the "Highland, John O'Groats" formulation? Haven't seen it written that way before, with the council and then the village after. Is that standard in Scotland, or the UK generally, and how often is it written that way?

jlawrence-yellostudio commented 6 years ago

Hi, Thanks for your response.

I'm surprised the parser doesn't see the word "Road" and treat it as the end of the road name. In 99% of cases here, I believe, "Road" or "Street" would be the last word, which is why I was surprised it parsed the street as "County Road Highland"; a road name like that would be unusual here.

With respect to OSM, well, I've just looked at the two maps side by side (I'm not from the area), and it does appear that one of them is incorrect. In OSM the A99 ends over the road that I believe Google treats as County Road. There is no label on the road to tell me otherwise, however; I'm making assumptions based on some of the local business addresses that use it.

With respect to the formulation: I don't think it is actually correct. When testing the parser, I searched Google for test addresses to parse. I believe it came in the order I've provided in one instance, but I can find numerous instances of it in reverse. In the majority of cases it should be city/town and then county, so I would disregard the order in my example. Interestingly, though, the website of the hotel at that address does list it as 'Highlands, John O'Groats', but I think this is either an anomaly or a mistake.

Please see the attached images for the differences between Google Maps and OSM.

[Screenshots, 2018-04-03 18:53:16 and 18:53:34]

albarrentine commented 6 years ago

That's true when the ending is simply "Road" or "Rd", but again in this case, "County Road" is a specific, different road type in the US which is usually expected to be on the left of the name, not the right (it's like a numbered highway similar to "A99" or "M40" in the UK, but we might see "CR99" or "County Road 99"). Because it's so common in the US, that phrase is listed separately in our dictionaries for English. The parser sees known phrases as single words, so "County Road" is seen not merely as a combination of "County" and "Road," but a different word altogether as far as the model is concerned, with its own weights/parameters separate from "County" in any other context or "Road" in any other context.

For any other street name ending in "Road", the model will have learned what you'd expect it to learn, but this happens to be one example which is more difficult to reconcile with everything else it's already learned about US English. Because the phrase "County Road" is so common in the US training data, the model's going to have a strong weight on structure alone that the word after "County Road" should be part of the road name, unless it has sufficient examples of when that's not the case to counterbalance its default assumption.

What we may be able to do in the next version is dynamically compile a list of road types that are usually expected to be followed by numerics. That might make the parser perform a little better on transitions to and from road names.
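
One way such a list might be compiled is sketched below in Python. This is only a guess at the idea, not how libpostal does or will do it: the street names, the candidate road types, the 0.5 threshold, and the function names are all hypothetical.

```python
import re
from collections import Counter

# Toy street names standing in for training data (hypothetical examples).
STREETS = [
    "County Road 99",
    "County Road 123A",
    "County Road 7",
    "County Road",
    "State Route 12",
    "State Route 4",
    "Highway 61",
    "Church Road",
    "Station Road",
    "High Street",
]

# Candidate road-type phrases to check (hypothetical list).
ROAD_TYPES = ["county road", "state route", "highway", "road", "street"]

# A numeric token: digits with an optional letter suffix, e.g. "99" or "123a".
NUMERIC = re.compile(r"^\d+[a-z]?$")


def numeric_follow_rates(streets, road_types):
    """Fraction of each road type's occurrences followed by a numeric token."""
    seen, followed = Counter(), Counter()
    # Try longer phrases first so "county road" wins over plain "road".
    ordered = sorted(road_types, key=lambda p: -len(p.split()))
    for street in streets:
        tokens = street.lower().split()
        i = 0
        while i < len(tokens):
            for phrase in ordered:
                words = phrase.split()
                if tokens[i:i + len(words)] == words:
                    seen[phrase] += 1
                    nxt = i + len(words)
                    if nxt < len(tokens) and NUMERIC.match(tokens[nxt]):
                        followed[phrase] += 1
                    i = nxt
                    break
            else:
                i += 1
    return {p: followed[p] / seen[p] for p in seen}


def numeric_road_types(streets, road_types, threshold=0.5):
    """Road types followed by a numeric token more often than the threshold."""
    rates = numeric_follow_rates(streets, road_types)
    return sorted(p for p, rate in rates.items() if rate > threshold)


print(numeric_road_types(STREETS, ROAD_TYPES))
# ['county road', 'highway', 'state route']
```

On this toy data "county road" is followed by a number in 3 of 4 occurrences, so it lands in the list, while plain "road" and "street" never are and stay out.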