openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

Reverse geocoding to nearest street in training data #265

Open MajorChump opened 7 years ago

MajorChump commented 7 years ago

I've done as much debugging as I can on this one. The data appears in OSM, with house names linked to a postcode (no road in OSM). If a road name is in the address string, the parser does not identify the house correctly and everything ends up in road instead. A perfect example is below:

Fernlea, Leigh Street, Leigh Upon Mendip, Radstock, BA3 5QQ,

{ "road": "fernlea leigh street", "city": "leigh upon mendip radstock", "postcode": "ba3 5qq", "state": "england" }

If the road name is removed:

Fernlea, Leigh Upon Mendip, Radstock, BA3 5QQ

{ "house": "fernlea", "city": "leigh upon mendip radstock", "postcode": "ba3 5qq", "state": "england" }

albarrentine commented 7 years ago

That's probably not a case that can be fixed very easily. Remember that even with the millions of examples from OSM, etc. libpostal still has an error rate of 0.5% on held-out (not seen during training) examples from its own corpus, and road vs. house is the most frequent class of mistakes. These cases will improve as OSM improves, and it seems like almost none of the houses on that street include the addr:street tag, so it might be worth adding tags/addresses to OSM whenever you see a mistake. Every edit will eventually get incorporated into our training data.

I'll also note that this doesn't currently happen in all cases involving a house followed by a road. For example, if you changed the street name here to something like "High Street", where the first word is more commonly associated with the beginning of a street name, libpostal would get the above case correct. "Leigh" is a little ambiguous, as it is also commonly a surname.

One way we could (maybe, no promises) improve performance in these cases would be to reverse geocode everything that doesn't have a street tag to the OSM road network (ways), similar to what Nominatim does.

As I suggested in #255, if you know a priori that most of your data will have commas, one way to improve parsing without waiting on any release is to split the string at the first comma and parse the two parts separately: first everything up to the first comma (could be house_number + road, could be house), which libpostal should almost always be able to identify, and then the remainder of the string.
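
The comma heuristic above can be sketched roughly as follows. This is a minimal illustration, not part of libpostal: the helper name `parse_split_at_first_comma` is hypothetical, and the parser is passed in as a callable that is assumed to return a list of (value, label) pairs.

```python
from typing import Callable, List, Tuple

# Assumption of this sketch: a parser is any callable that maps an address
# string to a list of (value, label) pairs.
Parser = Callable[[str], List[Tuple[str, str]]]


def parse_split_at_first_comma(address: str, parse: Parser) -> List[Tuple[str, str]]:
    """Parse the text before the first comma and the remainder separately.

    Hypothetical helper: it only splits the input and concatenates the two
    results, without reconciling labels between the two halves.
    """
    head, sep, tail = address.partition(",")
    head, tail = head.strip(), tail.strip()
    if not sep or not head or not tail:
        # No comma, or nothing on one side of it: fall back to a single parse.
        return parse(address)
    # First part: house_number + road, or a house name.
    # Second part: the rest of the address (road, city, postcode, ...).
    return parse(head) + parse(tail)
```

With the Python binding installed, something like `parse_split_at_first_comma("Fernlea, Leigh Street, Leigh Upon Mendip, Radstock, BA3 5QQ", parse_address)` using `parse_address` from `postal.parser` should work, assuming that function returns (value, label) pairs. Whether the first part comes back as house or road still depends on the model.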

MajorChump commented 7 years ago

I see. I don't have a great understanding of the internals of libpostal. I've been having a look over it, but it's highly complex and I'm not a C developer, so the learning curve is steep.

It seems to me that, for UK addresses, the most logical solution relies on postcode and street being a one-to-one relationship. The parser could determine the postcode first and use the training data to match it against house names in the string, which are typically at the start. The string could then be simplified down to "Leigh Street Leigh Upon Mendip Radstock" and parsed normally, which should result in:

{ "house": "fernlea", "road": "leigh street", "city": "leigh upon mendip radstock", "postcode": "ba3 5qq", "state": "england" }

albarrentine commented 7 years ago

I'd say the main complexity in libpostal is not necessarily the use of the C language but is related to the fact that machine learning is a different animal than writing deterministic code like a regex, which is what most developers would be accustomed to in this scenario. In machine learning, when we evaluate a particular model/algorithm, we usually work with aggregate statistics (accuracy, F1 score, etc.) rather than individual test cases. I recognize that developers have varying degrees of familiarity with machine learning, so I try to at least offer an explanation for why an individual test case might not perform as expected, and if possible, to fix the classes of examples where libpostal is not performing well, but in some cases it's not always immediately apparent what can be done.

The major benefit of machine learning is that instead of me having to think of and write down every possible address parsing rule as code, libpostal essentially reads a billion examples of how various addresses should be parsed around the world and writes its own program, with various weights on different word and tag combinations. (You can see the input variables it uses for each token in the string by typing .print_features into the parser client, which is helpful for debugging.) The same model can generalize to addresses in Japanese or Arabic or Russian as well as it can to addresses in English in the UK or Jamaica or India or the US. Further, it can improve as people add more addresses to OSM. As long as the examples in the training set bear some resemblance to reality, the runtime parser will do a good job on real addresses. That's quite cool, and not something that I or anyone else could possibly hope to write out as a rule-based program.

The drawback to this approach is that when libpostal gets something wrong, which happens in about 0.5% of addresses at present, it may not always be simple to fix by modifying some code. While most of the rule-based parsers get significantly more cases wrong than libpostal, they do offer the developer a greater sense of control, which can be valuable to some people. With libpostal, you have to be able to tolerate having less direct control over the results, and the possibility of mistakes, but overall many users of this project find that tradeoff acceptable because the results are much better than whatever they were using previously.

When it's clear there's a class of errors that can be fixed by amending the training data (usually a particular style of address we haven't accounted for or a missing toponym) or making changes to how we extract the features or input variables, we fix them, but in practice there are some errors we have to "write off" as part of the small fraction the parser gets wrong, and this case is probably one of them. As there's a suggested, simple workaround for this case and many others like it, I think that will have to be my answer for the moment.

Keeping a runtime index of postal codes->road names for the UK would significantly increase the memory requirements for the parser (there are two million postcodes in the UK, and who knows how many streets - and it only helps in one country whereas everyone would have to pay for it), so that's not really something we can consider. What I was referring to is instead reverse geocoding to the nearest street when building the OSM training set, which happens on AWS and can use more resources than the average user has on their machine. No promises on implementing that for v1.1 or that it will improve this case. I'd recommend using the comma heuristic anyway.
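
To make the idea concrete, here is a rough sketch of what "reverse geocoding to the nearest street" could look like for an address point that has no addr:street tag. It is not how libpostal builds its training set: the function names, the 100 m cutoff, and the brute-force search over way nodes are all assumptions for illustration. A real pipeline would use a spatial index and measure distance to way segments, not only to nodes.

```python
import math
from typing import Dict, List, Optional, Tuple

LatLon = Tuple[float, float]


def haversine_m(a: LatLon, b: LatLon) -> float:
    """Great-circle distance in metres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(h))


def nearest_street(point: LatLon, ways: Dict[str, List[LatLon]],
                   max_distance_m: float = 100.0) -> Optional[str]:
    """Return the name of the way whose closest node is nearest to the point.

    Simplification: compares against way nodes only and scans every way.
    Returns None when nothing lies within max_distance_m (assumed cutoff).
    """
    best_name, best_dist = None, max_distance_m
    for name, nodes in ways.items():
        for node in nodes:
            d = haversine_m(point, node)
            if d <= best_dist:
                best_name, best_dist = name, d
    return best_name


def fill_missing_street(tags: Dict[str, str], point: LatLon,
                        ways: Dict[str, List[LatLon]]) -> Dict[str, str]:
    """Copy the tags, adding addr:street from the nearest way if it is absent."""
    filled = dict(tags)
    if "addr:street" not in filled:
        street = nearest_street(point, ways)
        if street is not None:
            filled["addr:street"] = street
    return filled
```

For example, calling `fill_missing_street({"addr:housename": "Fernlea", "addr:postcode": "BA3 5QQ"}, point, ways)` with a `ways` dictionary that maps "Leigh Street" to a few nearby coordinates would add that street to the tags, so the training example would contain both house and road.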