openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

I deeply wonder: Did you completely ignore the gridiron city planning style? #295

Open blaggacao opened 6 years ago

blaggacao commented 6 years ago

I understand that the address labels are based on OpenCageData's address format, but I feel that the people behind it lack a complete understanding of gridiron city planning (https://en.wikipedia.org/wiki/Grid_plan).

This is like ignoring structured data for at least 10-15% of the world's population's address data... Across the whole of South America (more precisely: the former Spanish colonies), the grid plan is fundamentally based on the "Laws of the Indies" decreed by King Philip II of Spain in 1573 (https://web.archive.org/web/20041215070244/http://www.arc.miami.edu/Law%20of%20Indies.html).

That's structured data, deliberately ignored... :wink:

I made a quick test of your library via the Mapzen online search, and indeed it performs VERY poorly in Bogotá: not a single address is recognized. QED

¿How can I help?

blaggacao commented 6 years ago

Could you tell me which folders of https://github.com/openvenues/libpostal/tree/master/resources are human-editable? I might work it out from the examples given and encode my knowledge of the Colombian address format. I have a (publicly available) dataset with:

... and some understanding of the different representations of the Colombian address format...

Not sure to what extent this entropy might be useful...

blaggacao commented 6 years ago

I also have a growing private collection of addresses written by the users of my data-migration customers... I could use this for validation, but unfortunately I cannot publish it.

blaggacao commented 6 years ago

Here is an example of a normalization step via regex from my dirty dataset: https://gist.github.com/blaggacao/1d90b9cf669366113054e4e4045f4883. I think this could easily go into some of the yaml files...
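
For readers unfamiliar with Colombian abbreviations, a normalization step of this kind looks roughly like the following sketch (illustrative only; this is not the contents of the gist above, and the abbreviation table is a made-up subset):

import re

# Illustrative subset of Colombian street-type abbreviations mapped to canonical
# words; the real abbreviation lists live in libpostal's per-language address
# dictionaries under resources/.
ABBREVIATIONS = {
    r'\b(cra|cr|kra|kr)\.?\b': 'carrera',
    r'\b(cl|cll)\.?\b': 'calle',
    r'\b(dg|diag)\.?\b': 'diagonal',
    r'\b(tv|transv)\.?\b': 'transversal',
    r'\b(av|avda)\.?\b': 'avenida',
}

def normalize(address):
    address = address.lower()
    for pattern, replacement in ABBREVIATIONS.items():
        address = re.sub(pattern, replacement, address)
    return address

print(normalize('Cra 18 # 63-64 B'))  # -> 'carrera 18 # 63-64 b'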

blaggacao commented 6 years ago

https://gist.github.com/blaggacao/60c2004d1af9766f8ae95e0ea68680f0 This is my attempt to normalize grid-based street data. Not sure if it's possible to transcribe (some of the) structure from the regex into the ML model somehow... Show me!

But there is also a hierarchical nomenclature in Colombia which goes something like this: City -> Barrio -> Block -> House No (no street)

albarrentine commented 6 years ago

We have very much thought about grid addresses in libpostal for Colombian cities as well as other places in Latin America e.g. Brasilia. A decent amount of work was devoted specifically to handling Colombian addresses better and has been incorporated into the 1.0 release. Since a machine learning model's accuracy depends on the quality and quantity of training data, and OSM data in Colombia is fairly sparse with varying formats, I also personally worked to ensure that almost 2 million Colombian addresses from Bogotá and Medellín were included in the OpenAddresses project, and hence our training set for libpostal. This covers almost every address in those two cities, direct from government sources, and it's vastly improved parsing on Colombian addresses, like this one in Bogotá, from our test cases, which most regexes would get wrong:

> Cra 18#63-64 B Chapinero Bogotá DC Colombia

Result:

{
  "road": "cra 18",
  "house_number": "#63-64 b",
  "city_district": "chapinero",
  "city": "bogotá dc",
  "country": "colombia"
}

Hopefully this does not constitute "deliberately ignoring."

Saying the model performs very poorly without any examples is not particularly helpful. If there's a specific pattern of address that libpostal itself is consistently missing (Mapzen Search not being able to find it is a different story), you're welcome to post the specific case, the expected output and libpostal's result, and we'll attempt to train for that pattern in the next batch (OpenCage's patterns are focused on displaying addresses, and are only a small part of what we do - there's also a massive Python repo embedded in this project that can create multiple formats and styles per country in our training data, as well as generating address components that don't occur as often in OSM like apartment numbers, etc.), which should hopefully make the model perform better on your use case.
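
For anyone putting together such cases, capturing libpostal's own result is straightforward with the Python bindings (a minimal sketch; the commented output mirrors the example above rather than a freshly captured run):

from postal.parser import parse_address  # pypostal bindings for libpostal

print(parse_address('Cra 18#63-64 B Chapinero Bogotá DC Colombia'))
# [('cra 18', 'road'), ('#63-64 b', 'house_number'), ('chapinero', 'city_district'),
#  ('bogotá dc', 'city'), ('colombia', 'country')]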

In our contributing guide I've outlined a process for diagnosing bad parses, and doing so proactively and constructively. I've found that geo people often have an adversarial relationship with technology ("I bet those idiots didn't think of this edge case!"). It's unproductive, a little mean-spirited, and very often misses the forest (i.e. it's pretty damned awesome and unprecedented to even be able to handle the majority of standard cases in multiple countries) for the trees (pet test cases that are interesting but occur once). I encourage people evaluating libpostal to test it on a representative sample and look at where it's performing well in addition to where it's performing not-so-well, and bring up those concerns in a respectful way, rather than coming in with guns drawn at an open source author who's on the same side and has invested substantial time, labor, thought, and energy into this project that everyone gets to enjoy for free.

blaggacao commented 6 years ago

Sorry for the tone. I hope a stormy start turns into a fruitful collaboration.

My personal motivation is to contribute whatever "smart entropy" (knowledge about structure) I can provide, in order to help improve libpostal so I can use it for my own set of problems in data migration projects...

With interests aligned and the fire ceased, I was basically expressing my honest disbelief about this result (thanks for the example):

> Cra 18#63-64 B Chapinero Bogotá DC Colombia

Result:

{
  "road": "cra 18",
  "house_number": "#63-64 b",
  "city_district": "chapinero",
  "city": "bogotá dc",
  "country": "colombia"
}

which, I think, falls short of the nomenclature; it should be something around:

> Cra 18#63-64 B Chapinero Bogotá DC Colombia

Result:

{
  "road": "cra 18",
  "street": "#63",
  "house_number": "64 b",
  "city_district": "chapinero", # Not a valid "localidad"
  "city": "bogotá",
  "district": "dc",
  "country": "colombia"
}

or better, and sometimes even necessary to make it useful:

> Cra 18#63-64 B Chapinero Bogotá DC Colombia

Result:

{
  "road_type": "carrera",
  "road_no": "18",
  "street": "3",
  "house_number": "64 b",
  "city_district": "chapinero",
  "city": "bogotá dc",
  "country": "colombia"
}

This is basically what I mean by ignoring the grid pattern. I don't think it's mean-spirited (at least it wasn't meant that way), but the gridiron scheme genuinely doesn't fit properly into the road / house-number pattern exposed by OpenCageData.

Having laid out my opinion, I would feel motivated to contribute more than just bad parses, because I think I have a growing understanding of the underlying problem...

I would feel very stupid bombarding you with bad parses, as I think that with a little help I would be able to formalize this smart entropy and feed it back into the model myself... That's how I would love to offer my help on this project.

Now the question I ask myself: how to extract this (I might be able to solve that) and how to formalize it (I'm looking to understand the yaml files and, most importantly, to know which ones I can manipulate in a PR and which I shouldn't touch).

blaggacao commented 6 years ago

Well, reading your reply again, I'm actually wondering at this point whether "smart entropy" (as I deliberately named it for lack of a better term) is something useful in the context of training machine learning models. I may have gotten the feeling I'm in the wrong place... but then that would (still) be my ignorance of the machine learning world. :wink:

albarrentine commented 6 years ago

No worries. Thanks for the clarification.

As far as compound house numbers with more specific structure/meaning go, this has been brought up a few times (#121 and #197). In both cases, it seemed better implemented outside the scope of the address parser itself.

There are three primary reasons for this:

1. Breaking up discrete tokens can lose information

Tokens are the atomic unit for most sequence models. The parser model operates on tokenized input (we also lowercase, normalize, remove accents, etc.) like this:

input_tokens = ['cra', '18', '#', '63-64', 'b', 'chapinero', 'bogota', 'dc', 'colombia']

and predicts a label for each token, so the output would be an array of equal length, like this:

['road', 'road', 'house_number', 'house_number', 'house_number', 'city_district', 'city', 'city', 'country']
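
Purely as an illustration (not libpostal's internal code), folding those per-token labels back into the component dictionary shown earlier amounts to grouping consecutive tokens by label:

# Illustrative grouping of per-token labels into components.
input_tokens = ['cra', '18', '#', '63-64', 'b', 'chapinero', 'bogota', 'dc', 'colombia']
labels = ['road', 'road', 'house_number', 'house_number', 'house_number',
          'city_district', 'city', 'city', 'country']

components = {}
for token, label in zip(input_tokens, labels):
    components[label] = f'{components[label]} {token}' if label in components else token
print(components)
# {'road': 'cra 18', 'house_number': '# 63-64 b', 'city_district': 'chapinero',
#  'city': 'bogota dc', 'country': 'colombia'}
# (real detokenization re-attaches '#' without the extra space)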

The training examples are labeled in a similar way, so they need to use the same tokenization method both at training time and at runtime.

The trouble with breaking up the "63-64" into, say, "63/street -/SEP 64/house_number" (where SEP is an ignorable label we use internally for non-address tokens like a colon, stray hyphen, semicolon, etc.) is that doing so means we have to break up all hyphenated numbers in every country. In the case of house number ranges, "63-64" loses its meaning if we ignore the hyphen. It breaks the "do no harm" principle of libpostal's lexer/tokenizer, and also the TR-29 spec for Unicode segmentation, which we follow and extend.

We know which country an address is in at training time, but not at runtime (will go into that in a moment), so we're not necessarily able to use any per-country switches at parse time. Even if we were, a hyphen can mean different things even within a single country. Most of the time in the US, the hyphen is used for house number ranges, but it can also be used to attach an apartment/tower onto the house number, and in Queens, New York there's even a system similar to Colombian cities where "86-02 37th Ave" would be 37th Ave between 86th St and 87th St, number 2 (we don't have the distance from the intersection though, which is what's great about Colombia - don't even need full address data for geocoding in the cities, just the road network and a query address are enough to derive an impressively accurate lat/lon). In any case, hyphens can mean many different things to different people, so we try to preserve that potentially-relevant information at parse time.

2. At present, libpostal has no specific knowledge of countries/geography.

The parser is a pure text model that learns the associations between words/phrases and address components. So it might learn a high positive weight for the country tag when it sees "CO" preceded by "Bogotá", "Medellín", etc. and a high positive weight for the state tag when "CO" is preceded by e.g. "Saltillo" (México) or "Denver" (US). But it doesn't really know that Bogotá is in Colombia, nor does it know that "CO" and "Colombia" and "República de Colombia" are the same place.

That said, there's some interesting work on toponym resolution coming soon which would allow resolving all the toponyms in the string to a place hierarchy, and then can traverse the hierarchy to get Colombia when toponyms like "Bogotá" are present in a string (also Bogotá from just seeing Chapinero, etc.). However, as far as the parser goes, we have to handle many types of input from simple street names to venues to simple place names to full addresses. We've trained an address-specific model to predict language given text input, but country is a little trickier when there are no toponyms available. For instance, in OpenStreetMap there's a "Calle 8" in Puerto Rico, R.D., Costa Rica, Colombia, Peru, Venezuela, Argentina, the US, etc. so it wouldn't be clear from just "Calle 8" which country it is. The country can also be ambiguous e.g. there are something like 1200 different cities in the world named "San Francisco". Usually, if there's no qualification, it means the one in California, but for an in-process library without access to a full geocoding database, it's difficult to know without checking the other components against an index.

So even with toponym resolution/disambiguation, we may not always know the country for all of the use cases we need to handle, or may only be able to narrow it down to one of 10 countries.

3. Adding new labels can be expensive in terms of model size/accuracy

The machine learning model we use at present, a Conditional Random Field, is quadratic in the number of labels, and maybe more importantly, every new label adds more parameters that the model needs to learn to distinguish e.g. "road" from "street" from "house_number", which can increase the already hefty size of the model. It also creates more potential for mistakes since the newly proposed labels are very similar to existing ones.
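
To make the "quadratic" point concrete, here is a rough illustration with made-up label and feature counts (not libpostal's actual numbers): a linear-chain CRF keeps one transition weight per ordered pair of labels, plus one state weight per label for every input feature.

# Hypothetical counts, only to illustrate the growth.
current_labels, proposed_labels, num_features = 20, 23, 1_000_000
print(current_labels ** 2, proposed_labels ** 2)          # 400 vs. 529 transition weights
print((proposed_labels - current_labels) * num_features)  # 3,000,000 extra per-feature weights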

As such, we try to economize on adding new labels unless there's no other option and no way to sufficiently cover that case with the existing labels. In the full 1.1 release of the parser (which was delayed a bit to get the new deduping changes out), for example, we're adding a "building" label for large apartment blocks that have numbered/lettered buildings e.g. "Torre C Piso 4 Dpto 410". Since there are large apartment blocks all around the world, it made sense to add the new component. However, unless it's a component that can be found virtually everywhere in the world, we try to re-use an existing component wherever possible.

Also, we try wherever possible to respect the tagging conventions in OpenStreetMap, since that's our primary source of ground truth in most of the world. For Colombia, the convention has been to use addr:street for e.g. "Cra 18" and addr:housenumber for e.g. "64-63 B". As I understand it, the "64-63 B" is what would be the label on the outside of the building, which seems to accord with what we generally think of as a house number.


For the moment, I think I'd recommend implementing the more granular parsing as a post-processing step on top of libpostal's results. For instance, if libpostal can get as far as road="Cra 18" and house_number="63-64 B", then the regexes you mentioned can be used to extract "18" as road_no, "63" as street and "64 b" as house_number.
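
A sketch of that post-processing idea (the regexes, the road_type/road_no/street field names, and the split_colombian_grid helper are all hypothetical here, not part of libpostal's API):

import re
from postal.parser import parse_address  # pypostal bindings for libpostal

# Hypothetical post-processor: split Colombian grid components out of libpostal's
# road / house_number output. The patterns are illustrative, not exhaustive.
ROAD_RE = re.compile(r'^(?P<road_type>cra|carrera|cl|calle|dg|diagonal|tv|transversal|av|avenida)\.?\s*(?P<road_no>\d+\s*\w*)$', re.I)
NUMBER_RE = re.compile(r'^#?\s*(?P<street>\d+\w*)\s*-\s*(?P<house_number>\d+\s*\w*)$')

def split_colombian_grid(parsed):
    components = {label: value for value, label in parsed}
    road_match = ROAD_RE.match(components.get('road', ''))
    number_match = NUMBER_RE.match(components.get('house_number', ''))
    if road_match:
        components.update(road_match.groupdict())
    if number_match:
        components.update(number_match.groupdict())
    return components

print(split_colombian_grid(parse_address('Cra 18#63-64 B Chapinero Bogotá DC Colombia')))
# adds e.g. road_type='cra', road_no='18', street='63', house_number='64 b'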

Once we have toponym resolution in place, and can predict country for some addresses, I'd be open to adding a new API that could break house numbers down further per-country or per-geo, again on top of the parser's results (or the raw data in the case of data sets that are already split up into fields). This way it's easy to add one country at a time, and there's no risk of one country's results affecting the rest of the world. Also if country can't be determined, or a country's not mapped, the parser result can be left as is. There could potentially be room for ambiguity with multiple formatting rules and multiple different alternatives given, and all of this would likely improve the new deduping functionality. It would be great to be able to match one address where house_number="1-A" with another where house_number="1" and unit="A", which compound house number parsing could provide.

This would involve a new config that gets parsed and converted into C data structures to use at runtime, similar to what's done for the address dictionaries, numeric expressions, and transliteration.

Sound reasonable?

blaggacao commented 6 years ago

OK, thanks for this explanation and crash course in machine learning. I understand now why there is this gap between the library and an intuitive ("amplified") interpretation. Maybe that's part of the reason why Uber never gets directions right in Colombia. You're practically lucky if it does...

If you need my help with geo resolution, I'm a bit into the topic.

However, I have one big suggestion about the above: a significant percentage of use cases comes with context information about the actual location. Some examples:

That said, to optionally scope the model for such cases, I think country could be made an optional context condition passed to the parser. Most of the time, the address space within a country is relatively homogeneous, or there are only low-ambiguity parallel schemes (that's the kind of disambiguation that people within a society's boundaries would naturally develop over time)...

So I'm not sure what the impact would be, but giving the model relative certainty about the country scope via a pre-processor (even with probability=1) instead of a post-processor could potentially have a huge impact on accuracy and also on intuitiveness, and you could probably sell it to Uber right away :)

EDIT: I understand that's probably somewhat similar to what you meant by "(or the raw data in the case of data sets that are already split up into fields)"

blaggacao commented 6 years ago

Actually, for the time being, my regexes extract about 90% correctly at the grid level, without toponym resolution and before the entrance/house/apartment/office level, which I ignore, as people sometimes even write "between the two street lamps around the corner" (exaggerating). See: https://regex101.com/r/AccasN/1

EDIT: I'm actually convinced this would already perform better than Uber's search in most cases in Bogotá, with some simple preprocessing...

blaggacao commented 6 years ago

Geo resolution Colombia:

albarrentine commented 6 years ago

Paraphrasing from a different answer: the idea of feeding admin/country information into the parser seemed intuitive to me initially, but I found that a global model which did not consider country/language performed better overall and was significantly smaller (very important since we're at 1.8GB already). This makes some degree of sense in that we might have many training examples for one country, say México, and not as many for a nearby country, say Honduras. They're both in Central America, share the same language, etc. so it makes sense that some of what libpostal learns about address structure/words/phrases in México will transfer over to parsing in Honduras. Modeling country directly in the parser would mean we're effectively creating multiple different parameter spaces that have to be learned separately, whereas with a global model, every Spanish-speaking country can share statistical strength for the words/phrases/patterns they share in common while still learning their own idiosyncratic words, names, toponyms, etc.

I can revisit that after toponym resolution. One thing we have now with the Conditional Random Field model that we didn't when I initially studied this idea was the ability to do joint input+transition features, so what a CRF could potentially have is a feature that has scoped label transition weights per country and per language (i.e. how likely it is to transition from "house_number" to "road" given country=co, or language=es, or both). That would probably improve performance with minimal weight added since it only needs a few hundred new variables. During training it would then drop out country and language randomly so it's having to learn both with and without those inputs. Though we do know country for certain at training time, it's still better to use predicted country so the parser doesn't have to rely on information it may not have at runtime.