openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.

US address expanded incorrectly #302

Open antimirov opened 6 years ago

antimirov commented 6 years ago

Hi guys!

I've been testing the address parser and expansion on some random combinations of house numbers, streets, and cities. Sometimes the results are puzzling.

For example, '4123 Griffin Ave Los Angeles CA 90031' is expanded in this way:

["4123 griffin avenue los angeles calle 90031","4123 griffin avenue los angeles compania 90031","4123 griffin avenue los angeles coahuila 90031","4123 griffin avenue los angeles ca 90031","4123 griffin avenue los angeles compania anonima 90031","4123 griffin avenue los angeles california 90031"]

I'm surprised that an address from something as big and important as Los Angeles could have been misinterpreted by the address expansion algorithm. I mean, the number one choice of expansion is way off with 'CA' -> 'calle'. I can't imagine that the string 'Los Angeles CA' was not present in the training set.

Any ideas?

mkaranta commented 6 years ago

The expansion expands 'terms' mostly without context. I'm probably not using the right word here, but 'terms' can be tokens like "ca", which could expand to one of {canada, calle, california, ...}, or 'X V', which could expand to 15 as a Roman numeral.

e.g.

~/libpostal/src$ ./libpostal 'x v'
x via
x v
15
~/libpostal/src$ ./libpostal 'ca'
coahuila
calle
compania
compania anonima
ca
california

Actually, I'm wondering why 'ca' doesn't expand to Canada.

antimirov commented 6 years ago

Oh, that's unfortunate. I really thought the expand_address functionality was sorting the variants before outputting them. I can imagine that 'Los Angeles CA' must get a better Bayes score than 'los angeles calle'. Maybe I can get around it. What's the best place to look to understand how libpostal works internally, a higher-level description? Besides the source code? :)

mkaranta commented 6 years ago

@albarrentine has left links to papers & sites that describe the algorithms libpostal implements. I don't understand all the code (nor have I seen it all yet) but the algorithms are implemented pretty faithfully to the specs and with clean architecture, so learning the code is not too hard.

albarrentine commented 6 years ago

Hi @antimirov. expand_address does not use any machine learning or ranking of results, nor was that the intended functionality. The results are meant to be treated as a set, so the order doesn't matter as long as similar addresses share an expansion in common. The result is stored as a list because not every language has native sets, and C certainly doesn't, so we return them as an unsorted array of unique strings. The original intent of expand_address was this operation: len(set(expand_address(a1)) & set(expand_address(a2))) > 0 if a1 and a2 are the same.
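
In Python, that set-membership check looks something like the following sketch (illustrative usage, not a prescribed API pattern; the second address string is just an example variant):

from postal.expand import expand_address

a1 = '4123 Griffin Ave Los Angeles CA 90031'
a2 = '4123 Griffin Avenue, Los Angeles, California 90031'

# Two addresses are treated as a likely match if their expansion sets intersect.
match = len(set(expand_address(a1)) & set(expand_address(a2))) > 0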

In the case above, the correct expansion is included in the set: 4123 griffin avenue los angeles california 90031, so it's not a mistake, just that the results are unordered. The reason the Spanish results are included as well is that we train a language classifier (the one piece of machine learning used in the expand API) on geographic strings to help select/narrow down which dictionaries to use for expansion. This simplifies the API and means that the user doesn't need to know the languages a priori. In this case the classifier predicts English with a 92.9% probability, and Spanish with a 6.9% probability, likely because Los Angeles is a Spanish name, and those n-grams can be found in other parts of the Spanish-speaking world (also there are parts of Southern California that have fully Spanish street names, and we recognize Spanish as an official language for the US in libpostal). Most of the time, a US address will have a > 95% probability of English, so none of the other language dictionaries are used, but in this case it was slightly less sure.

The expand_address API also has an optional parameter, address_components, so if you know that CA is the state (after parsing for instance), you can use the ADDRESS_TOPONYM component, which would ignore some of the possible expansions like "calle" which mostly apply to streets.
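
For example, a minimal sketch, assuming the Python bindings expose the component flags (mirroring the C enum) in postal.expand:

from postal.expand import expand_address, ADDRESS_TOPONYM

# Treating "CA" as a toponym ignores street-level expansions such as "calle".
expand_address('CA', address_components=ADDRESS_TOPONYM)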

The problem of automatic disambiguation may not be quite as easy to solve as it seems initially. Sure, for "Los Angeles, CA" there's only one real solution, but there are more difficult cases, like knowing that the "E" in "Avenue E" just means the letter "E" but means "East" in "35th Avenue E." It's possible to use OpenStreetMap to train such a model, since abbreviations are discouraged, and in the US it might be possible to train on the ambiguous cases because of the presence of tags like tiger:name_base, but that tag does not exist in other countries, so we would be assuming that every time we encounter "E", it just means the letter, which may not be practical since there's still a substantial number of abbreviations in OSM. That's just for English, but there are variations in other languages too, and essentially every time there's a one-letter abbreviation, we have a different disambiguation problem (is this a legitimate single letter or did someone use an abbreviation in OSM?).

If the training data were there, the way I'd generally think of it is as less of a multinomial predict 1-of-N sort of problem and more as an n-gram language model (if this token were unabbreviated to one of its possible canonical forms including a single letter, which sequence would be most likely?)
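
To make that idea concrete, here is a purely hypothetical sketch (nothing like this exists in libpostal): enumerate the candidate canonical forms for each token and pick the sequence that an assumed n-gram language model scores highest.

from itertools import product

def best_disambiguation(token_options, sequence_logprob):
    # token_options: one list of candidate forms per token,
    # e.g. [['35th'], ['avenue'], ['e', 'east']] for "35th Avenue E".
    # sequence_logprob: assumed n-gram language model scoring function.
    candidates = product(*token_options)
    return max(candidates, key=lambda seq: sequence_logprob(list(seq)))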

It may be possible to deprecate the public expand_address API at some point now that we have the near-duplicate detection and deduping APIs (which use some of the expand functionality internally, but comparisons are component-wise, with specific logic for each component). In practice I've found that what most people were looking to do with expand_address is address matching, and now we've implemented those ideas directly, so the lower-level function may not need to be exposed. The only other use case I've seen for doing something predictive with expand_address is displaying normalized addresses (not a goal of this project), though since we lowercase, strip accents, parse numeric expressions, etc., that seemed more like something users could implement in a simpler way themselves if they needed canonical display forms.

For a very detailed high-level but reasonably technical overview, here are the two blog posts I wrote about libpostal:

antimirov commented 6 years ago

Thanks for the explanation! It's 'UGE!

"In this case the classifier predicts English with a 92.9% probability, and Spanish with a 6.9% probability"

How do you see/get those values?

albarrentine commented 6 years ago

The language classifier API is not exposed publicly (though we do have a new API function called place_languages that takes a parser result and returns the predicted languages, albeit without the probabilities).

There's a test program called language_classifier that builds when you run make, which is what I used to get the values above. Usage:

./src/language_classifier "4123 Griffin Ave Los Angeles CA 90031"

We treat any language with a predicted probability of >= 0.05 as a possible language for the address. There's usually only one language per address, and while 1-of-N models like multinomial logistic regression are not great at doing true multi-labeling, in practice when more than one language is predicted, it's usually because the address really is in multiple languages (e.g. Brussels, etc.). The primary language it identifies is virtually always the correct one (except maybe in really bizarre cases where there's little non-numeric text to work with), but sometimes there are false positives in identifying a second most probable language, e.g. in California, New Mexico, Texas, etc., where the many Spanish street/city names may be enough to put Spanish above the prediction threshold as a secondary language possibility.
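
As a purely illustrative sketch of that decision rule (the per-language probabilities are not exposed through the public API), using the numbers from this thread:

# Hypothetical (language, probability) pairs, as printed by the
# language_classifier test program for the address above.
predictions = [('en', 0.929982), ('es', 0.069740)]

# Any language at or above the 0.05 threshold is kept as a candidate,
# so both the English and Spanish dictionaries are consulted here.
candidate_languages = [lang for lang, p in predictions if p >= 0.05]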

Because the focus of this part of the library is deduping, we don't need to be too concerned with having a few erroneous expansions (if you have a different use case, can you explain what you're trying to do?). That said, it's possible either to specify languages=['en'] to expand_address and bypass the language classifier altogether, or to do something like this with the new place_languages API from the deduping work (only available in Python and C at present):

from postal.dedupe import place_languages
from postal.parser import parse_address
from postal.expand import expand_address

address = '4123 Griffin Ave Los Angeles CA 90031'
# parse_address returns (value, label) tuples, e.g. ('4123', 'house_number')
parsed = parse_address(address)
values, labels = zip(*parsed)
# place_languages takes the component labels first, then their values
languages = place_languages(labels, values)
# restrict expansion to the single most probable language
primary_language = [languages[0]] if languages else None
expansions = expand_address(address, languages=primary_language)

That would return only:

[u'4123 griffin avenue los angeles california 90031',
 u'4123 griffin avenue los angeles ca 90031']

antimirov commented 6 years ago

Maybe it should be a separate issue, but I've just tried doing 'git pull; ./configure; make' and then running the tool you mentioned. While I get the 'en' and 'es' probabilities, I also get this error message:

$ ./src/language_classifier "4123 Griffin Ave Los Angeles CA 90031"
ERR   transliteration table is NULL. Call libpostal_setup() or transliteration_module_setup() at transliterate (transliterate.c:675) errno: None

Languages:

en (0.929982)
es (0.069740)

antimirov commented 6 years ago

May I ask, why doesn't place_languages() return the probability of each language? This is such important information, in my opinion. You already have it in the src/language_classifier tool, but it seems like a waste of time to launch that for every address line. If I added this to place_languages, would you accept the pull request?

(A list of countries with probabilities would also be a good addition to that function).

albarrentine commented 6 years ago

Fixed the warning about transliteration.

For our purposes the probabilities were not as important to expose because we're more concerned with false negatives (missing a correct expansion because there's more than one language, etc.) than with false positives (producing the wrong expansion because two languages were identified when only one was needed). Overall the current decision threshold was tuned with three conditions in mind:

  1. There are many cases where street signs/venue names have two languages and one is systematically longer than the other, like in Hong Kong, where both English and Cantonese are used and there may be only three Han ideograms but 10 characters of English. French tends to have longer words as well, so it will also tend to dominate other languages in a multilingual short text. If the threshold is too high, these cases will be missed.
  2. Some languages are statistically very similar, so the quadgrams can be ambiguous: in languages like German, Danish, Dutch, etc., shorter or less common strings in one language may lead the model to predict all of the others with a small probability as well. There are also cases where transliterated sequences in a language like Korean may randomly look a little bit like, say, Norwegian.
  3. Many street names around the world, for various terrible reasons related to imperialism (very often by the US), are borrowed from other languages. It may be that the "official" (colonial) language is English and that's the language we would use for street expansions, but the rest of the string may contribute the majority of its quadgrams in another language, so the language that's used for street expansions may be secondary in some cases.

In my experiments, using a 0.05 threshold for inclusion was able to account for the "wordier languages" phenomenon by ensuring that the correct language was in the top N (almost always the first most probable), while also being able to narrow down 1-2 languages for the case of super ambiguous expansions like "St" (if we have no evidence of e.g. Slovenian in the rest of the text, we're not going to want to replace every instance of "St" with "Številke" or that would balloon the number of expand permutations unnecessarily).

The first predicted language is almost always the correct one. The classifier is about 98.5% accurate on a held-out set (some of that data is mislabeled, so it may be even better than that), and certainly far better than something like the Naive Bayes-based Chromium language detector, which performs poorly on short abbreviated geographic strings.

One case where it does not do well is when there are two languages and class imbalance may affect the outcome slightly. Dutch in Brussels is an example because the Netherlands/Flanders have better OSM coverage than all of the French-speaking countries combined, which means that, even though the French street name can be longer, sometimes the classifier will predict only Dutch with an extreme probability close to 1.0, even though its estimates are usually better calibrated and will reflect multiple languages. Part of the reason for this is that we don't train a specific objective for the multi-label case; we just create separate training examples for e.g. "name:fr" and "name:nl". Ideally language would probably be treated more as a multi-label or sequence/range problem, though I think it may make more sense to keep the current model and down-sample languages like Dutch that are over-represented in the training set. Might revisit that piece at some point, though so far in the deduping use case there haven't been many real-world examples in the affected places, so it didn't need to be solved right away.

Can you explain what you're trying to do and why having exact languages/probabilities is an issue?

albarrentine commented 6 years ago

As for country prediction, it's been discussed in a few other issues, and may be possible to implement soon. To work well, it would need to be done using toponyms and a place hierarchy rather than simple text (i.e. cases like "100 Main St" don't tell us much about the country, as addresses tend to have more to do with the languages people speak than with the borders in which they reside). Providing meaningful probabilities in that case may not be as easy as with languages, because they would have to be based either on population data (often incomplete, even for large cities) or on address density estimation (global address data is incomplete in OSM, especially for e.g. China and India, and would also require either using OSM's hierarchy, which triggers share-alike, or deduping OSM with another place hierarchy, which is not impossible but a substantial undertaking).

antimirov commented 6 years ago

Thanks again for your comprehensive explanation.

What I would like to accomplish is still the same: I would like to be able to set some threshold so that "Los Angeles CA" is not expanded as "los angeles calle" or "los angeles coahuila". In this example, even though Spanish has a probability of only 0.069740, it gets a prominent place (the first 4 of 6 elements) in the expansions list. If I could pass a threshold of, say, 10-15% to expand_address, this would not have happened.

albarrentine commented 6 years ago

My question was more about the end goal, i.e. what are you planning to do with the result? Display it? Store it? Just want the output to show the correct result first? The reason I ask is: if the goal is to display the top result, that's really not what expand_address was designed to do, and exposing the internals or allowing the user to set a higher threshold for language classification (and there really is no perfect threshold) will not solve the underlying issue, which is a mismatch in use cases. To illustrate, even a very simple case in a single language like "100 Main St" will produce "100 main saint" as the "top" result (I highly recommend wrapping the results in a set in Python; it makes the meaning clearer, i.e. we wouldn't think twice about why an element comes first or last in a set, as it simply depends on the hash value). Of course "100 main saint" is nonsensical to a human, but in a deduping setting it means we can ensure the correct answer "100 main street" is one of the potential results with 100% accuracy, without needing to disambiguate whether "St" means "Saint" or "Street". Predicting that with a model will inevitably introduce some errors since models are rarely perfect, but, in exchange for tolerating a few extra wrong answers, we can guarantee that set joins work for deduping. That's an acceptable tradeoff for our use case, so fancier methods were not necessary.

It's always possible to parse addresses first, then pass only the street-level details to place_languages, expand_address, etc. if you just want better accuracy. Or you can do what we do in the new deduping API, for instance, and call expand_address multiple times with different options for address_components= for each component, so only house number phrases apply to house numbers, street phrases apply to streets, toponym phrases apply to toponyms, etc. That ensures that "CA" will not expand to "calle" because it's not part of the street name. The general expand_address API is unaware of address components, so it tries not to make assumptions about what might be passed in.
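
A rough sketch of that component-wise pattern (not the dedupe API's internals; the flag names assume the Python bindings mirror the C component enum):

from postal.parser import parse_address
from postal.expand import (expand_address, ADDRESS_HOUSE_NUMBER,
                           ADDRESS_STREET, ADDRESS_TOPONYM)

# Map parser labels to the expansion options that make sense for them.
COMPONENT_OPTIONS = {
    'house_number': ADDRESS_HOUSE_NUMBER,
    'road': ADDRESS_STREET,
    'city': ADDRESS_TOPONYM,
    'state': ADDRESS_TOPONYM,
}

address = '4123 Griffin Ave Los Angeles CA 90031'
for value, label in parse_address(address):
    options = COMPONENT_OPTIONS.get(label)
    if options is not None:
        # "CA" is expanded only as a toponym here, so "calle" never appears.
        print(label, expand_address(value, address_components=options))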

The reason place_languages returns Spanish in this case is simply that there are legitimate Spanish tokens in the address when city=Los Angeles is passed, and it operates on character sequences. If instead you pass street="Griffin Ave" to place_languages it will predict English and use only the single language. There are still some cases like "El Segundo Blvd" which might predict both Spanish and English, and that's valid. It's possible, though maybe uncommon, to say/write "El 2o Blvd" and a Spanish speaker living in LA would know what that means.
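
For instance, a minimal sketch of that street-only call (assuming the parser's 'road' label is what place_languages expects for street-level input):

from postal.dedupe import place_languages

# Classify the language using only the street component: labels first, then values.
place_languages(['road'], ['Griffin Ave'])  # predicts only English, per the behavior described above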

Despite the ambiguities, it's still often helpful to use the city name in language classification in the general case when it's available because sometimes that gives us a little more text and tells us something about the language (or the address may simply be a venue name + a city in which case we have no street name, and e.g. a restaurant named "Il Fornino" does not necessarily mean the whole address is in Italian, as it could just be an Italian restaurant in Japan or somewhere else, so the city can be useful in that sense). That's why place_languages will use the city if given.

At the moment, I'm not convinced that it's necessary to expose the language probabilities, which are effectively model internals, as doing so would limit my flexibility in changing the implementation. For instance, I might at some point decide to model the languages with word pseudocounts instead of probabilities, or scores that don't have a probabilistic interpretation, or a one-vs-rest classification instead of probabilities that sum to 1 over all classes, and then any external thresholding people were doing would break. With the parsing model, I'm more willing to move to probabilities and top-N parses because I can commit to that being the model. However, language classification in potentially-multilingual short sequences is not a well-studied problem, and I may want to change the way we model it (e.g. in many of the countries mentioned, deduping would benefit from modeling multilingual street names/toponyms directly), and would prefer to have that option.

jsfenfen commented 5 years ago

Can I suggest that y'all include the complete result of expand_address in the table in the readme, which looks like the below as of today?

[screenshot of the expand_address examples table from the README]

The response I get for 'One-hundred twenty E 96th St' is: ['120 east 96th saint', '120 east 96th street', '120 e 96th saint', '120 e 96th street', '120 east 96 saint', '120 east 96 street', '120 e 96 saint', '120 e 96 street']

While the reasons for your choices seem pretty well set out in the above, it can be pretty confusing to the naive user as to why the expansions shown in the readme table don't match up to the expected result.

Indeed, the "set" response is probably not the layman's understanding of normalization, and can be frustrating in a context where you want one normalized response per input. I found the postgis tiger geocoder pagc_normalize_address to be more what I thought of as "normalization". Certainly it works on a smaller input domain, but if your input fits, it may be relevant.

pointOfive commented 5 years ago

Hey -- not to pile on in a bad way, because I've been reviewing and exploring this project for the past five hours and think it's really outstanding; however, I agree with the previous replier that the layman's (vernacular) understanding of the term "normalization" would be "a single best result conforming to some specified standard" -- like the option Google Maps returns or Amazon makes you choose when you've entered some non-standard (and possibly slightly mangled) address. Much of my reason for spending so much time orienting myself to the project has been the necessity of coming to terms with the fact that, as you have generously clarified above (thank you!), the project (to date) indeed was not intended to provide the functionality I had naively expected (as I searched in vain through parameter flags and mostly undocumented auxiliary functionality in the hopes of arriving at the promised land).

Also, since you've asked about the use case several times without getting a direct answer, I wanted to throw my answer out there: I'm looking to augment a record-matching process. I was hoping to standardize (or, as I had previously misunderstood, "normalize") addresses on which to perform joins (on perfect matches). Reading through the deduping API (linked above) has been very helpful in that regard.

missinglink commented 5 years ago

@pointOfive I don't believe it's possible to have a single canonical representation of all addresses in all locales (although it is easier in English).

For example, the humorously named Max-Beer-Straße in Berlin could also be written as Max Beer Str. or Maxbeerstrasse, or even Maxbeerstr. Which one would be the canonical version, and how would you deal with the compound words when mapping between them?

The 'best' selection there would be Maxbeerstrasse as it's easy enough to convert the other three variants to this by removing punctuation and flattening the Eszett, although this is not how it's written on the street signs, and not how people would like it displayed.

Possibly not the best example, but it somehow illustrates that it's likely not possible to come up with a single canonical version of an address.

I'm sure Al will have some better examples :)

missinglink commented 5 years ago

Another example is something like St Pauls: is that Saint Pauls or Sankt Pauls?

If libpostal chose one then it would undoubtedly be wrong in a lot of cases.

missinglink commented 5 years ago

One last example, in English, is street suffixes, which can unfortunately share the same abbreviation:

e.g. the abbreviation br can mean brace, branch, or brae.

So the tradeoff of having a single 'expansion' is that the library would have to guess, and in a lot of cases would guess wrong.

Is that what people want from this API? Or would it be better to leave it as-is, so that we have all potential forms, which are useful for matching and deduplication?

timstallmann commented 1 year ago

If anyone else is, like me, reading this thread and having trouble finding the links to the dedupe API documentation (links in the original comments are now stale :/), here it is