openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

Why the lowercasing? #51

Open Ironholds opened 8 years ago

Ironholds commented 8 years ago

Is the lowercasing of input strings in the output needed for the processing, or more a convenience/normalisation thing? My question is because it makes it hard to substitute in replacements for parsed elements in the original address (unless there's a trick I'm missing).

albarrentine commented 8 years ago

It is for model purposes, but an issue to which I've been giving some thought. The first choice many NLP applications have to make is whether or not to use and thus rely on casing information. In something like named entity recognition, casing is super important (my first name uncapitalized is a preposition in many languages), and if both the input and runtime corpora are formal language text like Wikipedia or edited news articles, we probably wouldn't mind that there's a different feature for "i-1 word=Al" and "i-1 word=al".

For libpostal, case information is reasonably reliable in OpenStreetMap, the input corpus (though even then it's not consistent in all languages and countries), but at runtime casing is highly variable. We're either parsing geocoder queries, which are mostly lowercased, or addresses from some file/database, where input may have been uppercased for post office validation, etc. which would mean a model trained on OSM with casing could fail to parse simple things like "S MAIN ST" simply because it's cased differently from what was seen in the training data.
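
Concretely, a minimal sketch (with made-up feature strings, not libpostal's actual feature extraction) of how a cased model ends up with disjoint features for the same street, and how lowercasing collapses them:

```python
# Hypothetical previous-word feature, purely to illustrate the casing problem.
def prev_word_feature(tokens, i, lowercase=True):
    prev = tokens[i - 1] if i > 0 else "<START>"
    if lowercase:
        prev = prev.lower()
    return "i-1 word=" + prev

# Without lowercasing, "S MAIN ST", "s Main St" and "s main st" produce
# different features for the "MAIN"/"Main"/"main" token; with lowercasing
# they all map to the same "i-1 word=s" feature.
for variant in (["S", "MAIN", "ST"], ["s", "Main", "St"], ["s", "main", "st"]):
    print(prev_word_feature(variant, 1, lowercase=False), "->",
          prev_word_feature(variant, 1, lowercase=True))
```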

In those cases (pun ever so much intended), there are usually two options: either throw away case information and train on lowercased input, or truecase the input on the way in to the model. I chose the former because simplicity, and also because the original use case I envisioned for libpostal was constructing geocoder queries, where the input will be lowercased downstream anyway. There are now a few more use cases than originally imagined :-).

There may be another option I hadn't considered, which is to run the sequence model on the lowercased/normalized intermediate representation of the string, then map the predictions back to the output tokens. That would allow for returning segments of the original string. Some things like hyphen replacement mean the two tokenizations don't align 1-to-1, but still, it's not impossible.
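
A very rough sketch of that mapping-back idea (pure illustration: a whitespace tokenizer and a hypothetical model_predict stand-in, and only valid where the two tokenizations line up 1-to-1):

```python
import re

def tokenize_with_spans(s):
    # Whitespace tokenizer that keeps (text, start, end) offsets into s.
    return [(m.group(0), m.start(), m.end()) for m in re.finditer(r"\S+", s)]

def parse_preserving_original(original, model_predict):
    normalized = original.lower()  # stand-in for libpostal's full normalization
    orig_tokens = tokenize_with_spans(original)
    norm_tokens = tokenize_with_spans(normalized)
    labels = model_predict([tok for tok, _, _ in norm_tokens])  # hypothetical model call
    # Only works when the normalized and original tokenizations align 1-to-1;
    # things like hyphen replacement break this assumption.
    assert len(labels) == len(orig_tokens)
    return [(label, original[start:end])
            for label, (_, start, end) in zip(labels, orig_tokens)]
```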

Out of curiosity, what's the purpose of the substitutions?

Ironholds commented 8 years ago

Basically being able to manipulate the input strings in reference to their parsed components. So, being able to say "replace the house number in $address with $new_house_number", for example.

albarrentine commented 8 years ago

Hm, so the parsed components are returned in order. It should be possible to reconstruct an address by just concatenating the strings in response->components (which may eventually be truecased or mapped to original tokens). Once the values are converted to tabular form, a hashmap, etc. they lose their ordering, but it might be possible to do those kinds of replacements in C land before building the table. It might even be possible to pass in a user-defined function that gets called for each field.
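
For example, with the Python bindings (pypostal's parse_address returns the components in order as (value, label) pairs), a replacement could be done on the ordered list before it's flattened into a dict, at the cost of getting back the normalized (lowercased) form:

```python
from postal.parser import parse_address  # pypostal bindings

def replace_component(address, label_to_replace, new_value):
    # parse_address keeps the components in input order, so a component can
    # be swapped out before the ordering is lost in a dict/table.
    parsed = parse_address(address)
    rebuilt = [new_value if label == label_to_replace else value
               for value, label in parsed]
    # Note: the rebuilt string is the normalized form, not the original casing.
    return " ".join(rebuilt)

print(replace_component("781 Franklin Ave, Brooklyn, NY 11238",
                        "house_number", "783"))
# something like: 783 franklin ave brooklyn ny 11238
```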

Ironholds commented 8 years ago

Yeah, the problem is the components are missing characters even when concatenated, right? Like, as well as the lowercasing, things like commas get dropped, which is why both find-component-in-original-string and fuckit-concatenate-it-all-together dramatically alter the content at best and don't work at worst.

albarrentine commented 8 years ago

True. Comma removal, etc. is for the same reason as lowercasing: we don't want a model that depends on "brooklyn, ny" when "brooklyn ny" is also likely. On the plus side, the normalizations make tokenization dead simple in post-processing (just split on whitespace).

For parsing typically you'd want most of the commas/punctuation removed, no?

> Barboncino Pizza, 781 Franklin Ave, Brooklyn, NY 11238

Result:

{
  "house": "Barboncino Pizza",
  "house_number": "781",
  "road": "Franklin Ave",
  "city": "Brooklyn",
  "state": "NY",
  "postcode": "11238"
}

Keeping the trailing commas in the parse result doesn't seem ideal; it would cause some weirdness if you wanted to extract and display e.g. just the venue name (it would always be "Barboncino Pizza,").

It sounds like what you're talking about is more like lexer output where we'd return tuples of (start, length, type) similar to what libpostal's tokenizer returns. That way the original string stays intact and you can extract the substring, replace it, highlight it in blue, etc. without affecting legibility. Does something like that make sense on your end? Not saying I can necessarily build it right away, but can see the usefulness in a couple of projects that use libpostal.
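
Something like the following, say (hypothetical types and hand-written offsets, not an existing libpostal API), where the spans index into the untouched original string:

```python
from typing import NamedTuple

class Component(NamedTuple):
    start: int    # offset into the original, untouched input
    length: int
    type: str     # e.g. "house", "house_number", "road", ...

def replace_span(original, components, component_type, new_value):
    # Replace one labelled span, leaving casing/punctuation everywhere else intact.
    for c in components:
        if c.type == component_type:
            return original[:c.start] + new_value + original[c.start + c.length:]
    return original

addr = "Barboncino Pizza, 781 Franklin Ave, Brooklyn, NY 11238"
spans = [Component(18, 3, "house_number")]  # offsets written by hand for this example
print(replace_span(addr, spans, "house_number", "783"))
# Barboncino Pizza, 783 Franklin Ave, Brooklyn, NY 11238
```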

Ironholds commented 8 years ago

Sounds right!

a9rolf-nb commented 8 years ago

We're facing an extension of this issue (with probably the exact same resolution strategy) in trying to use pypostal to both normalize and parse German address data. Every umlaut (and ß) anywhere in the input is transliterated, i.e. ß => ss, ä => ae, ö => oe, ü => ue, on top of the lowercasing. Reversing the transliteration after the fact is error-prone, as there are numerous cases where a character pair that looks like a transliteration result is actually part of the correct spelling. And we still have it easy: think about languages with frequent use of accented characters, which transliterate to single output characters, leaving nothing behind to mark where the accent was. Once you transliterate, going back is a surprisingly hard problem.
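
A quick illustration of why naive reversal fails, using real German place names (the reversal rule below is deliberately simplistic):

```python
# Naive reversal: assume every "ae"/"oe"/"ue"/"ss" pair came from transliteration.
def naive_detransliterate(s):
    for pair, char in (("ae", "ä"), ("oe", "ö"), ("ue", "ü"), ("ss", "ß")):
        s = s.replace(pair, char)
    return s

print(naive_detransliterate("baerenstrasse"))  # bärenstraße  -- correct
print(naive_detransliterate("raesfeld"))       # räsfeld      -- wrong; Raesfeld really is spelled with "ae"
print(naive_detransliterate("soest"))          # söst         -- wrong; Soest really is spelled with "oe"
```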

We think libpostal is excellent at classifying portions of the input into type buckets, and we would love to be able to leverage that. But our use case also involves generating canonical forms of all address components "as you would put on the letterhead", and these should be ... in the native language :\

I think Python's re.MatchObject gets this very right, and it could be a model worth imitating inside the postal.parser.parse_address return list. re.MatchObject gives you access to the full input (.string), the text it matched (.group(0)), and the position in the input where that match was found (.start(), .end(); aka .span()). With information like that, even if the actual match.group(0) was transliterated, the original spelling could be extracted from the input, using libpostal's still-excellent classification of what that portion was.

The theoretical "match.string" also doesn't even need to be the passed input. As long as it is available after the fact, it can be a half-preprocessed version (after stripping / fixing punctuation and whitespace, but before lc+translit).
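
For reference, the behaviour being described, in a self-contained example (the regex here just stands in for "the component the parser identified"):

```python
import re

address = "Bäckerstraße 12, 10115 Berlin"
m = re.search(r"\d{5}", address)    # stand-in for "the postcode component"

print(m.string)                     # the full input the match was run against
print(m.group(0))                   # the matched text: "10115"
print(m.span())                     # (start, end) offsets into m.string: (17, 22)
print(address[m.start():m.end()])   # original spelling recovered from the input
```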

albarrentine commented 7 years ago

Hey @a9rolf-nb, libpostal 1.0 has been merged. The parser now uses a simple transliterator which only does HTML entity normalization and a few other tiny things like converting different types of hyphens. It no longer transliterates ü, ß, etc. so should be ok for your use case.

@Ironholds we do still need to do the lowercasing for the time being, as there are still cases where the tokenization in the normalized vs. un-normalized versions would differ.

chahna107 commented 2 years ago

Hi @albarrentine, any further updates for returning the final response of parse_address in terms of the original input? As mentioned in one of your comments, I am looking to get an output of the form (start, end, type) which must be based on the original input text that I provide. But I don't see any such option as of now. Any suggestions for achieving this?
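
For anyone needing this today, one rough workaround (a sketch only, not a libpostal feature, and only valid when each parsed value still appears verbatim, ignoring case, in the input) is to align the parsed values back onto the original string by search:

```python
from postal.parser import parse_address

def approximate_spans(address):
    """Best-effort (start, end, label) spans into the original string.

    Works only when each parsed value appears verbatim (ignoring case) in the
    input; punctuation stripping or other normalization can make a value
    unfindable, in which case it is skipped.
    """
    lowered = address.lower()
    spans, cursor = [], 0
    for value, label in parse_address(address):
        start = lowered.find(value, cursor)
        if start == -1:
            continue  # value was altered by normalization; no clean span exists
        end = start + len(value)
        spans.append((start, end, label))
        cursor = end
    return spans

print(approximate_spans("781 Franklin Ave, Brooklyn, NY 11238"))
# e.g. [(0, 3, 'house_number'), (4, 16, 'road'), (18, 26, 'city'), ...]
```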