openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

Using statistical translation algorithm for expansion #313

Open nicslos opened 6 years ago

nicslos commented 6 years ago

Hi!

As I was evaluating libpostal for commercial use, I got the idea of using some IBM algorithms for expansion. Does the training data contain any samples for trying this out?

Thx N

albarrentine commented 6 years ago

Hm, which algorithms? Like translating the address from French to English back to French and getting expansions that way? I'm not sure that would work as SMT systems tend not to translate addresses (usually in aligned text something like "100 Main St" in English would not be translated into "100 rue Main" in French).

Can you tell me the high-level goal and then I can answer more specifically about training examples?

nicslos commented 6 years ago

Hi, the IBM Models 1-5 are available, for example. See http://www.nltk.org/api/nltk.align.html

The translation would not be English to French, but unexpanded English to expanded English. It works on pairs of source and target (translated source) samples, the so-called aligned sentences.

The idea is that it takes context into consideration, so it would address the "saint" vs. "street" problem, for example.
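To make the proposal concrete, here is a minimal pure-Python sketch of IBM Model 1 trained on abbreviated/expanded token pairs. The toy pairs and the function name `train_ibm_model1` are made up for illustration, and the NULL source word of the full model is omitted for brevity. Note that Model 1 on its own only learns a context-free table t(expanded word | abbreviated word); in a full SMT pipeline, context would have to come from other components, such as a language model over the expanded side.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """Estimate t(target_word | source_word) with EM (IBM Model 1, no NULL word)."""
    target_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # keyed by (target_word, source_word)
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected alignment counts.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    delta = t[(tw, sw)] / norm
                    count[(tw, sw)] += delta
                    total[sw] += delta
        # M-step: renormalize per source word.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t

# Hypothetical toy corpus: (unexpanded tokens, expanded tokens).
toy_pairs = [
    (["main", "st"], ["main", "street"]),
    (["st", "marks"], ["saint", "marks"]),
    (["main", "ave"], ["main", "avenue"]),
    (["elm", "st"], ["elm", "street"]),
]
t = train_ibm_model1(toy_pairs)
best_for_st = max(["saint", "street"], key=lambda w: t[(w, "st")])
print(best_for_st)
```

On this toy corpus "st" co-occurs with "street" more often than with "saint", so the learned table prefers "street" regardless of position, which illustrates why the lexical table alone does not resolve the ambiguity.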

I will try using, for example, the English training data from your provided OSM data and turning the expanded addresses back into unexpanded ones (somehow applying the expansion rules in reverse). That way I plan to create expanded/unexpanded alignment pairs and train some statistical translation models on them.

It would be helpful if I could tell from the data whether an address is unexpanded or not. I still don't know whether this is possible.

I will contact you with the outcome, if you like. Thx N
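A rough sketch of how such pairs could be generated, assuming a small hand-written reverse-expansion table. The `ABBREVIATIONS` dictionary and both function names are hypothetical; they are not taken from libpostal's dictionaries or API.

```python
# Hypothetical reverse-expansion table: canonical word -> abbreviation.
ABBREVIATIONS = {
    "street": "st",
    "saint": "st",
    "avenue": "ave",
    "east": "e",
}

def abbreviate(expanded_tokens):
    """Replace each known canonical word with its abbreviation."""
    return [ABBREVIATIONS.get(tok, tok) for tok in expanded_tokens]

def make_aligned_pairs(expanded_addresses):
    """Build (unexpanded, expanded) token pairs from expanded addresses."""
    pairs = []
    for address in expanded_addresses:
        target = address.lower().split()
        source = abbreviate(target)
        if source != target:  # skip addresses with nothing to abbreviate
            pairs.append((source, target))
    return pairs

pairs = make_aligned_pairs(["Saint Marks Street", "32nd Avenue East", "Broadway"])
for source, target in pairs:
    print(" ".join(source), "=>", " ".join(target))
```

This assumes every input address is already fully expanded, which is exactly the property that is hard to verify from the data.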

albarrentine commented 6 years ago

I've detailed this in a few other issues, but the training data for that can be trickier than one might initially think. OSM discourages abbreviations and will usually contain "Street" or "Saint" instead of "St," which makes any model's job easier (I would probably train context word vectors for that and try to predict the middle word from the surrounding words). The more difficult part is knowing when a token should not be expanded. In many languages, directionals are abbreviated to a single letter, so "32nd Ave E" would mean "32nd Avenue East," but there are also cases like "Avenue E," where "E" just means the letter "E." In English, it's probably possible to use the OSM TIGER tags for this, since they have tiger:name_base, so there are real ground-truth examples for the plain "E" case. For other languages, though, it would be unclear whether the OSM contributor used abbreviations or whether the example is valid as is. Throwing out all examples that used a possible abbreviation or ambiguous phrase would mean always expanding "E" to "East."
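The "predict the middle word from the surrounding words" idea can be illustrated with a much simpler stand-in: neighbor co-occurrence counts instead of trained word vectors. This is only a sketch of the principle; the `CANDIDATES` table, the training addresses, and the function names are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical candidate expansions for ambiguous tokens.
CANDIDATES = {"st": ["saint", "street"]}

def train_context_counts(expanded_addresses):
    """Count how often each word appears with a given left or right neighbor."""
    counts = Counter()
    for address in expanded_addresses:
        tokens = ["<s>"] + address.lower().split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            counts[(tokens[i], "L", tokens[i - 1])] += 1
            counts[(tokens[i], "R", tokens[i + 1])] += 1
    return counts

def expand_token(tokens, i, counts):
    """Pick the candidate expansion that best fits the neighbors of tokens[i]."""
    candidates = CANDIDATES.get(tokens[i])
    if not candidates:
        return tokens[i]
    padded = ["<s>"] + tokens + ["</s>"]
    left, right = padded[i], padded[i + 2]
    return max(
        candidates,
        key=lambda c: counts[(c, "L", left)] + counts[(c, "R", right)],
    )

counts = train_context_counts(
    ["Saint Marks Street", "Saint Johns Avenue", "Main Street", "Elm Street"]
)
tokens = ["st", "marks", "st"]
expanded = [expand_token(tokens, i, counts) for i in range(len(tokens))]
print(" ".join(expanded))
```

A model like this has no way to decline an expansion, so it does nothing for the "Avenue E" case; that would need training examples where the ambiguous token is known to be correct as written.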

For libpostal's primary expansion use case, deduping/matching, single-best expansions are not necessary: it's fine to produce all the permutations as a set, so "st marks st" => {saint marks saint, saint marks street, street marks saint, street marks street}. Three of those are nonsense, but "saint marks street", "st marks street", and "saint marks st" will all share at least one expansion in common, so they will match.
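The set-based matching described here can be sketched in a few lines. This is an illustration of the idea only, not libpostal's implementation; the one-entry `EXPANSIONS` dictionary and the helper names are hypothetical.

```python
from itertools import product

# Hypothetical expansion dictionary with a single ambiguous abbreviation.
EXPANSIONS = {"st": ["saint", "street"]}

def expand(address):
    """Return every combination of per-token expansions as a set of strings."""
    tokens = address.lower().split()
    options = [EXPANSIONS.get(tok, [tok]) for tok in tokens]
    return {" ".join(combo) for combo in product(*options)}

def is_match(a, b):
    """Two addresses match if their expansion sets overlap."""
    return bool(expand(a) & expand(b))

print(sorted(expand("st marks st")))
print(is_match("st marks street", "saint marks st"))
```

The nonsense permutations are harmless for matching because they only matter if another address produces the same nonsense string.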

nicslos commented 6 years ago

Ok, nice to know. I was also thinking about the context vectors but wanted to give the IBM models a try.