openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License
4.07k stars, 419 forks

Plural generated phrases for units #233

Open coachwei opened 7 years ago

coachwei commented 7 years ago

Input address: "Eldorado Mall, Suites 26-27, Panama City, Republic of Panama, Panama"

Parsed result:

"country": "republic of panama panama",
"city": "panama city",
"house": "Eldorado Mall Suites",
"house_number": "26-27"

Yes, the input isn't perfect (which happens), but the parsed result above is poor in several ways.

One thought: I wonder whether libpostal should take advantage of the commas in the input address, which carry semantic meaning.

albarrentine commented 7 years ago

Most of libpostal's handling of units, etc. is based on generating simple phrases to augment the training data from OSM and other sources. Except for maybe one or two cases in OpenAddresses, the parser will not see any plurals like "Suites 26-27" or "Units 1 & 2" in training. The word "Suites" is also often used in hotels, etc., so without commas that's not a totally unreasonable parse.
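To make the idea of plural generated phrases concrete, here is a minimal sketch of what such augmentation could look like. It is not libpostal's actual generator: the `UNIT_PLURALS` table, the `RANGE_TEMPLATES` list, and the `plural_unit_phrases` function are all hypothetical names invented for illustration.

```python
from typing import List

# Hypothetical singular -> plural unit words (illustrative, not libpostal's dictionaries).
UNIT_PLURALS = {"Suite": "Suites", "Unit": "Units", "Apt": "Apts"}

# Hypothetical templates for phrases that cover two unit numbers.
RANGE_TEMPLATES = ["{plural} {a}-{b}", "{plural} {a} & {b}", "{plural} {a} and {b}"]


def plural_unit_phrases(unit: str, a: int, b: int) -> List[str]:
    """Return plural phrases for a unit word and a pair of unit numbers."""
    plural = UNIT_PLURALS[unit]
    return [template.format(plural=plural, a=a, b=b) for template in RANGE_TEMPLATES]


print(plural_unit_phrases("Suite", 26, 27))
# ['Suites 26-27', 'Suites 26 & 27', 'Suites 26 and 27']
```

Phrases like these could then be attached to training addresses in the same way as the singular unit phrases, so that the parser sees plurals during training.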

I've discussed the problem with commas in a few other issues, but suffice it to say that asking the model to use comma boundaries to break phrases seems like a good idea (that's how our brains read), but in practice it hurts performance on addresses without commas, which are equally common. The model does quite well without commas. In cases where it seems like the comma would do some obvious good, there's almost always something else about the address that the parser hasn't seen before or isn't good at identifying.

For reference, my process when diagnosing a bad parse is: keep changing/deleting single words in the address until it gets the right answer. That will quickly determine what the problematic word/phrase/component actually is. In this case, changing "Suites 26-27" to "Suite 26" gets the correct result.
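The deletion half of that process can be automated. The sketch below is one possible way to do it, not part of libpostal: `single_word_deletions`, `find_problem_words`, `toy_parse`, and `house_is_mall` are hypothetical names, and `toy_parse` is a stand-in that only imitates the bad parse from this issue. In real use you would pass a wrapper around the actual parser instead.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Parse = Callable[[str], Dict[str, str]]
Check = Callable[[Dict[str, str]], bool]


def single_word_deletions(address: str) -> Iterator[Tuple[str, str]]:
    """Yield (removed_word, variant) for each variant with exactly one word deleted."""
    words = address.split()
    for i, word in enumerate(words):
        yield word, " ".join(words[:i] + words[i + 1:])


def find_problem_words(address: str, parse: Parse, is_correct: Check) -> List[str]:
    """Return the words whose removal makes the parse pass the check."""
    return [word for word, variant in single_word_deletions(address)
            if is_correct(parse(variant))]


def toy_parse(address: str) -> Dict[str, str]:
    # Stand-in, NOT libpostal: mislabels the house only when "Suites" is present.
    if "Suites" in address.split():
        return {"house": "Eldorado Mall Suites"}
    return {"house": "Eldorado Mall"}


def house_is_mall(parsed: Dict[str, str]) -> bool:
    # Hypothetical check for the expected value of one component.
    return parsed.get("house") == "Eldorado Mall"


suspects = find_problem_words(
    "Eldorado Mall, Suites 26-27, Panama City", toy_parse, house_is_mall
)
print(suspects)  # ['Suites']
```

This covers only deletions. Changing a word (for example "Suites" to "Suite") would need a list of candidate replacements, which is left out here.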