openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

Suite/Apartment parsing is not correct #125

Closed migurski closed 7 years ago

migurski commented 7 years ago

@daguar and I have been experimenting with @straup’s new libpostal API and finding some weird stuff with unit numbers in U.S. addresses. In most cases, libpostal misinterprets unit numbers as house numbers, and groups terms like "suite" with the road name.

Here are some odd examples:

albarrentine commented 7 years ago

Hey Mike. Yes, apartment numbers were not part of the training data in the master version of libpostal (used by the web API), mainly because OSM addresses don't include much in the way of sub-building information.

There is a new model I've been working on that handles apartment number parsing quite well. The Pelias team has recently integrated an early version of that model into their work.

If you don't mind compiling libpostal locally, the model used by Pelias can be found at: https://libpostal.s3.amazonaws.com/mapzen_sample/parser_full.tar.gz. To use that (doesn't require switching branches or anything, it's the same model in master trained on new data), just unpack the contents of the tarball into $DATA_DIR/libpostal/address_parser where $DATA_DIR is whatever you passed in during configure, default is /usr/local/share.
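The unpacking step can be sketched in Python as below. This is only an illustration under assumptions: the tarball is assumed to be downloaded already, `install_parser_model` is a hypothetical helper name (not part of libpostal), and the thread does not say whether the archive has a top-level directory, so check the resulting layout.

```python
import tarfile
from pathlib import Path


def install_parser_model(tarball, data_dir="/usr/local/share"):
    """Unpack a parser tarball into <data_dir>/libpostal/address_parser.

    Hypothetical helper for illustration; data_dir should be whatever
    was passed to configure (the default is /usr/local/share).
    """
    target = Path(data_dir) / "libpostal" / "address_parser"
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tarball) as archive:
        # Only unpack archives you trust: extractall writes the paths
        # stored in the archive.
        archive.extractall(target)
    return target
```

Writing to the default `/usr/local/share` usually needs elevated permissions, so pass an explicit `data_dir` when testing.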

I discussed using the intermediate version with @straup. That's still possible for the web API using the aforementioned steps (can easily be written into a docker file or whatever).

The next release (when parser-data is ready to merge into master) will be able to parse sub-building information in residential, commercial, and university addresses in at least 35 languages.

migurski commented 7 years ago

Thanks Al, I’ll give this a try and see if it helps!

nsutcliffe commented 7 years ago

Hi Al,

We have been testing the intermediate version, and noticed that the parser can get confused between Number and Near when abbreviated to "Nr". For example, we are seeing that for the following address:

Greater Mumbai M Corp., A 204 2Nd Floor Madhav Kunj, Opp Sanskruti Bhuvan Nr Swimming Pool, M G Road,Maharashtra,400067,India

the parser understands the flat to be "Number Swimming".

albarrentine commented 7 years ago

@nsutcliffe yes, landmark-based addresses are not currently supported. Libpostal's parser can only parse addresses that are similar to the addresses it is trained on, which come from OpenStreetMap. In OSM, there are tags like addr:street, addr:house_number, addr:postcode, etc. which would allow us to reconstruct a tagged training example for most of the components in the above address like:

Greater/house Mumbai/house M/house Corp./house A/house_number 204/house_number M/road G/road Road/road Maharashtra/state 400067/postcode India/country
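To make the token/label notation above concrete, here is a small Python sketch that reads such a line and groups adjacent tokens sharing a label. It only interprets the notation used in this comment; the on-disk format of libpostal's training data is not described in this thread, and `parse_tagged` and `group_by_label` are hypothetical names.

```python
def parse_tagged(line):
    """Split 'token/label token/label ...' into (token, label) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the last slash, so a token that itself
        # contains '/' keeps everything before the final one.
        token, _, label = item.rpartition("/")
        pairs.append((token, label))
    return pairs


def group_by_label(pairs):
    """Join runs of adjacent tokens with the same label into one component."""
    components = []
    for token, label in pairs:
        if components and components[-1][0] == label:
            components[-1][1].append(token)
        else:
            components.append((label, [token]))
    return [(label, " ".join(tokens)) for label, tokens in components]
```

Applied to the example above, the grouped output contains one component each for house, house_number, road, state, postcode and country.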

OSM addresses rarely include things like addr:floor, sub-building information or directional information relative to landmarks. We generate phrases like "2nd Floor", etc. randomly using either a range of numbered/lettered floors based on the building's height or on average building heights, etc.
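As a rough illustration of the random generation described above (not libpostal's actual generator), a floor phrase could be drawn like this. The function names and the fallback of 5 levels are assumptions made for the example.

```python
import random


def ordinal(n):
    """Return an English ordinal such as 1st, 2nd, 3rd, 11th."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def random_floor_phrase(building_levels=None, average_levels=5, rng=random):
    """Pick a floor between 1 and the building height.

    Falls back to an assumed average height when the building's own
    number of levels is unknown.
    """
    top = building_levels or average_levels
    return f"{ordinal(rng.randint(1, top))} Floor"
```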

The remaining phrases in that address are more complicated. There was some discussion of the issues with landmarks like "Opp Sanskruti Bhuvan" in #103, and a few ideas for how those types of addresses could be added to OSM in India. Generating phrases like "Near Swimming Pool" randomly is more difficult than generating phrases like "2nd floor", because instead of just a random number/ordinal like "2nd", we'd have to generate a sensible random noun (what are all the things an apartment could be near?) Then there's the fact that the word "near" is also used for street intersections, so would want to include that as well.

In any case, I don't think this will be solved any time soon. There simply aren't enough fully-specified Indian addresses in OSM. If you have pre-parsed addresses (separated into fields/columns) with landmarks, etc. that can be contributed to libpostal in some way, I'm happy to use them for the parser, but without more data, it will always get portions of those addresses wrong.

Komzpa commented 7 years ago

@thatdatabaseguy Is there a list with labeled addresses that I can send a pull request against? I've got a set of examples that I can label manually, but I need a place to send them to and an example of how exactly to label them :)

Komzpa commented 7 years ago

Here are the examples that I'd like to use for training and validation:

197198,/postcode г./city Санкт-Петербург,/city ул./street Съезжинская/street д./house 10/house_number кв./flat 40/flat_number
188541,/postcode г./city Сосновый/city Бор/city Ленинградской/state_district области/state_district, пр./street Героев/street 40,/house_number кв./flat 400/flat_number

albarrentine commented 7 years ago

@Komzpa all the training data is generated from OpenStreetMap (+ OpenAddresses and GeoPlanet in the next release). There are currently no smaller training sets that are PR-able on Github, though it might be a good idea to have some test cases that don't necessarily need to pass to track parser accuracy over time as more data is added to OSM, etc. Most of the time the easiest way to improve the parser is to add addresses to OSM, add abbreviations that are missing in libpostal, or report the issue here as it might indicate a pattern.

As far as the labels:

  1. house is for venue names like the name of a bar or restaurant, not for the phrase preceding a house number, and there's not really separate tags for house number phrase vs. the actual number. So in a phrase like "д 10" or "дом 10", the "д" would be part of the house_number. Most of OSM in Russia doesn't use дом so in the newer training sets we add it at random.
  2. Oblasts are mapped to state rather than state_district.
  3. Similarly it seems like OSM editors rarely add the "г." (for город?) before city names. If the address contains an "addr:city" tag then sometimes people use it, but if we have to reverse geocode to admin polygons, it's definitely not. We might be able to handle that part on the libpostal side by randomly adding the prefix in the training data (similar to what we do with "дом"). Which countries use город? Just Russia? Are there any special rules for when it's used?
  4. There's no tag for "flat" or "flat_number". We use the more general term "unit" and again the phrase is not separated from the number. In the newer training sets we do handle phrases like "кв. 40" by randomly generating apartment numbers/letters and adding certain phrases or their abbreviations from the libpostal dictionaries.
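The labeling rules above mean a phrase and its number share one label, with the phrase added at random during data generation. A minimal sketch of that idea follows; the phrase lists and probabilities are made up for illustration, and `decorate` is a hypothetical helper, not a libpostal function.

```python
import random

# Illustrative phrase lists; libpostal draws these from its own dictionaries.
HOUSE_PHRASES = ["д.", "дом"]
UNIT_PHRASES = ["кв.", "квартира"]


def decorate(value, phrases, probability, rng=random):
    """With the given probability, prepend a randomly chosen phrase."""
    if rng.random() < probability:
        return f"{rng.choice(phrases)} {value}"
    return value


def tagged_example(house_number, unit, rng=random):
    """Build (text, label) pairs where phrase and number share one label."""
    return [
        (decorate(house_number, HOUSE_PHRASES, 0.5, rng), "house_number"),
        (decorate(unit, UNIT_PHRASES, 1.0, rng), "unit"),
    ]
```

So "дом 10" as a whole is tagged house_number and "кв. 40" as a whole is tagged unit.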

I've been testing several parsers on subsets of the most recent training data, which should roughly reflect the parser's performance in the next release. A version trained on approximately 20 million random addresses will get the first one correct:

> 197198, г. Санкт-Петербург, ул. Съезжинская д. 10 кв. 40

Result:

{
  "postcode": "197198",
  "city": "г. санкт-петербург",
  "road": "ул. съезжинская",
  "house_number": "д. 10",
  "unit": "кв. 40"
}

On the second address it doesn't do so well. The city appears to exist in OSM, as does that apartment building. OpenStreetMap has a slightly different name for Leningrad Oblast: ленинградская область instead of "Ленинградской области". Is that the locative case of the noun or something? Also looks like that postal code never comes up in libpostal's training data. The simplest way to determine if a postal code made it into the system is to look it up on OSM taginfo: http://taginfo.openstreetmap.org/tags/addr:postcode=188541 or http://taginfo.openstreetmap.org/tags/postal_code=188541 (the second one is less precise as postal_code is usually used on polygons to indicate that a whole city is covered by a given postal code or range of postal codes - if it's part of a range, an exact lookup will fail although it may still be in libpostal since we do parse ranges and use them in the place-only training set).

Unless a number is a known postcode in some context, it's difficult to disambiguate it from a house number in an international way, so if libpostal hasn't seen the postcode before, it will usually get it wrong. At present OSM contains ~11k Russian postcodes out of ~44k. It might help to use the GeoNames postal codes data set, which contains pretty much every Russian postal code, although the place names/admin codes are not linked, so those examples would simply look like 188541/postcode Россия/country, without cities, states, etc. That said, it may be sufficient to just have the number listed as a postcode a few times so that certain features fire in the machine learning model. Otherwise the parser has to rely on structural features.

Speaking of structure, is that the most common format for Russia? The default format libpostal uses looks more like: https://en.wikipedia.org/wiki/Address_(geography)#Russia. I was under the impression that the Russian post had moved to a specific-to-general format since the mid-aughts and that the general-to-specific form was more historical at this point.

Using the default format for Russia and some of the spelling changes above, it can get this version correct:

> пр. Героев 40, кв. 400 Сосновый Бор ленинградская область 188541

Result:

{
  "road": "пр. героев",
  "house_number": "40",
  "unit": "кв. 400",
  "city": "сосновый бор",
  "state": "ленинградская область",
  "postcode": "188541"
}

If the format you posted is reasonably common, I can add an alternate format that gets used some proportion of the time. Currently the strategy is to use the default format from address-formatting, adding sub-building information like unit, level, staircase, etc. according to our own config files. Then, with small random probabilities, we move certain components e.g. "1% of the time, move postcode before state", just so the parser won't get too confident about structures and will use more local features of the current token/phrase.
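The "move a component with small probability" strategy can be sketched as follows. This is an illustration of the idea only, not libpostal's actual template code; the function name is hypothetical and the 1% figure is just the example quoted above.

```python
import random

DEFAULT_ORDER = ["road", "house_number", "unit", "city", "state", "postcode"]


def maybe_move_before(components, moving, anchor, probability, rng=random):
    """With the given probability, move `moving` to just before `anchor`.

    Returns a new list; the input order is left untouched.
    """
    order = list(components)
    if moving in order and anchor in order and rng.random() < probability:
        order.remove(moving)
        order.insert(order.index(anchor), moving)
    return order


# "1% of the time, move postcode before state"
order = maybe_move_before(DEFAULT_ORDER, "postcode", "state", 0.01)
```

Because the perturbation is rare, the default format still dominates the training data, while the occasional reordering keeps the model from relying on position alone.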

So firstly, I'd recommend adding "addr:postcode" to that building and some others in the area in OSM (next time OSM planet is downloaded for ingestion that change will be reflected, though not right away). Adding the "г." to Russian (and other?) city names can be done on the libpostal side in the next build. So can the address structure changes. Not sure what to do about the oblast name, although I guess there are only a few of them, so if you want to provide a mapping from nominative to whatever-noun-case-is-used-above, I'll add those forms to the training data.

Komzpa commented 7 years ago

I'd say that both schemes are used in ex-USSR, with the older one (postcode-country-city street-house-flat) being more popular among older people, and the new one (street-house-flat postcode-city country) being more popular among younger people. The same goes for Surname-Name and Name-Surname ordering. Russian is really flexible about word order, which means there are addresses that aren't unambiguously parseable when you remove commas from them.

Basically, you can take each component, shuffle all words in it, then shuffle all components and still get an address a human will be able to read. And you can get an address in all possible forms, 'I live at Sosnovyi Bor, Geroev, 400' - 'Я живу в Сосновом Бору на проспекте Героев в доме 400'.

Postcodes are six-digit in ex-USSR, so if you see a six-digit number and language is from ex-USSR chances that it is a postcode are extremely high. First three digits are city number (or other administrative region when all you have is a bunch of villages), last three are local post office number. Last three can be zeros if sender doesn't know exactly - postcode was (and still is, machines just got better at reading handwriting) machine-readable in USSR and written in predefined spot on envelope.
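The six-digit structure described above is easy to express in code. A minimal sketch, assuming the split is exactly three digits plus three digits as stated; `split_postcode` is a hypothetical helper name.

```python
import re

SIX_DIGITS = re.compile(r"[0-9]{6}")


def split_postcode(token):
    """Return (region_prefix, post_office) for a six-digit token, else None."""
    if not SIX_DIGITS.fullmatch(token):
        return None
    return token[:3], token[3:]
```

For example, "197198" splits into the prefix "197" and the local post office "198", while a shorter number such as "40" is rejected.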

For NLP in Russian there's a library pymorphy2.

import pymorphy2
morph = pymorphy2.MorphAnalyzer()
b = morph.parse(u'cосновый')[0]
In [17]: print(b.inflect({'gent'}).word)
cоснового

It's useful in two ways - first, it may help you understand what this word is about:

In [23]: print(morph.parse(u'ленинградской')[0].tag)
ADJF femn,sing,gent
In [24]: print(morph.parse(u'области')[0].tag)
NOUN,inan,femn sing,loct

Second, it can help generating datasets - you can randomly change forms and it will give you correct spelling :)

Stripping the last one, two, or three letters can give you tokens that are more useful and carry less information about the word form.
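The suffix-stripping idea can be sketched as a tiny feature helper. The name `stem_candidates` and the minimum-length cutoff are assumptions for the example, not something from pymorphy2 or libpostal.

```python
def stem_candidates(word, max_strip=3, min_length=3):
    """Return the word with its last 1..max_strip letters removed.

    Candidates shorter than min_length are dropped, since very short
    stems carry little information.
    """
    word = word.lower()
    return [
        word[:-n]
        for n in range(1, max_strip + 1)
        if len(word) - n >= min_length
    ]
```

With this, "ленинградской" and "ленинградская" both yield the shared candidate "ленинградск", so the two case forms can fire the same feature.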

Back in 2014 I wrote a rule-based address tagger for a data set of offices of companies that will soon have a government tax review. (The data was acquired by NextGIS as part of https://github.com/nextgis/skoroproverka).

Government dataset: addr-norep-nocount.txt.gz

Tagger that worked reasonably well for me: https://github.com/Komzpa/addresstagger/blob/master/tokenize-address.py (have a look at the large "if token in" section)

There is also StreetMangler project that aims at aligning street names in Russian to "natural word order". https://github.com/AMDmi3/streetmangler - it maintains a list of normalized street names and a statistical matcher.

Hope this helps :)

albarrentine commented 7 years ago

Right on. So, have added the reversed forms in all the post-Soviet states (kicks in about 20% of the time in most countries). Probably won't shuffle literally all the words around :-), but some of the components anyway. The goal is for libpostal to be able to handle the vast majority of the common patterns geocoder users are likely to type (years of using Google Maps, etc. have probably tempered our expectations a bit). Of course libpostal will probably never be able to recognize things as well as a human postal worker can in a given country/language, but the larger point is that it's international and works reasonably well across many languages.

pymorphy2 is awesome, thanks! Libpostal's training data generation is mostly Python, so it was simple to add and it's now being used to generate locatives for state and state_district in Russian and Ukrainian. Already seeing those forms pop up in the training data, so that should be in the next release.

On postcodes, libpostal should learn something about 6-digit numbers in certain contexts (they're uncommon as anything but postal codes in the rest of the world as well), but it's much better if the postal code occurs in the training data. It's possible that the structural changes will help even for unknown postcodes (gets the one above simply on structural cues), but also not hard to add a feature to the model like "script=Cyrillic and word=DDDDDD" which can linearly separate postal codes in post-Soviet states from, say, 6 digit house numbers in the US without having to rely on specific surrounding words which may be sparse.
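A feature of the kind mentioned above could look roughly like this in Python. It is only a sketch of the idea: libpostal's real feature extraction is not shown in this thread, and the function names and the exact feature string are assumptions.

```python
import unicodedata


def word_shape(token):
    """Replace every digit with 'D' so all six-digit numbers look alike."""
    return "".join("D" if ch.isdigit() else ch for ch in token)


def has_cyrillic(text):
    """True if any character's Unicode name marks it as Cyrillic."""
    return any("CYRILLIC" in unicodedata.name(ch, "") for ch in text)


def postcode_feature(token, context):
    """Combine the script of the surrounding text with the token's shape."""
    script = "Cyrillic" if has_cyrillic(context) else "other"
    return f"script={script}|word={word_shape(token)}"
```

A six-digit token in a Cyrillic address and one in a Latin-script address then produce different feature strings, which lets a linear model weight them separately.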

Komzpa commented 7 years ago

I'm really sorry for misguiding you - the case you were looking for is not locative, but genitive: https://github.com/openvenues/libpostal/commit/6f009fb8a68566ca9c3bdff87bc1c785af5803af#commitcomment-20315038

Looks like to convert something to genitive you just swap any two adjacent address components in general-to-specific hierarchy and make the larger one genitive.

"Российская Федерация, Ленинградская область" -> "Ленинградская область Российской Федерации" (country:oblast)
"Ленинградская Область, город Сосновый Бор" -> "город Сосновый Бор Ленинградской области" (oblast:city)
"город Санкт-Петербург, улица Съезжинская" -> "Съезжинская улица города Санкт-Петербурга" (city:street, unambiguous one)
"город Минск, улица Победы" -> "улица Победы города Минск" (city:street, here it becomes ambiguous, you can't split this back in a single way and can't put a comma - but given enough context it can be understood.)
"улица Победы, дом 15" -> "дом 15 улицы Победы" (street:housenumber)
"дом 15, квартира 43" -> "квартира 43 дома 15" (housenumber:flat)

Another tricky thing is that it nests. My village in Belarus: "Республика Беларусь, Минская область, Червенский район, деревня Ялча, дом 54" -> "дом 54 деревни Ялча Червенского района Минской области Республики Беларусь". (That's basically the reason why this ordering became popular only lately - it already exists in natural language, but in the genitive case, which is not trivially machine-readable.)

Looking at the rest - really cool! Looking forward to the next release :)

albarrentine commented 7 years ago

No worries. Latin class is the last time I thought about noun declensions. Locative seemed to make sense ("in the Leningrad Oblast"), but genitive it is.

coachwei commented 7 years ago

just wondering if the capability to parse "suite/apartment" info that was raised here in this thread has been released into the current code base? Based on my testing, the current code base still has the same problem:

https://libpostal.mapzen.com/parse?address=123+main+street+suite+456+oakland+ca+94789&format=keys

{ "city": [ "oakland" ], "house_number": [ "123" ], "postcode": [ "94789" ], "road": [ "main street suite 456" ], "state": [ "ca" ] }

albarrentine commented 7 years ago

@migurski and @coachwei - libpostal 1.0 was released earlier today and features a better model trained on secondary units among other things. Parses the examples above flawlessly, including several variations:

> 123 Main St Apt 456 Oakland CA 94789

Result:

{
  "house_number": "123",
  "road": "main st",
  "unit": "apt 456",
  "city": "oakland",
  "state": "ca",
  "postcode": "94789"
}

> 123 Main St Suite 456 Oakland CA 94789

Result:

{
  "house_number": "123",
  "road": "main st",
  "unit": "suite 456",
  "city": "oakland",
  "state": "ca",
  "postcode": "94789"
}

> 123 Main St Ste 456 Oakland CA 94789

Result:

{
  "house_number": "123",
  "road": "main st",
  "unit": "ste 456",
  "city": "oakland",
  "state": "ca",
  "postcode": "94789"
}

> 123 Main St Ste #456 Oakland CA 94789

Result:

{
  "house_number": "123",
  "road": "main st",
  "unit": "ste #456",
  "city": "oakland",
  "state": "ca",
  "postcode": "94789"
}

> 123 Main St #456 Oakland CA 94789

Result:

{
  "house_number": "123",
  "road": "main st",
  "unit": "#456",
  "city": "oakland",
  "state": "ca",
  "postcode": "94789"
}

Pull latest and run bootstrap/configure/make/make install to pick up the changes. May want to delete your datadir first to clean up directories that have been removed from the previous release.
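To reproduce results like the ones above locally, one option is the pypostal Python bindings, whose `parse_address` returns a list of (value, label) pairs. The sketch below collapses such pairs into the label/value form shown in this comment. `components_to_dict` is a hypothetical helper, and the final call only runs if libpostal and pypostal are installed.

```python
def components_to_dict(parsed):
    """Collapse (value, label) pairs into a {label: value} dict."""
    result = {}
    for value, label in parsed:
        # Join repeated labels instead of overwriting earlier values.
        result[label] = f"{result[label]} {value}" if label in result else value
    return result


if __name__ == "__main__":
    try:
        # Requires a local libpostal build plus the pypostal package.
        from postal.parser import parse_address
    except ImportError:
        parse_address = None
    if parse_address is not None:
        address = "123 Main St Apt 456 Oakland CA 94789"
        print(components_to_dict(parse_address(address)))
```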

As far as the Mapzen libpostal API, that's still on the old version at the moment, but I told Aaron about the release today and he said he'll upgrade the API servers soon.

@Komzpa - the Russian and Ukrainian morphology changes are in this release. So are the additions of "дом" to house numbers, parsing units like "кв. 40", and adding "г." to city names. Most of the addresses mentioned should work great:

> Республика Беларусь, Минская область, Червенский район, деревня Ялча, дом 54

Result:

{
  "country": "республика беларусь",
  "state": "минская область",
  "state_district": "червенский район",
  "city": "деревня ялча",
  "house_number": "дом 54"
}

> 197198, г. Санкт-Петербург, ул. Съезжинская д. 10 кв. 40

Result:

{
  "postcode": "197198",
  "city": "г. санкт-петербург",
  "road": "ул. съезжинская",
  "house_number": "д. 10",
  "unit": "кв. 40"
}

Also, OpenAddresses now contains almost a million addresses from Belarus (Brest, Grodno, Mogilev and Vitebsk), a few sources for Kazakhstan, and addresses in St. Petersburg as well as Moscow using the дом+корпус+строение formats. Libpostal 1.0 ingested and was trained on all of this.

There is one small issue when state/country are included in the reversed format. Realized I misread your suggestion above when building the latest training data. Thought it was fully reversed i.e. postcode->country->state->city instead of postcode->city->state->country (I think Hungary uses that format as well). So libpostal 1.0 should work well for the case where there's only city and the address is written in the reverse format, but might get a few things wrong if the state and country are written in as well. That's a pretty easy fix, just need to revise the templates for the next batch.

Think I'll close out the apartment/suite parsing issue and then can open up new ones for issues with Russian/Belarusian/Ukrainian parsing.

coachwei commented 7 years ago

Nice work @thatdatabaseguy, confirmed that this issue is fixed in my testing. Thanks.

jqnatividad commented 6 years ago

Hi @albarrentine, been using libpostal in OpenRefine and it's great!

I'm currently working with a housing advocacy project in Brooklyn to protect affordable housing stock in NYC, and we got a lot of housing data with apartment numbers.

The data is fairly clean, and as you might expect, there are a lot of permutations for apartment number.

One pattern is giving libpostal problems:

> 123 Main St, 456, Oakland CA 94789

Result:

{
  "house_number": "123",
  "road": "main st 456",
  "city": "oakland",
  "state": "ca",
  "postcode": "94789"
}

> 123 Main St, 4c, Oakland CA 94789

Result:

{
  "house_number": "123",
  "road": "main st 4c",
  "city": "oakland",
  "state": "ca",
  "postcode": "94789"
}

> 1032 Main Street, 02E, Bronx, NY, 10459

Result:

{
  "house_number": "1032",
  "road": "main street",
  "city_district": "02e bronx",
  "state": "ny",
  "postcode": "10459"
}

I was really counting on libpostal to extract the unit numbers, which it does very well for patterns where you have prefixes like "Apt", "Unit", "#", etc., but it doesn't work for the patterns above, where an undecorated unit number sits between "road" and "city".

albarrentine commented 6 years ago

@jqnatividad a very worthwhile effort ✊. As someone who grew up close to the poverty line in low income communities of color, and a current resident of Crown Heights in Brooklyn, which is facing an unprecedented affordable housing crisis, I thank you.

The issue with secondary unit numbers without a phrase has also come up occasionally in some of my recent voting rights work, where voter file addresses may be concatenated by machine or transcribed and have no preceding unit phrase. It's been relatively rare in my data sets, but is on my radar nonetheless.

As mentioned above, most of the secondary unit information in libpostal is generated randomly, and always with preceding phrases like "#" or "Apt". This was primarily to prevent introducing certain systematic sources of error into the model. To illustrate, let's say we generate three basic types of units: numeric only e.g. "12", single letters like "A", and combinations of the two like "12A" or "A12." In the latter case of the combined number/letter, it should be pretty easy for the model to tell that it's a unit number when following a street type (in English anyway, there are lots of different formats in other languages). However, if it were a single letter and that letter happened to be "E", "N", "S", or "W", then the concatenated unit would be virtually indistinguishable from a post-directional i.e. "123 Main St E" could be either "Unit E" at "123 Main St" or shorthand for "123 Main Street East", and libpostal would effectively be training on both answers with no real way to disambiguate. Though less common, the same goes for numeric-only units as you can find examples where "Road 12" is the entire street name, and it may be difficult for the model to disambiguate between "12 Road 12" and "12 Main Road 12" (or doing so might inflate the size of the model, which is already large).
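The three unit types and their ambiguities can be summarized in a short Python sketch. This is only an illustration of the reasoning in the paragraph above, for English-style addresses; `classify_bare_unit` is a hypothetical name and nothing like it is claimed to exist in libpostal.

```python
import re

# Single letters that double as post-directionals (East, North, South, West).
DIRECTIONALS = {"E", "N", "S", "W"}


def classify_bare_unit(token):
    """Classify an undecorated unit token and flag the ambiguous cases."""
    token = token.upper()
    if token in DIRECTIONALS:
        return "letter (ambiguous with post-directional)"
    if re.fullmatch(r"[A-Z]", token):
        return "letter"
    if re.fullmatch(r"[0-9]+", token):
        return "numeric (ambiguous with numbered road)"
    if re.fullmatch(r"[0-9]+[A-Z]+|[A-Z]+[0-9]+", token):
        return "combined"
    return "other"
```

Under this classification, tokens like "4c" and "02E" fall into the combined category that the paragraph above expects to be easiest for the model, while "456" and "E" land in the two ambiguous ones.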

That said, the majority of these cases should be solved when the v1.1 release is available, which has been on the backburner in order to finish the address deduping release as part of the lieu deduping/batch geocoding project.

Is the housing advocacy work public in some way? That's something I would personally contribute to if it were. If you want to email me the details it's [first name][last name]@gmail.com. Can probably train something custom for it in the shorter term.

jqnatividad commented 6 years ago

@albarrentine and your work on libpostal is much needed as well and to be lauded 👏.

FYI, I've been using libpostal to normalize the addresses, before passing it to NYC's Geoclient geocoder.

I'm using the libpostal-rest-docker and geoclient-docker and it's working great, geocoding hundreds of thousands of NYC addresses locally without having to deal with throttling limits of the public Geoclient API.

The last thing I have to deal with is being able to tease out apartment numbers and would love to take you up on training libpostal with the dataset I'm using. It's not public though given privacy concerns, so I'll contact you directly if that's OK.