openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.

Using field separators to increase parsing accuracy #255

Open jamiehutton opened 6 years ago

jamiehutton commented 6 years ago

Hi there,

We have been using libpostal with great success over the last year - it's a great library.

One thing we think would make parsing more accurate is if the library were able to understand separators within the raw data. In most cases people use commas to separate different parts of the address. Take the following example:

"100 Queens Road Central, Hong Kong" and "100 Queens Road, Central Hong Kong"

By reading these it is obvious that the road name should end at the comma. Currently, libpostal strips these characters out and ends up parsing:

100 Queens Road Central Hong Kong

This makes it impossible for the parser to know whether "Central" belongs to Hong Kong or to the road name. A human can't tell, so neither can the machine.
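To make the ambiguity concrete, here is a minimal sketch using the Python bindings (pypostal is assumed here; the same applies through the C API). Since the commas are stripped, both variants reduce to the same input and there is nothing left for the parser to distinguish them by:

```python
# Minimal illustration using the Python bindings (pypostal); assumes
# libpostal and the bindings are installed. As described above, both
# comma placements are normalized to the same token sequence, so both
# inputs look identical to the parser.
from postal.parser import parse_address

for addr in ("100 Queens Road Central, Hong Kong",
             "100 Queens Road, Central Hong Kong"):
    print(addr)
    print(parse_address(addr))  # list of (value, label) tuples
```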

I am guessing this would be a fairly big change, but my view is it would make a massive difference to the quality of the parsing.

For reference, I have put a few examples below of raw address strings and their parsed results in libpostal. All of these examples would have benefited from using the commas as separators.

Is this something you have considered before?

48-56 TAI LIN PAI ROAD, KWAI CHUNG, Hong Kong
{ "house_number": "48-56", "road": "tai lin pai road kwai", "city": "chung", "country": "hong kong" }

66-82 CHAI WAN KOK STREET, GOLDEN BEAR INDUSTRIAL CENTRE, TSUEN WAN, Hong Kong
{ "house_number": "66-82", "road": "chai wan kok street golden bear industrial centre", "city": "tsuen wan", "country": "hong kong" }

SALTILLO 205, RODRIGUEZ, REYNOSA, 88630, Mexico
{ "road": "saltillo 205 rodriguez reynosa", "postcode": "88630", "country": "mexico" }

BOSQUES DE DURAZNOS 187 PLANTA ALTA, BOSQUE DE LAS LOMAS, MIGUEL HIDALGO, 11700, Mexico
{ "road": "bosques de duraznos", "house_number": "187", "road": "planta alta bosque de las lomas miguel hidalgo", "postcode": "11700", "country": "mexico" }

BEAVER HOUSE, PLOUGH ROAD, GREAT BENTLEY, COLCHESTER, ESSEX, CO7 8LG, United Kingdom
{ "house": "beaver house", "road": "plough road", "city": "great bentley colchester", "state_district": "essex", "postcode": "co7 8lg", "country": "united kingdom" }

3 FIRST AMERICAN WAY, SANTA ANA, 92707, United States of America
{ "house_number": "3", "road": "first american way santa ana", "postcode": "92707", "country": "united states of america" }

albarrentine commented 6 years ago

Hey @jamiehutton. Yes, I have considered it, and I think I generally agree that commas are useful. The challenge is that people use libpostal in many different ways. Your use case, for instance, seems to be parsing fully-formed addresses retrieved from the web or some other pre-existing source. Those cases do tend to use commas (in some languages, anyway; this doesn't apply to Chinese, Japanese, Korean, Thai, etc.). However, there is also a substantial contingent of people who use libpostal for geocoding, and in that case we can't rely on commas: some users type them, some don't. As such, it's a requirement that libpostal handle input without commas as well, and any solution would have to accommodate that case. One thing I have been open to is training on examples randomly generated with or without commas.
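As a rough sketch of that last idea (hypothetical code, not libpostal's actual training pipeline): each formatted training address would keep or drop its commas at random, so the model sees both styles of input.

```python
import random

def randomize_separators(formatted_address, keep_comma_prob=0.5):
    """Hypothetical training-data step: randomly keep or drop the commas
    in a formatted address so the parser learns to handle both
    comma-separated and comma-free input."""
    if random.random() < keep_comma_prob:
        return formatted_address
    # Drop the commas and collapse any doubled whitespace left behind.
    return " ".join(formatted_address.replace(",", " ").split())

# The same source address can yield either training form.
print(randomize_separators("100 Queens Road Central, Hong Kong"))
```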

Most of the time if libpostal gets a toponym/place name wrong, it's not really commas that are the problem. With the exception of "100 Queens Road Central, Hong Kong", the above cases all relate to other issues that should hopefully be fixed with the v1.1 release (no promises though).

  1. 48-56 TAI LIN PAI ROAD, KWAI CHUNG, Hong Kong: Kwai Chung is a point-based city near the coastline, and there was an issue with point-in-polygon tests (see #203 and https://github.com/openvenues/libpostal/issues/27#issuecomment-292740044) which has been fixed for the 1.1 training data.
  2. 66-82 CHAI WAN KOK STREET, GOLDEN BEAR INDUSTRIAL CENTRE, TSUEN WAN, Hong Kong: This is an example of a named subdivision; similar examples can be found in India, Jamaica, and the UK, among others. In 1.0, we reverse-geocoded to OpenStreetMap's landuse polygons (which is what something like "Golden Bear Industrial Centre" would likely be in OSM) but only used them to get zoning information (residential, commercial, industrial, all those SimCity words). This let us generate appropriate unit types, e.g. "Apt" in residential areas and "Ofc" in commercial areas, with a frequency roughly consistent with how things are actually zoned. In 1.1 I'm adding named subdivisions as a "pseudo-component", which means the subdivision will be aliased to and parsed as house but will often appear in different places in the formatted address, such as after street rather than at the beginning of the address the way a venue/company name typically would. This will make a transition from road to house more likely in this entire class of parses.
  3. SALTILLO 205, RODRIGUEZ, REYNOSA, 88630, Mexico and BOSQUES DE DURAZNOS 187 PLANTA ALTA, BOSQUE DE LAS LOMAS, MIGUEL HIDALGO, 11700, Mexico: In cities in Mexico, it's important to list the colonia (which can be either a neighborhood or a subdivision, depending on the type). Unfortunately, we haven't had comprehensive or consistent data on colonias in any of our Mexico data sets. OSM has some, not all, and they're labeled differently in different regions. One of the source data sets from OpenAddresses has colonias listed and is reasonably comprehensive (it's a country data set of about 4.9M addresses from INEGI, the Mexican census org, and most records have a colonia type/name field). To extract consistent, high-quality training addresses from it, a little more preprocessing was needed than can be done in OpenAddresses, so I've opted to move that to the libpostal side, where there's more flexibility. This should improve parsing overall in Mexico. Also, "Planta Alta" wasn't one of the floor expressions we generated for Spanish (for levels/units, etc. we have to generate them randomly since there's no data in OSM, but there are usually only a few distinct patterns per language). I'll add it to the config.
  4. BEAVER HOUSE, PLOUGH ROAD, GREAT BENTLEY, COLCHESTER, ESSEX, CO7 8LG, United Kingdom: This is an example of the UK "locality + postal town" pattern; see #244 and #165. We're handling this in 1.1 by making another pseudo-component called "locality", which allows more than one city name to be used so the parser learns that two cities in a row is not uncommon in some cases. In the above, Great Bentley is a village or civil parish (both of which map to city in libpostal), so this is in fact the correct parse at the token level. What I can do in 1.1 is either preserve commas in the output when they occur between two tokens with the same tag, which would have the effect of labeling the city "great bentley, colchester", or require that if two discrete known phrases occur side by side with the same tag, the parser simply creates two different strings with the same tag in the output.
  5. 3 FIRST AMERICAN WAY, SANTA ANA, 92707, United States of America: This probably has to do with some of the population rules in v1.0. Note that if you add the state, libpostal parses this address correctly. Virtually all addresses in the US include the state unless the city is large and unambiguous within the US, e.g. Los Angeles, so there are a few population thresholds at which we add the state with different probabilities (sketched below, after this list). The idea is that we don't want every random small town to be listed without a state in most cases, because the names are often ambiguous: the most likely meaning of "Brooklyn" is the borough or city_district in New York, but there are also many other Brooklyns that can be city, suburb, etc., and we don't really want those cases to stand on their own, otherwise it needlessly confuses the parser. So we use population thresholds. For US cities with < 10k population, the state is required 100% of the time; for cities with 10k-100k population it's required 90% of the time; and it's only required 80% of the time for cities with >= 100k population. The problem is that the population data in OSM is incomplete, so even though Santa Ana has 300k people and would otherwise be listable without a state, there's no population data for it in OSM. The state is therefore required, and thus all the examples in the training data are "Santa Ana, CA"; without that token, it probably looks like a Spanish road name. One idea that might fix cases like this is to include the state less often when a postcode is present, regardless of population.
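Here is the population-threshold logic from point 5 as a small sketch (hypothetical code using the probabilities quoted above, not the actual training-data generator):

```python
import random

def include_state(city_population=None):
    """Hypothetical sketch of the thresholds described in point 5: the
    smaller the city, the more often the state is included alongside it
    in generated training addresses."""
    if city_population is None:
        # No population data in OSM (the Santa Ana case): treated like a
        # small city, so the state is always required.
        return True
    if city_population < 10_000:
        return True                   # required 100% of the time
    if city_population < 100_000:
        return random.random() < 0.9  # required 90% of the time
    return random.random() < 0.8      # required 80% of the time
```

Under that scheme, a large city with missing population data behaves like a small one, which is why every training example for Santa Ana ends up with "CA" attached.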

In the "100 Queens Road Central, Hong Kong" case, while I don't think it's wise to add commas to the parser's actual structure (i.e. as a label that it transitions to and from, otherwise it will rely on them too much), what may be possible is to make phrase search respect commas, such that a discrete phrase can't cross a comma boundary unless there's a legitimate comma in the name, as with some of the ISO country names, e.g. "Congo, Democratic Republic of". The tricky part would be cases like "Boulder, CO" vs. "Boulder Co" (a rare abbreviation for Boulder County). Here, we don't want libpostal's trie-based phrase search to "overeat" tokens such that "Boulder, CO" is always treated as two discrete phrases (there's a comma in the default US address format) and "Boulder Co" is always treated as a single phrase. That would cause different features to be used in the model when a user types "Boulder, CO" than when they type "Boulder CO" without the comma. The way to handle that would be to use commas but probabilistically ignore them some of the time during training.
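A sketch of what comma-aware phrase search could look like (hypothetical code, not libpostal's trie implementation): multi-token phrases may only be matched within a comma-delimited span, and during training the comma boundary would be ignored some fraction of the time.

```python
import random

def phrase_spans(tokens, ignore_comma_prob=0.0):
    """Hypothetical sketch: split a token sequence into spans that a
    multi-token phrase match may not cross. Each comma acts as a hard
    boundary unless it is probabilistically ignored (as it would be
    some of the time during training)."""
    spans, current = [], []
    for tok in tokens:
        if tok == "," and random.random() >= ignore_comma_prob:
            if current:
                spans.append(current)
            current = []
        elif tok != ",":
            current.append(tok)
        # an ignored comma is simply dropped, so a phrase may cross it
    if current:
        spans.append(current)
    return spans

print(phrase_spans(["boulder", ",", "co"]))  # [['boulder'], ['co']]
print(phrase_spans(["boulder", "co"]))       # [['boulder', 'co']]
```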

That said, if you know a priori that most of your data has commas (from the above it seems to be mostly Anglicized/Latin-alphabet), you can also try splitting the address at the first comma (usually the street in the above examples) and parsing it in two parts. That method would correctly parse the Queens Road Central case.
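With the Python bindings (pypostal, assumed here) that workaround could look something like this; it is only a sketch, and it assumes the first comma really does terminate the street:

```python
from postal.parser import parse_address

def parse_split_at_first_comma(address):
    """Sketch of the split-at-first-comma workaround: parse the part
    before the first comma and the remainder separately, then combine
    the labeled components."""
    head, sep, tail = address.partition(",")
    if not sep:  # no comma present: fall back to a normal parse
        return parse_address(address)
    return parse_address(head.strip()) + parse_address(tail.strip())

print(parse_split_at_first_comma("100 Queens Road Central, Hong Kong"))
```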