openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

Could you explain more specifically how to train the parsing model? #190

Closed: cottonty closed this issue 7 years ago

cottonty commented 7 years ago

Hi, I want to parse Chinese addresses with libpostal, but the Chinese training data provided is not sufficient, so I want to train a new model with my own data. I have about 5G of raw data, so adding it to OSM does not seem realistic.

As I see in the training data you provided, there are several files of different types or from different sources, but if I understand correctly, address_parser_train can only take one file as input and outputs a single model. How do you combine all the training data? Must I separate my data into the form [addresses, places, ways] to train the model? And if I add patterns to the training data, do I need to modify other files to make it work?

Also, if I modify files in resources/dictionaries, is it right that I need to run build_dictionary for the changes to take effect? Is there anything else to be done?

Besides the above, is there anything else I need to do to customize the parser?

I would be very grateful if you could explain from start to end how to train the model.

albarrentine commented 7 years ago

Hey @cottonty, just curious, what's insufficient about the Chinese addresses in our training data? Is it the raw number of addresses from OSM, etc. or is there some pattern we don't handle?

I've written about this in other issues, but in general I don't encourage training custom parsers, especially on proprietary data, so it's intentionally not documented outside of source comments. The goal of libpostal is to collaboratively solve natural language addresses in one place.

It would be great to add more data from mainland China to the training set. Is the data set you're using publicly available somewhere? Happy to add it to the next training build as long as it's not too restrictively-licensed. I suppose even if it is, we could make a private S3 bucket as well if people want to contribute data that isn't openly-licensed. Parametric machine learning models can be used under a more open license than the data they're trained on as long as the data can't be reconstructed from the model.

Also possibly of interest: there have been a number of pull requests recently in OpenAddresses for Taiwan, which should at least increase language coverage in the next build (although I guess they use Han Traditional more in Taiwan).

As far as additions to resources/dictionaries, just make a pull request and libpostal's CI will build the relevant files and publish them as soon as it's merged.

cottonty commented 7 years ago

Hi. Thank you for the reply. There are several problems with the training data for China:

  1. I have extracted all training data marked 'zh' and it's only about 200m.
  2. I inspected the data and found that it only contains part of the data from some big cities, and much of it is phonetic or in languages other than Chinese and Traditional Chinese, which is of almost no use.
  3. Even this part of the data is wrongly labeled. For example, data labeled "state" actually contains "city, city-district, county, state-district" and so on.
  4. The labels defined in libpostal can't cover all the cases for Chinese addresses. 1) It is very common for an address to name an organization located in a building or neighborhood. For example: Beijing, Chaoyang District, Beisihuandong Road, Qianhejiayuan (which is a neighborhood), Building No. 1, Room 2003, Jihaizongheng (which is a company). I think it's better to separate these two kinds instead of using a single label 'house'. 2) Because China is a developing country with poor city planning, it is very hard to find a place that can be described with only basic address elements like road and house_number. So people are used to describing an address with "near somewhere" / "300 meters south of somewhere" / "on the opposite side of somewhere" / "inside somewhere" ... So I wonder whether it is possible to add some other labels to indicate whether a component is a real address or just descriptive information.
  5. When people fill in an address form, they very commonly re-enter the state, city and district in the address field even if they have already filled them in the state, city and district fields. But there are also cases where the name of the organization, company or other point of interest itself contains the name of the state and/or city and/or district and/or town and/or village and so on. And people like to omit quantifiers like City, District, Town, or to omit the administrative division completely or partly. So an address whose complete form is "Beijing City, Chaoyang District, Beijing City Chaoyang Language School" can be written "Beijing, Chaoyang, Beijing Chaoyang Language School" or "Beijing City Chaoyang Language School" or "Beijing, Beijing Chaoyang Language School" or "Beijing Chaoyang Language School" and so on. And of course there are no commas in Chinese to separate the components. So I guess it's very hard to tell them apart without a dictionary and some pruning of redundant parts.
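To make item 5 concrete, here is a minimal Python sketch (not libpostal code) that enumerates some of the surface forms a single address can take when each administrative component is written in full, written without its quantifier, or dropped. The quantifier list and the function names are hypothetical, the English glosses stand in for the Chinese text, and the venue name is kept intact here even though in practice it is often shortened too.

```python
from itertools import product

# Hypothetical list of administrative quantifiers (English glosses).
QUANTIFIERS = ("City", "District", "Town")


def strip_quantifier(component):
    """Drop a trailing quantifier, e.g. 'Beijing City' -> 'Beijing'."""
    for quantifier in QUANTIFIERS:
        suffix = " " + quantifier
        if component.endswith(suffix):
            return component[: -len(suffix)]
    return component


def surface_forms(admin_parts, venue):
    """Enumerate ways the administrative prefix can be shortened or dropped.

    Each administrative component is either written in full, written
    without its quantifier, or omitted; the venue name is always kept.
    """
    choices = [(part, strip_quantifier(part), None) for part in admin_parts]
    forms = set()
    for combo in product(*choices):
        kept = [c for c in combo if c is not None]
        forms.add(", ".join(kept + [venue]))
    return sorted(forms)


forms = surface_forms(
    ["Beijing City", "Chaoyang District"],
    "Beijing City Chaoyang Language School",
)
for form in forms:
    print(form)
```

Even with only two administrative components this yields nine distinct strings for one place, and real Chinese input has no commas between the parts.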

Actually, because of the censorship in China, OSM data for China is of almost no use: https://en.wikipedia.org/wiki/Restrictions_on_geographic_data_in_China

After some experimenting, I want to know: if I add some labels myself, will the dictionaries still work correctly? I have tried adding entries to the dictionary but haven't seen any difference. Is the dictionary used only for expansion, or for normalization as well?

I also want to know whether it is even possible to use libpostal and my own data to train a model that solves the problems mentioned above. Even if it is not possible for now, could you tell me how to customize every part of the parser so that I can experiment myself?

As for my data, I'm afraid it is not openly licensed. If I can convince my boss to use libpostal, maybe we can provide some data to improve it. My data also has some of the problems mentioned above, like non-standard addresses and lots of descriptive information. I don't know if that is fine to use.

albarrentine commented 7 years ago

For any boundaries that are mislabeled, there's a config here: https://github.com/openvenues/libpostal/blob/master/resources/boundaries/osm/cn.yaml where we map properties in OSM relations to the labels in libpostal. There can be individual exceptions for specific OSM relation IDs and "contained by" exceptions (for example, admin_level=6 is usually a state_district except in cities like Beijing where it's a city_district). I did the China mappings based on what I could find in Wikipedia, where provinces map to "state", province-level municipalities map to "city", etc. If there's something wrong with those mappings, feel free to send a pull request. It's possible to examine the tags in OSM using the Overpass API, for instance to see all boundaries where admin_level=4: https://overpass-turbo.eu/?key=admin_level&value=4&template=key-value
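As a rough illustration of how such a mapping with exceptions can be resolved, here is a minimal Python sketch. It does not reproduce the real schema or contents of cn.yaml: the dictionaries, the relation IDs and the `resolve_label` function are all made up for the example, and it assumes provinces sit at admin_level=4.

```python
# Hypothetical data, loosely modelled on the description above. This is
# NOT the real schema or content of resources/boundaries/osm/cn.yaml.

# Default mapping from OSM admin_level to a libpostal label.
# Assumption: provinces are tagged admin_level=4.
DEFAULT_LABELS = {4: "state", 6: "state_district"}

# Exceptions for individual OSM relation IDs (1001 is a made-up ID
# standing in for a province-level municipality such as Beijing).
ID_OVERRIDES = {1001: "city"}

# "Contained by" exceptions: (containing relation ID, admin_level) -> label.
CONTAINED_BY_OVERRIDES = {(1001, 6): "city_district"}


def resolve_label(relation_id, admin_level, containers=()):
    """Pick a label: ID exception first, then containment, then default."""
    if relation_id in ID_OVERRIDES:
        return ID_OVERRIDES[relation_id]
    for container_id in containers:
        label = CONTAINED_BY_OVERRIDES.get((container_id, admin_level))
        if label is not None:
            return label
    return DEFAULT_LABELS.get(admin_level)


# A level-6 boundary inside the municipality vs. one elsewhere.
print(resolve_label(2002, 6, containers=(1001,)))
print(resolve_label(3003, 6))
```

The order of precedence (specific ID, then container, then default) is a design choice of this sketch, chosen so that the most specific rule wins.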

We do have a tag for neighborhoods and other informal place names called "suburb", though I'm not sure if we have any neighborhood data for China. It might be OK to add neighborhoods to OSM.

As far as the building vs. venue/company/POI issue (it's common in places outside of China as well, like the UK, India, etc.), having a separate tag for named buildings would introduce many errors and make the model larger. The solution I'm currently developing will add the ability to include both the venue/company/POI name and the name of the building obtained through reverse-geocoding to OSM building polygons, and write them in different places in the address, but they will both have the same tag: "house". Separating them would introduce many cases of arbitrary ambiguity where libpostal would have to learn to deal with multiple meanings for the exact same object. As an example, consider something like the Empire State Building in New York. It's a tourist attraction, but also an office building, and can be included in addresses as either a venue or a building. We allow libpostal to learn ambiguous phrases for the purposes of entity resolution (e.g. Manhattan=city_district vs. Manhattan, Kansas=city), but always label a given entity consistently throughout the training set. Building vs. venue names would also be hard to disambiguate even if there were decent rules for labeling OSM objects. Most words used in venue names are also used in building names, so the weights for both classes would become very dense as the model constantly struggles to find the discriminative line between them.
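The difference between an ambiguous phrase and an inconsistently labeled entity can be shown with a small Python sketch. The records, entity IDs and function names below are invented for illustration and are not libpostal's training data format.

```python
from collections import defaultdict

# Made-up (entity_id, surface_text, label) records for illustration only.
records = [
    ("nyc/manhattan", "Manhattan", "city_district"),
    ("ks/manhattan", "Manhattan", "city"),
    ("nyc/manhattan", "Manhattan", "city_district"),
    ("nyc/esb", "Empire State Building", "house"),
]


def labels_by(records, key_index):
    """Group the set of labels seen for each entity ID (0) or phrase (1)."""
    grouped = defaultdict(set)
    for record in records:
        grouped[record[key_index]].add(record[2])
    return grouped


def inconsistent_entities(records):
    """Entities that carry more than one label (these should not occur)."""
    grouped = labels_by(records, 0)
    return {k: sorted(v) for k, v in grouped.items() if len(v) > 1}


def ambiguous_phrases(records):
    """Phrases whose label depends on which entity they refer to."""
    grouped = labels_by(records, 1)
    return {k: sorted(v) for k, v in grouped.items() if len(v) > 1}


print(inconsistent_entities(records))
print(ambiguous_phrases(records))
```

In this toy data "Manhattan" is an ambiguous phrase, which is fine, while no single entity has two labels. Tagging the Empire State Building as both a venue and a building would make it show up in the first check.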

If we do add a separate building tag, it would probably only be for simple numbered/lettered buildings (e.g. things like "Building 12" or "Building A") that we can generate, similar to how we generate unit numbers. This may sometimes be useful in large apartment blocks, etc.
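For a sense of what generating such components could look like, here is a minimal Python sketch. It is not libpostal's generator: the function names, number ranges and the floor-plus-unit room format are assumptions made for the example.

```python
import random
import string


def random_building(rng):
    """Return a numbered or lettered building, e.g. 'Building 12' or 'Building A'."""
    if rng.random() < 0.5:
        return f"Building {rng.randint(1, 99)}"
    return f"Building {rng.choice(string.ascii_uppercase)}"


def random_room(rng):
    """Return a room as floor number plus two-digit unit, e.g. 'Room 2003'."""
    floor = rng.randint(1, 30)
    unit = rng.randint(1, 20)
    return f"Room {floor}{unit:02d}"


# A seeded generator keeps the synthetic examples reproducible.
rng = random.Random(0)
samples = [f"{random_building(rng)}, {random_room(rng)}" for _ in range(3)]
for sample in samples:
    print(sample)
```

Because the patterns are fully synthetic, they can be appended to existing training addresses without needing any extra source data.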

Landmark/directional addresses have been discussed in other issues. They're also common in India and other places. For simple patterns like "Room 2003", we can (and already do) generate them randomly, but for many directional addresses, without proper training data, it would be about as complicated as trying to generate natural language, which is well beyond the scope of libpostal. If the various landmarks, etc. were in OSM, it would even be possible to reverse geocode to the nearest landmark and generate some per-language directional expressions, but there's not really enough data to make it viable in places that need it.

Changes to labels are not immediately visible, as they require regenerating the training data (around 2 weeks) and retraining the model (around 1 week). I'm planning to kick off the next training batch on Wednesday or Thursday of next week, so submit your changes before then if you want them included. There would be a new model available around the end of May. At present it takes a very large machine to regenerate the training data: 40GB of RAM, several hundred GB of disk space, and about 2 weeks of compute time, though I'm considering moving it to a MapReduce setting.

Lastly, on training: again, libpostal really was not designed to support custom training on any arbitrary CSV data set, and there are no plans to support that use case in the future. Libpostal works very well in places that have high-quality open geo data, and does the best it can otherwise, improving as the data improves. You're always welcome to use any code or ideas from this project, and if you want to help improve libpostal in the open, we'd love contributions!