openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

Parsing of Semi-structured data to increase accuracy #301

Open stefanrehm opened 6 years ago

stefanrehm commented 6 years ago

Hi there!

Fantastic library and great results so far for the Brazilian localization. Congratulations!

We are looking at a somewhat different situation though. We are working with mostly semi-structured data, meaning that some parts of the address are already pre-specified, while others are not.

For example, we have the street, street number, unit, and city in an unstructured format, while the state is already known:

This piece of structured information can be one or more attributes for the whole address (it varies by case). Feeding this to libpostal probably increases the accuracy drastically, especially in edge cases and complicated addresses.

Our question is: how could we inject this information into the parsing process in order to get even more accurate results?

Thanks a lot!

mkaranta commented 6 years ago

So this is actually the primary problem I've been facing with my use/integration of libpostal. My primary use case is comparison, not parsing, but libpostal parses much better with these additional pieces of information in the right context-sensitive locations.

The best advice I can give here is: if the thing you know is present in the address, replace it with the best version, so av paulista, 2011, 15° andar, s paulo -> av paulista, 2011, 15° andar, Sao Paulo. Another solution I found is to parse the various expansions and only keep the parses/expansions that match my "known" or structured data.
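The substitution idea can be sketched in plain Python as a preprocessing step before the string ever reaches libpostal. This is an illustrative helper, not part of libpostal's API; the variant list you'd maintain for your own data is an assumption here.

```python
import re

def inject_known_value(raw, variants, canonical):
    """Replace any known variant of a field with its canonical form.

    Illustrative helper: scans the raw address for each variant spelling
    (case-insensitively) and substitutes the canonical version, so the
    parser sees the cleanest possible input.
    """
    for v in variants:
        pattern = re.compile(re.escape(v), re.IGNORECASE)
        if pattern.search(raw):
            return pattern.sub(canonical, raw)
    return raw

raw = "av paulista, 2011, 15° andar, s paulo"
cleaned = inject_known_value(raw, ["s paulo", "s. paulo", "sao paulo"], "Sao Paulo")
# cleaned == "av paulista, 2011, 15° andar, Sao Paulo"
```

The cleaned string would then be handed to the parser as usual; the same pattern works for states, postcodes, or any other field you already trust.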

My problem is a little different because the structured data is sometimes present in the address and sometimes not, so it's more of a data sanitization problem. The new comparison API means I get to ignore this duplication, and it might also help with the problem you face.

albarrentine commented 6 years ago

@stefanrehm thanks! There's fairly rich open data in Brazil from the 2010 census as part of the OpenAddresses project, and we train on all of it, so the parser should do well there.

The idea of dynamically feeding admin/country information into the parser seemed intuitive to me initially as well, though after some experimentation, I found that a global model which did not consider country/language performed better overall and was smaller. This makes some degree of sense in that we might have a ton of training examples for one country, say México, and not as many for a nearby country, say Honduras. They're both in Central America, share the same language, etc. so it makes sense that some of what libpostal learns about address structure in México will transfer over to parsing in Honduras. Modeling country directly in the parser would mean we're effectively creating multiple different parameter spaces that have to be learned separately, whereas with a global model, every Spanish-speaking country can share statistical strength for the words/phrases/patterns they share in common while still learning their own idiosyncratic words, names, toponyms, etc.

The other thing about modeling country directly in a global parser model is that we'd have to either always know the country at runtime (not an assumption we'd like to make for everybody), or learn a global model for when country is not known and per-country models on top of that (which would balloon the size of the model, already relatively hefty at 1.8G of disk/memory usage).

In the address posted above, I think the current parser makes one error on "15° andar", but that's not related to localization, just to the fact that we have to randomly generate most of our sub-building information like apartments, floors, etc., and in the 1.0 release we only used one character for the ordinal indicator, so "15º andar" will parse correctly, but "15° andar" was not really seen before. In 1.1 we'll use both variations randomly, but in the meantime, it's simple to replace the degree symbol with the masculine ordinal indicator "º" as a preprocessing step before sending the input to libpostal, for more accurate results in Portuguese, Spanish, Italian, etc. A lot of times, if something isn't working well in libpostal, it's because there's a minor difference between the training data and the input, so it's always useful to try a few different variations, report which versions are working/not working here, use a regex replacement as a stopgap, and we'll add the new variations to the next training batch.
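That stopgap replacement is a one-liner. A minimal sketch: swap the degree sign (U+00B0) for the masculine ordinal indicator (U+00BA) before calling libpostal; the function name is illustrative.

```python
def normalize_ordinals(address):
    """Replace the degree sign with the masculine ordinal indicator,
    so "15° andar" matches the "15º andar" form seen in training data."""
    return address.replace("\u00b0", "\u00ba")  # "°" -> "º"

print(normalize_ordinals("av paulista, 2011, 15\u00b0 andar"))
```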

As far as the main question about parsing where some fields are known, the same structure exists in many of the data sets I work with as well, such as voter files or the global venue data sets used for the lieu deduping project. In all of those cases, there's an unstructured street-level address + the city, state/region, country, postcode, etc. as separate fields. While it should be possible to parse the unstructured street-level details field by itself, and we do train on partial addresses without the city/toponyms, I've found the 1.0 parser under-represents these cases, so for 1.1 I'll try to generate more street-address-only examples so the parser can handle them better.

For best results right now, as @mkaranta says, the parse will often be more accurate if you concatenate fields like state, country, postcode, etc. that you already know onto the unstructured input, using the most common format, so you can add something like ", Sao Paulo, 01311-300, Brasil" depending on which fields you have (we cover all the Brazilian state abbreviations as well, so also feel free to use "SP"). From there, you can start with all the fields that the parser predicts and then just overwrite the ones you already knew about with the existing structured fields.
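The concatenate-then-overwrite workflow described above can be sketched as follows. With the pypostal bindings installed, the parse step would be `parse_address` from `postal.parser`; here it is stubbed with an illustrative result so the query-building and merging logic stands alone, and the helper names and field order are assumptions, not libpostal API.

```python
# Order in which known fields are appended; an assumption chosen to
# mirror a common "city, state, postcode, country" address suffix.
KNOWN_FIELD_ORDER = ["city", "state", "postcode", "country"]

def build_query(unstructured, known):
    """Append the fields we already know onto the unstructured input."""
    suffix = ", ".join(known[f] for f in KNOWN_FIELD_ORDER if f in known)
    return f"{unstructured}, {suffix}" if suffix else unstructured

def merge(parsed, known):
    """Start from the parser's prediction, then overwrite any field
    we already knew with the trusted structured value."""
    return {**parsed, **known}

known = {"state": "SP", "postcode": "01311-300", "country": "Brasil"}
query = build_query("av paulista, 2011, 15\u00ba andar", known)
# query == "av paulista, 2011, 15º andar, SP, 01311-300, Brasil"

# With pypostal this would be:  parsed = dict(parse_address(query))
# Illustrative stand-in for the parser's output:
parsed = {"road": "av paulista", "house_number": "2011",
          "level": "15\u00ba andar", "state": "sp"}
final = merge(parsed, known)
# final keeps the parser's road/house_number/level but uses "SP" for state
```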

stefanrehm commented 6 years ago

@mkaranta @albarrentine Thanks for the detailed analysis and rich explanations! I suppose we will use the concatenation approach. In the end, we could automate some sort of feedback if the model does not predict the fields we already know.

@albarrentine Any ideas on how we could help improve libpostal for Brazil? We haven't actually tested the performance at a broad level yet, but are planning to do so in the coming months. If there is anything we could do, we would like to incorporate it into the initial use case from the start.