openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

Some multi-word toponyms get broken up incorrectly #178

Open steveha-ziprecruiter opened 7 years ago

steveha-ziprecruiter commented 7 years ago

Found using Python bindings, verified using src/address_parser test tool.

Location strings that include "Long Island City" can confuse libpostal: the city can come back as just "city" or "new", with the state returned as "york". The string "Long Island City, New York, NY" does produce a good result.

I used Google Maps to look up the addresses of businesses in the area, and Google Maps at least uses the form "Long Island City, NY" rather than something fully qualified like "Long Island City, Queens, New York City, NY". So I think this is a real issue.
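For reference, a minimal reproduction with the Python bindings might look like this (a sketch assuming the pypostal package and its parse_address function; the output shown is just the same parse as the first src/address_parser example below):

from postal.parser import parse_address

# parse_address returns a list of (value, label) tuples
print(parse_address("22-25 Jackson Ave, Long Island City, NY 11101"))
# e.g. [('22-25', 'house_number'), ('jackson ave', 'road'),
#       ('long island', 'suburb'), ('city', 'city'),
#       ('ny', 'state'), ('11101', 'postcode')]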

Examples from src/address_parser:

> 22-25 Jackson Ave, Long Island City, NY 11101

Result:

{
  "house_number": "22-25",
  "road": "jackson ave",
  "suburb": "long island",
  "city": "city",
  "state": "ny",
  "postcode": "11101"
}

> Long Island City, New York, NY, US

Result:

{
  "suburb": "long island city",
  "city": "new york",
  "state": "ny",
  "country": "us"
}

> Long Island City, NY, US

Result:

{
  "house": "long island",
  "city": "city",
  "state": "ny",
  "country": "us"
}

> Long Island City, New York, US

Result:

{
  "suburb": "long island city",
  "city": "new",
  "state": "york",
  "country": "us"
}
albarrentine commented 7 years ago

Gotcha, just today I noticed something similar with certain multiword place names that are highly ambiguous (e.g. New York can be city or state).

It has to do with the feature representation of the input for those multiword strings (in the address_parser client, you can type ".print_features [on|off]" to see how libpostal represents your input under the hood - helpful in debugging).

One of the most helpful input features in the model is the combination of the previous tag and the current word. So for instance "prev tag=city and current word=new york". For multiword strings that are known place names from the training data, we group the phrase into a single token in terms of the feature representation. However, the parser still has to make predictions at each individual word.

This results in the following confusing-to-parse case: in the phrase "new york", on the word "new", the features might be something like "prev tag=road and current word=new york" whereas on the word "york" it's "prev tag=city and current word=new york" where all the other features are the same. That representation is ambiguous with "new york, new york" where the second "new york" is the state. Hence sometimes it can result in a bad parse. It doesn't affect too many places, but comes up in a few annoying cases like this nonetheless.

That should be fixable by changing the feature representation to take into account which word we're on in the multiword phrase, i.e. "prev tag=city and phrase=new york and current word=york". Assuming that goes well in my trials, the parser will have to be retrained, which takes about a week.
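Purely as an illustration of that change (libpostal's actual feature extraction is in C; the feature strings here are just paraphrased from the description above), the difference is roughly:

def features_old(prev_tag, phrase):
    # Old scheme: every word of a known multiword phrase gets the same
    # feature, so "new" and "york" in "new york" look identical.
    return ["prev tag={} and current word={}".format(prev_tag, phrase)]

def features_new(prev_tag, phrase, current_word):
    # Proposed scheme: also encode which word of the phrase we're on,
    # so "york" as the second word of "new york" is distinguishable
    # from a second "new york" acting as the state.
    return ["prev tag={} and phrase={} and current word={}".format(
        prev_tag, phrase, current_word)]

print(features_new("city", "new york", "york"))
# ['prev tag=city and phrase=new york and current word=york']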

albarrentine commented 7 years ago

Ok, there's a new version training now. However, this may also have to wait on new training data, as in the 1.0 set we were a bit overzealous about requiring a city in all training examples. Long Island City is technically a suburb/neighborhood, but can be used in place of a city name next to a word like "NY" - in our training data it would have always been "Long Island City, New York, NY" or "Long Island City, Queens, NY" (libpostal parses those cases correctly) and probably never "Long Island City, NY", so the parser will have a tough time recognizing a transition from suburb to state.

delner commented 7 years ago

I think I have another example of this with New York, without a suburb or state. In my case it seems to change how it parses the city based on the unit present?

> 1 west 4th st apt a new york

Result:

{
  "house_number": "1",
  "road": "west 4th st",
  "unit": "apt a",
  "city": "new york"
}

> 1 west 4th st apt 4 new york

Result:

{
  "house_number": "1",
  "road": "west 4th st",
  "unit": "apt 4",
  "city": "new york"
}

> 1 west 4th st apt 4a new york

Result:

{
  "house_number": "1",
  "road": "west 4th st",
  "unit": "apt 4a",
  "city": "new",
  "state": "york"
}

Interestingly enough, it doesn't apply to all units (here's "1a" instead of "4a"):

> 1 west 4th st apt 1 new york

Result:

{
  "house_number": "1",
  "road": "west 4th st",
  "unit": "apt 1",
  "city": "new york"
}

> 1 west 4th st apt a new york

Result:

{
  "house_number": "1",
  "road": "west 4th st",
  "unit": "apt a",
  "city": "new york"
}

> 1 west 4th st apt 1a new york

Result:

{
  "house_number": "1",
  "road": "west 4th st",
  "unit": "apt 1a",
  "city": "new york"
}

It parses correctly if you add "new york" or "ny" as a state:

> 1 west 4th st apt 4a new york ny

Result:

{
  "house_number": "1",
  "road": "west 4th st",
  "unit": "apt 4a",
  "city": "new york",
  "state": "ny"
}

> 1 west 4th st apt 4a new york new york

Result:

{
  "house_number": "1",
  "road": "west 4th st",
  "unit": "apt 4a",
  "city": "new york",
  "state": "new york"
}
albarrentine commented 7 years ago

@delner same deal. "New York" is easy for our brains to disambiguate, but can be hard for a computer - it often means "city" and it often means "state". The way the features (machine learning features, not product features) are extracted in the currently-deployed model makes "New York" difficult to distinguish from the "New York, New York" case, as explained in more detail above. So when "New York" is at the end of the string as a city, the current parser tends to get it right only in situations frequent enough that the machine learning model would incur a large cost in terms of its error function for getting them wrong.

We generate unit numbers randomly in the training data (since they're not frequently used in OSM), and something like "Apt 1" is generated more frequently than something like "Apt 4A", so the model might not have enough examples of "4A" followed by "New York" the city to force itself to care about getting that (as far as it's concerned) highly ambiguous case wrong.
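As a rough sketch of the skew described above (this is not libpostal's training-data generator, just an illustration of why "Apt 1"-style units dominate), random unit generation weighted toward plain numbers could look like:

import random

def random_unit():
    # Hypothetical weighting: plain numeric units ("apt 1") come up far
    # more often than number+letter units ("apt 4a"), so the model sees
    # relatively few examples of "4a" followed by a bare "new york" city.
    number = random.randint(1, 20)
    if random.random() < 0.8:
        return "apt {}".format(number)
    return "apt {}{}".format(number, random.choice("abcdef"))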

I've already implemented the changes proposed above, i.e. keeping track of which word we're currently on in known multiword toponyms, which makes the "New York" city/state case much easier for the model to distinguish. A version I recently trained with those changes can correctly parse the "apt 4a" example as well:

> 1 west 4th st apt 4a new york

Result:

{
  "house_number": "1",
  "road": "west 4th st",
  "unit": "apt 4a",
  "city": "new york"
}

That model's overall accuracy improved to 99.56% on held-out test data, so it should fix most of this class of errors, though not "Long Island City", because the suburb-to-state transition is not present in that version of the training data. I'll probably just lump those changes into the 1.1 release with the next batch of training data to avoid having too many models floating around, but if this version is desperately needed in the meantime, let me know.

Re: the "1a" case, that is a bit odd. Usually single digits get normalized to simply "D" in the internal representation, so there generally shouldn't be any differences in parses between "1a" vs. "4a" vs "9a", etc. We just consider the length of the digits as well as any letters that might be part of the token (in the next version it might be sensible to also normalize the letters in a numeric token to "X" so "1A", "2B", "9Z", etc. all normalize to "DX" - this is known in the named entity recognition literature as a "word shape"). If you type .print_featues into the address parser client, it will show the the model's representation or features of the input. This, combined with inspecting the training data with grep, etc. is one of the best ways to debug an incorrect parse in libpostal. Looking at the features, there apparently exists a "city" somewhere in the training data called "1a" (which is probably just an error in a user's OSM tag). We do remove boundary names that are strictly digits presently to account for mislabeled OSM tags, etc. but it might be worthwhile to extend the definition a bit for user-specified place names that are simply a number plus one (Latin) letter. User-specified cities can always be replaced with reverse-geocoded OSM boundaries, so if there really is a place with a digit + single letter name (the only counterexample in Latin script that's coming to mind is something like "9e Arondissement" in Paris but even there I don't think it can be written as simply "9e"), that example could be added to the OSM boundary.
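The proposed word-shape normalization can be sketched in a few lines of Python (illustrative only; libpostal's actual normalization lives in the C code):

def word_shape(token):
    # Collapse digits to "D" and Latin letters to "X", so "1a", "4a"
    # and "9z" all share the shape "DX", while "11b" becomes "DDX".
    shape = []
    for ch in token:
        if ch.isdigit():
            shape.append("D")
        elif ch.isalpha():
            shape.append("X")
        else:
            shape.append(ch)
    return "".join(shape)

print(word_shape("1a"), word_shape("4a"), word_shape("11b"))
# DX DX DDX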

delner commented 7 years ago

I see, I figured it might be a similar issue.

This isn't pressing for me exactly, since it works okay in version 0.3.2 and I can probably wait until 1.1 to upgrade. It's worth noting that I did try this against the latest master as of yesterday and it didn't seem to be fixed, so I assume the fix isn't in master yet.

Do you have an idea when this fix or 1.1 might roll out?

albarrentine commented 7 years ago

@delner that branch has not been merged into master; merging it would mean deploying the new model and cutting a new release. The multiword toponym features mentioned above mean changing the input variables that the parser uses to make predictions, and it's important that the model have the same information at runtime that it had at training time so that the weights it learned are useful.

In 1.1 I'll be adding the release version to the S3 keys for the models so there can be multiple data updates (either because of something we changed on our side or just a new dump from OSM) per version, but each time there's a model change, we bump the minor version and create a new keyspace for models trained on that version.

The OSM dump starts tonight around 1am GMT and contains a change or two from this week that I wanted to go into the release. That finishes sometime on Wednesday, and I'm still testing a few changes to training data generation to hopefully correct for various parser issues that have been reported. The full data generation and training process takes a couple of weeks at present, so likely end of May or early June.

delner commented 7 years ago

Awesome. Thanks for being so responsive, @thatdatabaseguy! I'm looking forward to this.