openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

fractional numbers parsed as road name in some cases #191

Open missinglink opened 7 years ago

missinglink commented 7 years ago

heya, I just noticed the way fractional house numbers are being parsed has changed with the recent release.

the first result below is from the old model we were running here at mz, and the second is from the new model on master.

> 537½ west 26th street, new york

Result:

{
  "house_number": "537 1 / 2",
  "road": "west 26th street",
  "state": "new york"
}

> 537½ west 26th street, new york

Result:

{
  "house_number": "537",
  "road": "½ west 26th street",
  "city": "new york"
}

albarrentine commented 7 years ago

Hm, that definitely doesn't happen for all vulgar fractions. One of the Jamaican test cases is "16½ Windward Road", which the new parser gets perfectly.

I think I do see one problem though for certain types of US addresses (using the .print_features command in the address_parser client prints the input to the model for each token and makes most of these things clear).

In particular, vulgar fractions are considered numeric tokens, so they get normalized to just "D", which is the same normalization used for a one-digit number. So that's probably why. Easy to fix for the next build, but requires training a new parser (at least a week).
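To make the collision concrete, here is a minimal Python sketch of that kind of digit normalization. It is not libpostal's actual code: `normalize_numeric` is a hypothetical helper, the token list is written by hand (assuming "½" ends up as its own token, as the new output suggests), and Python's `str.isnumeric` stands in for whatever check the parser really uses.

```python
def normalize_numeric(token: str) -> str:
    """Hypothetical stand-in for the parser's numeric normalization.

    If every character in the token is numeric (str.isnumeric is True for
    digits and for vulgar fractions such as "½"), replace each character
    with "D". Other tokens are returned unchanged.
    """
    if token and all(ch.isnumeric() for ch in token):
        return "D" * len(token)
    return token


# Hand-written tokens for "537½ west 26th street" (assumed tokenization).
tokens = ["537", "½", "west", "26th", "street"]
features = [normalize_numeric(t) for t in tokens]

print(features)
# ['DDD', 'D', 'west', '26th', 'street']

# The fraction and an ordinary one-digit number produce the same feature,
# so a model that only sees this feature cannot tell them apart.
print(normalize_numeric("½") == normalize_numeric("5"))
# True
```

Under these assumptions, "½" looks exactly like a one-digit number to the model, which would be consistent with it being pulled away from the house number in the example above.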

missinglink commented 7 years ago

:+1: