ishiland / nyc-parser

Parse single line NYC addresses

Handling letters in house numbers #1

Open SPTKL opened 5 years ago

SPTKL commented 5 years ago

Hey @ishiland, this is an awesome package. We've been using https://github.com/datamade/usaddress for address parsing, but we have problems when parsing cross streets or place names, so we would love to adopt nyc-parser. One issue I found is that house numbers containing letters are parsed incorrectly, e.g.:

>>> p.address('188-60 REAR 120 ROAD, Queens, New York, NY, USA')
{'PHN': '188-60', 'STREET': 'REAR 120 ROAD', 'BOROUGH_CODE': 1, 'BOROUGH_NAME': 'MANHATTAN', 'ZIP': None}

It should instead be:

>>> p.address('188-60 REAR 120 ROAD, Queens, New York, NY, USA')
{'PHN': '188-60 REAR', 'STREET': '120 ROAD', 'BOROUGH_CODE': 1, 'BOROUGH_NAME': 'MANHATTAN', 'ZIP': None}

ishiland commented 5 years ago

usaddress is good. The purpose of this lib is to provide a lighter-weight solution that is specific to NYC addresses and their edge cases. It may make more sense to use usaddress, depending on your needs.

As for this particular issue, I am assuming that all PHNs and street names are separated by whitespace. The relevant code is here: https://github.com/ishiland/nyc-parser/blob/fc79d8127b85da9f07c5fa47cc0372eb70361b37/nycparser/nycparser.py#L29-L30
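To illustrate one reading of that assumption, here is a simplified sketch. It is not the code at the linked lines, and `naive_split` is a hypothetical name. If the first whitespace-delimited token is taken as the PHN, a suffix such as REAR ends up in the street name:

```python
def naive_split(street_part):
    """Hypothetical illustration: the first token is the PHN, the rest is the street."""
    # partition() splits on the first space only, so everything after the
    # first token (including a suffix like REAR) is treated as the street.
    phn, _, street = street_part.strip().partition(" ")
    return {"PHN": phn, "STREET": street.strip()}


print(naive_split("188-60 REAR 120 ROAD"))
# {'PHN': '188-60', 'STREET': 'REAR 120 ROAD'}
```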

Does usaddress successfully parse this address? If you are able to provide test data for issues like these, that would be very helpful. Thanks!

SPTKL commented 5 years ago

Actually, usaddress is failing on this address too: it labels REAR as part of the street name, although sometimes it correctly labels rear or front as address number suffixes. I think I'm going to train a better usaddress model using the PAD data. Alternatively, we could just create some kind of exception for rear and front, but then we have tricky cases like the ones below:

141 FRONT MOTT STREET, Manhattan, New York, NY, USA and

141 FRONT STREET, Manhattan, New York, NY, USA
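A rough sketch of what such an exception might look like, assuming the input is only the part of the address before the first comma. The names `PHN_SUFFIXES`, `STREET_TYPES`, and `split_phn` are hypothetical and not part of nyc-parser or usaddress, and the street-type list is far from complete. The idea is to treat REAR or FRONT as a house number suffix only when the token after it is not a bare street type:

```python
# Hypothetical lookup sets for illustration only; a real list would be much longer.
PHN_SUFFIXES = {"REAR", "FRONT"}
STREET_TYPES = {"STREET", "ST", "AVENUE", "AVE", "ROAD", "RD", "PLACE", "PL"}


def split_phn(street_part):
    """Split '<house number> [REAR|FRONT] <street name>' into PHN and STREET."""
    tokens = street_part.upper().split()
    if len(tokens) < 2:
        # Nothing to split: return the input as the PHN with an empty street.
        return {"PHN": " ".join(tokens), "STREET": ""}
    phn, rest = tokens[0], tokens[1:]
    # Attach REAR/FRONT to the PHN only if a street name still remains after it,
    # i.e. the next token exists and is not just a street type like STREET.
    if rest[0] in PHN_SUFFIXES and len(rest) > 1 and rest[1] not in STREET_TYPES:
        phn = f"{phn} {rest[0]}"
        rest = rest[1:]
    return {"PHN": phn, "STREET": " ".join(rest)}


print(split_phn("141 FRONT MOTT STREET"))  # FRONT goes to the PHN
print(split_phn("141 FRONT STREET"))       # FRONT stays in the street name
print(split_phn("188-60 REAR 120 ROAD"))   # REAR goes to the PHN
```

This handles the two cases above, but it would still fail on any street type missing from the list, which is part of why training on the PAD data seems more robust.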

ishiland commented 5 years ago

Those seem like difficult scenarios to account for. I'd be interested in knowing if you have any success using the PAD data to train usaddress. If not, we can work in some kind of solution with nyc-parser. Perhaps I should write some better tests with the PAD data.