added tests that demonstrate library weaknesses

hollanddd commented 10 years ago

pyaddress has problems with multi-word cities.

I couldn't find the issues tab to submit under, so here goes:

- We shouldn't stop processing when there is no house number. It is perfectly valid for a house to not have a street number, e.g. a majority of Carmel, California.
- There are some naming issues:
  - pyaddress is missing the street post-direction, e.g. 12006 120th Pl NE, Kirkland, WA.
  - Ambiguous naming: apartment should be unit_designator or simply designator. Designators include apartment, unit, suite, department, etc.
  - Unit numbers should go into a designator_number, e.g. unit 38 should not lose its designation.

More than anything, there needs to be consistency in the naming. We should refer to USPS Publication 28: http://pe.usps.com/text/pub28/welcome.htm
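To make the naming suggestion concrete, here is a rough sketch of what a result object could look like. The class name `ParsedAddress` and all of its field names are hypothetical, loosely following the terms used above; they are not pyaddress's current attributes and are not copied from Publication 28.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedAddress:
    """Hypothetical container; every field is optional so that an
    address without a house number can still be represented."""
    house_number: Optional[str] = None
    street_name: Optional[str] = None
    street_suffix: Optional[str] = None
    post_direction: Optional[str] = None     # e.g. "NE" in "120th Pl NE"
    unit_designator: Optional[str] = None    # apartment, unit, suite, ...
    designator_number: Optional[str] = None  # e.g. "38" in "unit 38"
    city: Optional[str] = None
    state: Optional[str] = None
    zip_code: Optional[str] = None


# The Kirkland example from above, filled in by hand.
kirkland = ParsedAddress(
    house_number="12006", street_name="120th", street_suffix="Pl",
    post_direction="NE", city="Kirkland", state="WA",
)

# A unit keeps both its designator and its number.
unit = ParsedAddress(unit_designator="unit", designator_number="38")
```

Keeping the designator and its number in separate fields means "unit 38" and "suite 38" stay distinguishable after parsing.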

joshgachnang commented 10 years ago

I added these as a few issues. You are very correct about all of it. When I was originally designing it, it was focused on an apartment search site that served the Midwest, so I only considered issues that were present in the major cities we were serving. Now I'd like this to be a more general library, and I'm thinking eventually an API: a library only helps the Python community, whereas a REST API would make it accessible to every language.

hollanddd commented 10 years ago

That's in line with my goals for this project as well. Parsing addresses is no easy task, and I'm glad to have someone to bounce ideas off of.

joshgachnang commented 10 years ago

Well, let me bounce a few ideas then.

Maybe we can recursively parse the string, similar to what I did before, but removing parts that we can guarantee are in the correct position and passing the remainder back into the parser, along with the known data (so we aren't looking for the zip again after we confirm we have the zip). So, once we identify the zip code, we can do a lookup and see if the city and state are part of the address, and remove them as well. Then we'd be left with the street number, name, prefixes and suffixes, and parse accordingly.
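A minimal sketch of that peel-off idea, assuming a ZIP-to-city lookup is available. The function name `peel_address`, the `known` dict, and the tiny `ZIP_LOOKUP` table are all made up for illustration; this is not how pyaddress currently works.

```python
import re

# Hypothetical sample data; a real table would come from a ZIP code database.
ZIP_LOOKUP = {
    "98034": ("Kirkland", "WA"),
    "93921": ("Carmel By The Sea", "CA"),
}

# A 5-digit ZIP (optionally ZIP+4) at the very end of the string.
ZIP_RE = re.compile(r"\b(\d{5})(?:-\d{4})?\s*$")


def peel_address(remaining, known=None):
    """Strip one confidently identified part off the end of the string,
    then recurse on what is left, carrying the known data along."""
    known = dict(known or {})
    remaining = remaining.strip(" ,")

    # Step 1: look for the zip only if we have not already found it.
    if "zip" not in known:
        match = ZIP_RE.search(remaining)
        if match:
            known["zip"] = match.group(1)
            return peel_address(remaining[:match.start()], known)

    # Step 2: use the zip to find and remove the city and state.
    if "zip" in known and "city" not in known and known["zip"] in ZIP_LOOKUP:
        city, state = ZIP_LOOKUP[known["zip"]]
        tail = re.compile(
            r",?\s*" + re.escape(city) + r"\s*,?\s*" + re.escape(state) + r"\s*$",
            re.IGNORECASE,
        )
        match = tail.search(remaining)
        if match:
            known["city"], known["state"] = city, state
            return peel_address(remaining[:match.start()], known)

    # Step 3: whatever is left is the street part, still to be parsed.
    known["street"] = remaining
    return known


result = peel_address("12006 120th Pl NE, Kirkland, WA 98034")
no_number = peel_address("Ocean Ave, Carmel By The Sea, CA 93921")
```

Because the city comes from the lookup rather than from splitting on whitespace, a multi-word city is removed as one piece, and the street part is returned even when it has no house number.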

It may also be possible to pull all the addresses available in OpenStreetMap into a file for parsing. That data would allow us to check our guesses against known data. The data could be kept up to date with the diffs OpenStreetMap releases weekly with new data. I expect it would be very large, but usable and a good replacement for having to run a DSTK server. I don't know if it would be useful for the library (could be too big), but certainly for an API.

joshgachnang commented 10 years ago

Also, if we can find a pool of data along with correct data, we could train something like a Named Entity Recognizer. As the data grows larger, the recognition should get better.

hollanddd commented 10 years ago

I liked your idea of assigning a probability to the parts, and I totally agree about isolating the zip and then looking at a reference to get city and state. It would be great to then look at street names by that city/state combination. I would have to look further into OpenStreetMap before I could speak to that. I would like to keep this as light as possible. It would also be worth looking into the USPS-provided services for assessing address deliverability for use in the API.

joshgachnang / pyaddress

added tests that demonstrate library weaknesses #1