openva / crump

A parser for the Virginia State Corporation Commission's business registration records.
https://vabusinesses.org/
MIT License

Normalize addresses #57

Open waldoj opened 10 years ago

waldoj commented 10 years ago

The provided addresses aren't normalized in any way. Actual street addresses:

```
4142 MELROSE AVE., N.W., NO.9
1125 JERNIGAN AVE APT H
204 A EAST BELLEFONTE AVENUE
2730 WEST TYVOLA ROAD, FOUR COLISEUM CENTRE
COURTHOUSE PLAZA NE
ONE TOWER SQUARE
SIX MANHATTAN SQUARE
WORLD HEADQUARTERS, ONE ELMCROFT ROAD
320-18TH STREET
```

A lot of these addresses aren't useful as-is. We need to normalize them. Also, it'd be great if we could convert them to mixed case.

waldoj commented 10 years ago

This might be solvable via TAMU's geocoding API. I'm hesitant to use an API for this, because we're dealing with a very large number of addresses, and because then it becomes necessary to cache geocoded addresses, to avoid ever asking TAMU to geocode the same address more than once.

waldoj commented 10 years ago

Now I'm wondering how we might go about caching these. I'm thinking SQLite. I've been needing to learn to use it for simple things like this, and it certainly seems like the right tool for the job.

The key can be the hash of the raw address. I was thinking we'd have multiple tables—one for each file—but on reflection, I can't see what the need is for that. So we'd store the hash, the raw address, and then the address returned by the API. If the API returns an error (an error based on the content of the address, rather than "quota exceeded" or a 404 or something), then just store the original address as the normalized address.

This certainly seems easy to do.
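A minimal sketch of that cache, assuming SQLite and a single table (the table and column names here are hypothetical):

```python
import hashlib
import sqlite3

conn = sqlite3.connect('addresses.db')
conn.execute('''CREATE TABLE IF NOT EXISTS addresses
                (hash TEXT PRIMARY KEY, raw TEXT, normalized TEXT)''')

def lookup(raw):
    """Return the cached normalized address, or None on a cache miss."""
    key = hashlib.sha256(raw.encode('utf-8')).hexdigest()
    row = conn.execute('SELECT normalized FROM addresses WHERE hash = ?',
                       (key,)).fetchone()
    return row[0] if row else None

def store(raw, normalized):
    """Cache a result; on an address-content error, call this with normalized=raw."""
    key = hashlib.sha256(raw.encode('utf-8')).hexdigest()
    conn.execute('INSERT OR REPLACE INTO addresses VALUES (?, ?, ?)',
                 (key, raw, normalized))
    conn.commit()
```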

waldoj commented 10 years ago

As an added nicety, we could even provide an exported copy of the database on AWS. The first time that a client attempts to clean up addresses, it would download the database and load it into SQLite, to avoid the time (and server abuse) of re-cleaning every address with TAMU's API.

Since TAMU requires an API key for this service, I guess we're going to have to add a config file. Perhaps at first we can just store the API key within the script, to put off creating that handler.
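Seeding the cache could be as simple as this sketch (the bucket URL is made up, pending wherever the export actually lives):

```python
import os
import urllib.request

DB_URL = 'https://s3.amazonaws.com/example-bucket/addresses.db'  # hypothetical location
DB_PATH = 'addresses.db'

# Download the pre-built cache once, before the first cleanup run.
if not os.path.exists(DB_PATH):
    urllib.request.urlretrieve(DB_URL, DB_PATH)
```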

waldoj commented 10 years ago

Aaaand this is no longer a good option. :-/ They now charge for anything beyond 2,500 queries—geocoding all of this data would cost north of $1,000. (It's possible that this isn't a recent change, and I just didn't notice it before.) Argh.

waldoj commented 10 years ago

Also possible: geocode instead of normalize. This falls down in multi-unit buildings, but it might still be superior to the current state of things, which is basically noise.

waldoj commented 10 years ago

We can start towards solving this via @datamade's usaddress. It's a tokenizer, not a normalizer, so it won't turn "Street" into "St" or "East Main Street" into "E Main St" or anything like that, but it will allow us to do a bit of our own normalization.
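For illustration, here's roughly what its tokenizing looks like on one of the raw addresses above (labels per the usaddress docs; output abbreviated):

```python
import usaddress

# parse() returns (token, label) pairs; the tokens themselves are untouched,
# so 'AVE.' stays 'AVE.' -- the labeling is what we'd build normalization on.
for token, label in usaddress.parse('4142 MELROSE AVE., N.W., NO.9'):
    print(label, token)
# e.g. AddressNumber 4142 / StreetName MELROSE / StreetNamePostType AVE., ...
```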

waldoj commented 10 years ago

The Census Bureau's API looks like it might be good for this. I need to test it out with some odd addresses first.
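If it pans out, a test would look something like this, using the geocoder's public onelineaddress endpoint:

```python
import requests

URL = 'https://geocoding.geo.census.gov/geocoder/locations/onelineaddress'

def census_normalize(raw):
    """Ask the Census geocoder for a match; return its address or None."""
    response = requests.get(URL, params={'address': raw,
                                         'benchmark': 'Public_AR_Current',
                                         'format': 'json'})
    matches = response.json()['result']['addressMatches']
    return matches[0]['matchedAddress'] if matches else None

print(census_normalize('110B 2ND ST NE, WASHINGTON, DC'))
```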

waldoj commented 9 years ago

The Census API won't work, because it strips out anything it doesn't recognize as valid, and its ability to recognize validity is too limited. For instance, 110B 2nd St. NE turns into 110 2nd St NE, even though 110B is a perfectly legitimate (and extant) address.

waldoj commented 9 years ago

Ehhhh. Maybe it's OK. Maybe it's Good Enough™.

waldoj commented 9 years ago

We're already saving the cleaned-up address from whichever source the geocoder obtains it. (Esri API for Virginia, Census API for everything outside Virginia.) I think we can do this: we can start substituting the returned address when the transform option is invoked.
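The routing might be sketched like this (`esri_normalize` is a hypothetical wrapper, stubbed here; `census_normalize` is e.g. the Census sketch above):

```python
def esri_normalize(raw):
    """Hypothetical wrapper around the Esri API we use for Virginia."""
    ...

def census_normalize(raw):
    """E.g., the Census geocoder sketch from earlier in this thread."""
    ...

def normalize(raw, state):
    """Route by state, and fall back to the raw address if the lookup fails."""
    cleaned = esri_normalize(raw) if state == 'VA' else census_normalize(raw)
    return cleaned if cleaned else raw
```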

waldoj commented 9 years ago

Ah, right, there's one catch: we have to break up addresses. E.g., 123 E MAIN ST, CHARLOTTESVILLE, VA, 22902 needs to be split into 123 E MAIN ST, CHARLOTTESVILLE, VA, and 22902 as separate fields. So a tokenizer is necessary. usaddress, I assume.
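usaddress's tag() does exactly this consolidation (the output shown is approximate):

```python
import usaddress

components, address_type = usaddress.tag('123 E MAIN ST, CHARLOTTESVILLE, VA, 22902')
# components is an OrderedDict, roughly:
#   AddressNumber: 123, StreetNamePreDirectional: E, StreetName: MAIN,
#   StreetNamePostType: ST, PlaceName: CHARLOTTESVILLE,
#   StateName: VA, ZipCode: 22902
# tag() raises usaddress.RepeatedLabelError when it can't settle on one label.
```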

waldoj commented 9 years ago

Hey, Mapzen makes a tool that'll help. It's an actual normalizer, instead of a tokenizer, which is better for this use case.
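Assuming the tool in question is libpostal (an assumption; the comment doesn't name it), its Python binding would be used like so:

```python
from postal.expand import expand_address

# expand_address() returns normalized variants: abbreviations expanded,
# spelled-out numbers standardized, case lowered.
print(expand_address('ONE TOWER SQUARE'))
# e.g. ['1 tower square', 'one tower square']
```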