Open waldoj opened 10 years ago
This might be solvable via TAMU's geocoding API. I'm hesitant to use an API for this, because we're dealing with a very large number of addresses, and because then it becomes necessary to cache geocoded addresses, to avoid ever asking TAMU to geocode the same address more than once.
Now I'm wondering how we might go about caching these. I'm thinking SQLite. I've been needing to learn to use it for simple things like this, and it certainly seems like the right tool for the job.
The key can be the hash of the raw address. I was thinking we'd have multiple tables—one for each file—but on reflection, I can't see what the need is for that. So we'd store the hash, the raw address, and then the address returned by the API. If the API returns an error (an error based on the content of the address, rather than "quota exceeded" or a 404 or something), then just store the original address as the normalized address.
This certainly seems easy to do.
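A minimal sketch of that cache, using only Python's standard library. The table layout (hash, raw address, normalized address) follows the description above; everything else is an assumption made for illustration: the names `open_cache`, `normalize_cached`, and `AddressRejected`, the choice of SHA-256, and the `lookup` callable that stands in for whatever function actually calls the geocoding API.

```python
import hashlib
import sqlite3


class AddressRejected(Exception):
    """Hypothetical error: the API rejected the address content itself."""


def address_key(raw_address):
    """Return a stable hex digest of the raw address to use as the cache key."""
    return hashlib.sha256(raw_address.encode("utf-8")).hexdigest()


def open_cache(path=":memory:"):
    """Open (or create) the cache database with a single addresses table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS addresses ("
        "hash TEXT PRIMARY KEY, raw TEXT NOT NULL, normalized TEXT NOT NULL)"
    )
    return conn


def normalize_cached(conn, raw_address, lookup):
    """Return the normalized address, calling lookup() only on a cache miss."""
    key = address_key(raw_address)
    row = conn.execute(
        "SELECT normalized FROM addresses WHERE hash = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]
    try:
        normalized = lookup(raw_address)
    except AddressRejected:
        # Content-based failure: store the original so we never ask again.
        normalized = raw_address
    # Any other exception (quota exceeded, 404, ...) propagates uncached,
    # so the address can be retried later.
    conn.execute(
        "INSERT INTO addresses (hash, raw, normalized) VALUES (?, ?, ?)",
        (key, raw_address, normalized),
    )
    conn.commit()
    return normalized
```

Passing a file path to `open_cache` instead of the default `:memory:` would give the persistent database described here, which could then be exported and shared.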
As an added nicety, we could even provide an exported copy of the database on AWS. The first time that a client attempts to clean up addresses, it would download the database and load it into SQLite, to avoid the time (and server abuse) of re-cleaning every address with TAMU's API.
Since TAMU requires an API key for this service, I guess we're going to have to add a config file. Perhaps at first we can just store the API key within the script, to put off creating that handler.
Aaaand this is no longer a good option. :-/ They now charge for anything beyond 2,500 queries—geocoding all of this data would cost north of $1,000. (It's possible that this isn't a recent change, and I just didn't notice it before.) Argh.
Also possible: geocode instead of normalize. This falls down in multi-unit buildings, but it might still be superior to the current state of things, which is basically noise.
We can start towards solving this via @datamade's US Address. It's a tokenizer, but not a normalizer, so it won't do anything to turn "Street" into "St" or "East Main Street" into "E Main St" or anything like that, but it will allow us to do a bit of our own normalization.
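To illustrate what "our own normalization" on top of a tokenizer could look like, here is a rough sketch. It operates on a list of `(token, label)` pairs, which is the shape I understand `usaddress.parse()` to return; the label names used below are assumed to match its output and should be checked against the library. The abbreviation tables are deliberately tiny and the function name `normalize_tokens` is made up for this example.

```python
# Small, illustrative lookup tables; a real list would be much longer.
STREET_TYPES = {"street": "St", "avenue": "Ave", "road": "Rd", "drive": "Dr"}
DIRECTIONS = {"north": "N", "south": "S", "east": "E", "west": "W"}

DIRECTIONAL_LABELS = ("StreetNamePreDirectional", "StreetNamePostDirectional")


def normalize_tokens(tagged):
    """Abbreviate street types and directionals, and mixed-case the rest.

    `tagged` is a list of (token, label) pairs, assumed to be in the same
    shape a tokenizer such as usaddress would produce.
    """
    out = []
    for token, label in tagged:
        word = token.strip(".,")
        lower = word.lower()
        if label == "StreetNamePostType":
            out.append(STREET_TYPES.get(lower, word.capitalize()))
        elif label in DIRECTIONAL_LABELS:
            # Already-abbreviated directionals like "NE" are just upper-cased.
            out.append(DIRECTIONS.get(lower, word.upper()))
        elif any(ch.isdigit() for ch in word):
            # Leave house numbers and ordinals such as "110B" or "2nd" alone.
            out.append(word)
        else:
            out.append(word.capitalize())
    return " ".join(out)


example = [
    ("110B", "AddressNumber"),
    ("EAST", "StreetNamePreDirectional"),
    ("MAIN", "StreetName"),
    ("STREET", "StreetNamePostType"),
]
print(normalize_tokens(example))  # 110B E Main St
```

Because the labels decide what gets abbreviated, a street actually named "East" or "Street" is left alone as long as the tokenizer tags it as a street name.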
The Census Bureau's API looks like it might be good for this. I need to test it out with some odd addresses first.
The Census API won't work, because it strips out anything it doesn't recognize as valid, and its ability to recognize validity is too limited. For instance, "110B 2nd St. NE" turns into "110 2nd St NE", even though 110B is a perfectly legitimate (and extant) address.
Ehhhh. Maybe it's OK. Maybe it's Good Enough™.
We're already saving the cleaned-up address, whatever source the geocoder obtains it from. (Esri API for Virginia, Census API for extra-Virginia.) I think we can do this. We can start substituting the returned address when the transform option is invoked.
The provided addresses aren't normalized in any way. Actual street addresses:
A lot of these addresses are not useful as provided. We need to normalize them. Also, it'd be great if we could convert them to mixed case.