openva / saberva

A parser for bulk campaign finance data files provided by the Virginia State Board of Elections.
MIT License

Normalize addresses #17

Open waldoj opened 11 years ago

waldoj commented 11 years ago

Address data does not appear to be normalized. For instance, Virginia Engineers PAC's 09/07/2012 contribution says that their primary place of business is "Richmon, VA." (To be fair, that's not address data.) It appears that the software used by many committees does normalize addresses, but each package normalizes differently. For instance, some software normalizes to long street suffixes ("Court," "Boulevard," "Road," etc.), while other software normalizes to short street suffixes ("Ct.," "Blvd.," "Rd.," etc.). So the good news is that individual reports are often internally consistent, which should make it easy to bring all of the reports into a single, consistent form.

Implement an address normalization system (presumably the USPS's API) to deal with this problem.

The only question is at what point this should be done. Is it appropriate to do this prior to saving the data and generating the JSON? Or is it wrong to alter the SBE's data? Wouldn't this mean making tens of thousands of API calls every time that the parser is run?

This might be an argument for standardizing addresses via a cruder, local function at the time of input, and saving the USPS API calls for use beyond the Saberva pipeline.
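As a rough illustration of what such a cruder, local function could look like, here is a minimal Python sketch. It is not Saberva's code: the `SUFFIXES` table, the `normalize_address` name, and the choice to standardize on short suffixes without periods are all assumptions made for the example, and the table is deliberately incomplete.

```python
import re

# Hypothetical, incomplete mapping of long street suffixes to short forms.
SUFFIXES = {
    "court": "Ct",
    "boulevard": "Blvd",
    "road": "Rd",
    "street": "St",
    "avenue": "Ave",
    "drive": "Dr",
    "lane": "Ln",
}
SHORT_FORMS = set(SUFFIXES.values())


def normalize_address(raw: str) -> str:
    """Collapse whitespace and rewrite known street suffixes to one short form."""
    words = re.sub(r"\s+", " ", raw.strip()).split(" ")
    out = []
    for word in words:
        # Compare on the bare word, ignoring trailing periods and commas.
        key = word.rstrip(".,").lower()
        comma = "," if word.endswith(",") else ""
        if key in SUFFIXES:
            # Long suffix ("Street") becomes the short form ("St").
            out.append(SUFFIXES[key] + comma)
        elif key.capitalize() in SHORT_FORMS:
            # Already short ("St.", "blvd"): drop the period, fix the case.
            out.append(key.capitalize() + comma)
        else:
            out.append(word)
    return " ".join(out)


print(normalize_address("45 Oak Boulevard, Richmond, VA"))
print(normalize_address("45 Oak Blvd., Richmond, VA"))
```

Both calls should produce the same string, which is the point: reports that are each consistent with themselves end up consistent with one another. The approach is crude on purpose. It rewrites any matching word, so "Court Street" would become "Ct St" and "St. Paul" would lose its period, and it does nothing about misspellings such as "Richmon."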

waldoj commented 11 years ago

I signed up for the USPS API months ago, getting as far as the part where you wait for approval. I never heard back. So that ain't gonna happen for people.

waldoj commented 9 years ago

The Census Bureau's geocoding API might be the way to do this.
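For anyone who wants to try this, here is a hedged Python sketch of a single-address lookup. The endpoint URL, the `address`, `benchmark`, and `format` parameters, the `Public_AR_Current` benchmark name, and the `result.addressMatches[].matchedAddress` response shape are written from my understanding of the Census Geocoder documentation and should be checked against it before relying on this. The function names are made up for the example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint for one-line address lookups; verify against the Census docs.
GEOCODER_URL = "https://geocoding.geo.census.gov/geocoder/locations/onelineaddress"


def build_geocoder_url(address, benchmark="Public_AR_Current"):
    """Build the lookup URL for one free-form address string."""
    query = urlencode({"address": address, "benchmark": benchmark, "format": "json"})
    return f"{GEOCODER_URL}?{query}"


def first_match(payload):
    """Return the first matched address in a decoded response, or None."""
    matches = payload.get("result", {}).get("addressMatches", [])
    return matches[0].get("matchedAddress") if matches else None


def geocode(address, timeout=10):
    """Look up one address over the network and return the standardized form."""
    with urlopen(build_geocoder_url(address), timeout=timeout) as response:
        return first_match(json.load(response))
```

Calling `geocode()` once per record would still mean one HTTP request per address, so the earlier concern about tens of thousands of calls per parser run applies here too.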