openva / saberva

A parser for bulk campaign finance data files provided by the Virginia State Board of Elections.
MIT License

De-dupe donor and vendor data #10

Open waldoj opened 11 years ago

waldoj commented 11 years ago

There are a few good tools for this:

https://github.com/cjdd3b/fec-standardizer
https://github.com/huffpostdata/campfin-linker
https://github.com/open-city/dedupe

waldoj commented 11 years ago

After researching all three, I'm inclined to go with @open-city's library. @cjdd3b's appeals to me, but I think @open-city's is going to be simpler to get running.

waldoj commented 10 years ago

A few months later, I'm not so sure. I'm keeping these data in ElasticSearch now, and I'm persuaded that it's the proper vehicle for manipulating them. But of course it remains essential to de-dupe donor and vendor records.

ElasticSearch's FuzzyLikeThis query is promising. Here's a blog entry on the topic.

I'm also thinking that this problem could be offloaded by geocoding every address (basically farming out the problem to a more intelligent service), and then using the lat/lon pair combined with the name to figure out whether it's the same vendor / contributor. At 0.8¢/query via Yahoo, that could get expensive fast. (Over $900, by my math!) Google sells the same service, but it's apparently so expensive that they're not even naming the price. :-/
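
To make the lat/lon-plus-name idea concrete, here is a rough Python sketch. It assumes each record has already been geocoded into a dict with hypothetical `name`, `lat`, and `lon` keys; those field names, the helper functions, and the 4-decimal rounding are illustrative choices, not anything Saberva does today.

```python
import re
from collections import defaultdict


def normalize_name(name):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())


def dedupe_key(record, precision=4):
    """Build a match key from the normalized name and rounded coordinates.

    Rounding to 4 decimal places treats points within roughly ten
    meters of each other as the same location (an assumed tolerance).
    """
    return (
        normalize_name(record["name"]),
        round(record["lat"], precision),
        round(record["lon"], precision),
    )


def group_duplicates(records):
    """Return lists of records that share a key (likely duplicates)."""
    groups = defaultdict(list)
    for record in records:
        groups[dedupe_key(record)].append(record)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical, already-geocoded sample records.
records = [
    {"name": "Acme, Inc.", "lat": 38.02931, "lon": -78.47668},
    {"name": "ACME Inc", "lat": 38.02933, "lon": -78.47671},
    {"name": "Jane Doe", "lat": 37.54072, "lon": -77.43605},
]
duplicates = group_duplicates(records)
```

With the sample above, the two Acme records land in one group and Jane Doe is left alone. Rounding is crude: two points a meter apart can still fall on opposite sides of a rounding boundary, and names only match when they are identical after normalization, so this would catch the easy cases and leave the rest for one of the fuzzy-matching tools.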