Open waldoj opened 11 years ago
After researching all three, I'm inclined to go with @open-city's library. @cjdd3b's appeals to me, but I think the former is going to be simpler to get running.
A few months later, I'm not so sure. I'm keeping this data in ElasticSearch now, and I'm persuaded that it's the proper vehicle for manipulating it. But of course it remains essential to de-dupe donor and vendor records.
ElasticSearch's FuzzyLikeThis query is promising. Here's a blog entry on the topic.
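For reference, here is a minimal sketch of what a request body for that query type might look like, built as a plain Python dict. The field names (`name`, `address`) and the sample text are hypothetical, since the actual index mapping isn't described here. `fuzzy_like_this` belongs to the older Elasticsearch query DSL (it was later removed in 2.0), so this only applies to releases that still ship it.

```python
import json


def fuzzy_like_this_body(like_text, fields, max_query_terms=12):
    """Build a search body using the legacy fuzzy_like_this query.

    `fields` are hypothetical document fields to compare against;
    `like_text` is the donor/vendor string we want near-matches for.
    """
    return {
        "query": {
            "fuzzy_like_this": {
                "fields": list(fields),
                "like_text": like_text,
                "max_query_terms": max_query_terms,
            }
        }
    }


# Example: look for records resembling a (made-up) vendor entry.
body = fuzzy_like_this_body("Acme Printing Inc 123 Main St", ["name", "address"])
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent to the index's `_search` endpoint; hits above some score threshold become candidate duplicates to review, not confirmed matches.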
I'm also thinking that this problem could be offloaded by geocoding every address (basically farming out the problem to a more intelligent service), and then using the lat/lon pair combined with the name to figure out whether it's the same vendor or contributor. At 0.8¢/query via Yahoo, that could get expensive fast. (Over $900, by my math!) Google sells the same service, but it's apparently so expensive that they're not even naming the price. :-/
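A rough sketch of the geocode-then-match idea, assuming each record has already been geocoded into `name`, `lat`, and `lon` keys (hypothetical names; no geocoding service is actually called here). Records are grouped on a normalized name plus coordinates rounded to four decimal places, which is roughly 10 m. The cost helper just restates the arithmetic: at 0.8¢ per query, $900 corresponds to about 112,500 lookups.

```python
import re
from collections import defaultdict


def normalize_name(name):
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())


def dedupe_key(name, lat, lon, precision=4):
    """Key on normalized name plus rounded coordinates.

    precision=4 is an assumption (about 10 m); tune it to the geocoder.
    """
    return (normalize_name(name), round(lat, precision), round(lon, precision))


def group_duplicates(records):
    """Group records that share a name/location key."""
    groups = defaultdict(list)
    for rec in records:
        groups[dedupe_key(rec["name"], rec["lat"], rec["lon"])].append(rec)
    return list(groups.values())


def geocoding_cost(n_addresses, cents_per_query=0.8):
    """Estimated cost in dollars of geocoding every address once."""
    return n_addresses * cents_per_query / 100


# Made-up sample data: the first two rows should land in one group.
records = [
    {"name": "Acme, Inc.", "lat": 37.54072, "lon": -77.43602},
    {"name": "ACME INC", "lat": 37.54071, "lon": -77.43604},
    {"name": "Bolt Supply", "lat": 37.54071, "lon": -77.43604},
]
for group in group_duplicates(records):
    print([r["name"] for r in group])
```

Rounding is a blunt instrument: two points a metre apart can still fall on opposite sides of a rounding boundary, and variants like "Acme Incorporated" won't normalize to the same name, so this is a first pass that would still need fuzzy name matching on top.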
There are a few good tools for this:
https://github.com/cjdd3b/fec-standardizer
https://github.com/huffpostdata/campfin-linker
https://github.com/open-city/dedupe