sc-ravagr / google-refine

Automatically exported from code.google.com/p/google-refine

Request: Tools for data verification and comparison. #384

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Tools for data verification and comparison. 

Ultimately, I would like to be able to compare a valid set of data A to an 
unrefined set B, returning a % of validity. I'm not asking for a simple exact 
string match but rather a contextual sense of correctness. 

E.g., refining a list of addresses. Users may input erroneous or invalid 
entries that need to be corrected, such as common typos. These addresses may be 
similar enough to be matched and refined by comparing them to another set of 
data containing valid addresses.

This is much like the way Google search suggests corrections for mistyped 
values.
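
As a rough illustration of the kind of matching described above, here is a 
minimal sketch in Python. It is not a Refine feature: the function name 
`suggest_correction`, the sample addresses, and the 0.8 similarity cutoff are 
all assumptions made for the example, and it uses plain character-level 
similarity from the standard library's `difflib` rather than any real 
understanding of addresses.

```python
import difflib

def suggest_correction(entry, valid_entries, cutoff=0.8):
    """Return the closest valid entry, or None if nothing is similar enough.

    Hypothetical helper: the cutoff of 0.8 is an arbitrary assumption.
    """
    # Compare case-insensitively, but return the valid entry as originally written.
    lookup = {v.lower(): v for v in valid_entries}
    matches = difflib.get_close_matches(
        entry.lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None

# Made-up sample data: set A (valid) and two entries from set B (unrefined).
valid = ["123 Main Street", "45 Oak Avenue", "9 Elm Road"]
print(suggest_correction("123 Mian Street", valid))  # typo close to a valid entry
print(suggest_correction("77 Unknown Blvd", valid))  # nothing similar enough
```

A character-level comparison like this catches common typos, but it cannot 
tell whether an address actually exists.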

Original issue reported on code.google.com by EHOPstore@gmail.com on 19 May 2011 at 7:28

GoogleCodeExporter commented 9 years ago
I'm having trouble understanding what form this would take.  Is it some 
user-defined set of rules (and associated rule language) or ...  

Is the % a probability that a value is correct, or is it the percentage of 
values which are valid (implying that Refine would have to be able to discern 
valid/invalid with 100% accuracy)?
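
To make the two readings of "%" concrete, here is a small Python sketch. It 
is only an illustration under assumptions: the helper names `best_similarity` 
and `percent_valid` and the sample data are invented, a `difflib` similarity 
ratio stands in for a "probability" (which it is not, strictly speaking), and 
"valid" is taken to mean an exact case-insensitive match.

```python
import difflib

def best_similarity(entry, valid_entries):
    """Reading 1: a per-value score, the highest similarity ratio (0.0 to 1.0)
    between this entry and any valid entry."""
    return max(
        difflib.SequenceMatcher(None, entry.lower(), v.lower()).ratio()
        for v in valid_entries
    )

def percent_valid(entries, valid_entries):
    """Reading 2: the percentage of entries that exactly match a valid entry
    (case-insensitive)."""
    if not entries:
        return 0.0
    valid_set = {v.lower() for v in valid_entries}
    hits = sum(1 for e in entries if e.lower() in valid_set)
    return 100.0 * hits / len(entries)

# Made-up sample data.
valid = ["123 Main Street", "45 Oak Avenue", "9 Elm Road"]
unrefined = ["123 Main Street", "45 Oak Avenu", "9 Elm Road", "somewhere"]

for entry in unrefined:
    print(entry, round(best_similarity(entry, valid), 2))
print(percent_valid(unrefined, valid))  # 2 of the 4 entries match exactly
```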

If you could expand on what you're envisioning, that might help developers 
figure out how hard it would be to implement and whether it fits with the 
goals of Refine.

Original comment by tfmorris on 25 May 2011 at 5:29

GoogleCodeExporter commented 9 years ago
EHOPstore, you might be interested in, or have an investment interest in, 
using some of the USPS Address Verification & Address Quality solutions at 
http://www.usps.com/business/addressverification/welcome.htm  or contacting 
them directly, as I have done in the past: 
http://www.usps.com/ncsc/ziplookup/contactinfo.htm

At my job we have a nightly Talend ETL process that scrubs one of our 
databases against one of those vendor software packages (CASS Certified). We 
review the results using our AEC and sometimes even clean manually with simple 
tools, including Refine at times.

You might also look at Orange http://orange.biolab.si  and perhaps try a 
learner and classifier solution to the problem, if the data set sample is 
large enough to support predictions (just ask them for help on their forum).

You're probably looking for something like a custom reconciliation service to 
use in Refine that would utilize a CASS Certified vendor's address 
verification (or, if you don't need the solution to be CASS Certified, perhaps 
a Google Maps API Premier license or another web API factory out there).

Outside of what I just mentioned, you can certainly do a lot with just 
Faceting, Splitting, and Crossing between project data sets, as demonstrated 
here: 
http://feedproxy.google.com/~r/ouseful/~3/yCUHpNJghxo/

Original comment by thadguidry on 26 May 2011 at 4:01