Open karthik opened 10 years ago
WRT your question: seems like yes to me.
I wonder if you could have the function that detects the errors collect them (e.g., in attributes element of an S3 object), then the function that fixes the errors simply pulls in that metadata of what errors to fix, fixes them, and returns the fixed dataset.
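A minimal sketch of that idea, in case it helps the discussion. The names `detect_issues`, `fix_issues`, the `checked_data` class and the `issues` attribute are all hypothetical (nothing here exists in testdat), and stray whitespace is used only as a stand-in for whatever checks the package ends up with:

```r
# Hypothetical detector: finds cells with leading/trailing whitespace and
# stores the findings in an attribute of the returned object.
detect_issues <- function(df) {
  chr_cols <- names(df)[vapply(df, is.character, logical(1))]
  issues <- lapply(chr_cols, function(col) {
    which(!is.na(df[[col]]) & df[[col]] != trimws(df[[col]]))
  })
  names(issues) <- chr_cols
  issues <- issues[lengths(issues) > 0]      # keep only columns with problems
  attr(df, "issues") <- list(whitespace = issues)
  class(df) <- c("checked_data", class(df))  # assumed S3 class name
  df
}

# Hypothetical fixer: reads the recorded issues and repairs only those cells.
fix_issues <- function(x) {
  stopifnot(inherits(x, "checked_data"))
  ws <- attr(x, "issues")$whitespace
  for (col in names(ws)) {
    rows <- ws[[col]]
    x[[col]][rows] <- trimws(x[[col]][rows])
  }
  attr(x, "issues") <- NULL
  class(x) <- setdiff(class(x), "checked_data")
  x
}

dat <- data.frame(site = c("A", " B", "C "), n = c(1, 2, 3),
                  stringsAsFactors = FALSE)
checked <- detect_issues(dat)
attr(checked, "issues")$whitespace  # row indices flagged, per column
clean <- fix_issues(checked)
```

Keeping the issue list on the object means the fixer never has to re-scan the data, and the list itself could double as a record of what was changed.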
WRT checking for outliers: a simple wrapper function around GGally::ggpairs might be useful for visually looking for outliers across any set of columns.
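Roughly what such a wrapper could look like. This is only a sketch: `numeric_subset` and `plot_outlier_pairs` are made-up names, it assumes GGally is installed, and it simply drops non-numeric columns before plotting:

```r
# Hypothetical helper: keep only the numeric columns among `cols`.
numeric_subset <- function(df, cols = names(df)) {
  sub <- df[, cols, drop = FALSE]
  sub[, vapply(sub, is.numeric, logical(1)), drop = FALSE]
}

# Hypothetical wrapper around GGally::ggpairs for eyeballing outliers.
plot_outlier_pairs <- function(df, cols = names(df)) {
  if (!requireNamespace("GGally", quietly = TRUE)) {
    stop("The GGally package is required for this plot.")
  }
  GGally::ggpairs(numeric_subset(df, cols))
}

# Example call (left commented out so the sketch loads without GGally):
# plot_outlier_pairs(iris, cols = c("Sepal.Length", "Sepal.Width", "Species"))
```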
Sounds like a great idea @sckott. Feel free to add stuff to the package if you have any interest in working on it.
Interesting idea re: the S3 object. I'll think on it some more but seems like it could be useful in a provenance context. Like
original_data <- read.csv(...)
issues <- test_dat("Testing data for following issues", {
...
})
clean_data <- fix_dat(original_data, issues)
But for first pass it might just be simple to have a few function calls to fix issues.
Just realized that what testdat would do is become a programmatic equivalent of Google Refine (now Open Refine).
good analogy
WRT outliers, a useful addendum to @sckott's suggestion of a plot would be to add a set of criteria to establish outliers. These could be percentiles, or maybe standard deviations if it's normalish data, and then we plot the outliers in different colors. Perhaps even put numbers by the points so people can quickly identify the row number. I can try and whip something up. Here's a definition from NIST.
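To make the criteria concrete, here is a rough sketch in base R. `flag_outliers` and `plot_outliers` are hypothetical names, and the defaults (2.5%/97.5% percentiles, 2 standard deviations) are placeholder assumptions, not a recommendation:

```r
# Hypothetical: flag values outside percentile limits or outside
# mean +/- n_sd standard deviations.
flag_outliers <- function(x, method = c("percentile", "sd"),
                          probs = c(0.025, 0.975), n_sd = 2) {
  method <- match.arg(method)
  if (method == "percentile") {
    lim <- stats::quantile(x, probs = probs, na.rm = TRUE)
  } else {
    m <- mean(x, na.rm = TRUE)
    s <- stats::sd(x, na.rm = TRUE)
    lim <- c(m - n_sd * s, m + n_sd * s)
  }
  !is.na(x) & (x < lim[1] | x > lim[2])
}

# Hypothetical: color flagged points red and label them with row numbers.
plot_outliers <- function(x, ...) {
  out <- flag_outliers(x, ...)
  plot(seq_along(x), x, col = ifelse(out, "red", "black"), pch = 19,
       xlab = "Row", ylab = "Value")
  if (any(out)) {
    text(which(out), x[out], labels = which(out), pos = 2)
  }
  invisible(out)
}

x <- c(10, 11, 9, 10, 12, 10, 11, 9, 10, 50)
which(flag_outliers(x, method = "sd"))          # row 10 is flagged
which(flag_outliers(x, method = "percentile"))  # row 10 is flagged
# plot_outliers(x, method = "sd")
```

Returning a logical vector keeps the flagging separate from the plotting, so the same flags could feed a table of row numbers instead of a figure.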
Looks like putting numbers near the points might be pretty cramped. Here's a gist that will create a plot. Is this what you had in mind (before I incorporate it into the package)? https://gist.github.com/emhart/9025719
Looks good to me @emhart. But you could just number the points that are outliers, right? Like the 5% outliers, or 2.5%, or whatever.
Yeah, it looks like just labeling the outliers would still be tight.
We could try out an interactive shiny/rcharts version where you can hover over points to get their metadata?
Sounds great. I've jotted down all these for implementation.
Here's a quick roadmap for the package. The goal is to have a full test suite that folks can run on their tabular data to identify problems and issues. These can be as common as finding UTF-8 characters or unintended spaces in cells, and also finding malformed characters (e.g., date patterns).
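As a rough illustration of what a few of those checks could look like in base R. The function names are hypothetical, the character check is interpreted here as "contains non-ASCII characters" (an assumption), and the date check assumes an ISO `YYYY-MM-DD` format:

```r
# Hypothetical check: cells with leading or trailing whitespace.
has_stray_space <- function(x) !is.na(x) & x != trimws(x)

# Hypothetical check: cells that cannot be represented in plain ASCII.
# iconv() returns NA for strings it cannot convert.
has_non_ascii <- function(x) {
  !is.na(x) & is.na(iconv(x, from = "UTF-8", to = "ASCII"))
}

# Hypothetical check: cells that do not parse as a date in `format`.
has_bad_date <- function(x, format = "%Y-%m-%d") {
  !is.na(x) & is.na(as.Date(x, format = format))
}

dat <- data.frame(
  name    = c("Anna", "Jos\u00e9", " Li"),
  sampled = c("2014-02-15", "15/02/2014", "unknown"),
  stringsAsFactors = FALSE
)

which(has_stray_space(dat$name))  # row 3
which(has_non_ascii(dat$name))    # row 2
which(has_bad_date(dat$sampled))  # rows 2 and 3
```

Each check returns a logical vector the same length as the column, so results can be combined or reported by row number.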
Right now I have a dozen or so ‘messy’ datasets to work with.
Basic function to implement
1.5, 1.6, 1.98, 17
Question: Would it be worth implementing a set of matching functions to fix the issues as well? With code unit testing one can only identify problems and point out where fixes need to occur. Here we can actually go through and clean everything up.