Roadmap for testdat - GitHub issues

karthik commented 10 years ago

Here's a quick roadmap for the package. The goal is to have a full test suite that folks can run on their tabular data to identify problems. These can be as common as unexpected UTF-8 characters, unintended spaces in cells, and malformed values (e.g. inconsistent date patterns).

Right now I have a dozen or so ‘messy’ datasets to work with.

Basic functions to implement

[ ] Pattern matching. Test that the data in a vector match a regex pattern.
[ ] Length. Test that all data in a vector are of a specified length.
[x] Check for extra spaces.
[ ] Check for missing values.
[x] Check for outliers. Somewhat tricky, but a common use case would be identifying typos, e.g. the 17 in 1.5, 1.6, 1.98, 17.
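
As a rough illustration of the first four checks in this list, here is a minimal base R sketch. The function names (`matches_pattern`, `has_length`, `has_extra_spaces`, `is_missing`) are hypothetical and are not testdat's actual API. Each returns a logical vector so that problem rows can be located with `which()`.

```r
# Hypothetical helper names; none of these are testdat's actual API.

# TRUE for each element that matches the regex pattern
matches_pattern <- function(x, pattern) {
  grepl(pattern, as.character(x))
}

# TRUE for each element with exactly n characters
has_length <- function(x, n) {
  nchar(as.character(x)) == n
}

# TRUE for each element with leading, trailing, or repeated whitespace
has_extra_spaces <- function(x) {
  grepl("^[[:space:]]|[[:space:]]$|[[:space:]]{2,}", as.character(x))
}

# TRUE for each element that is NA or an empty string
is_missing <- function(x) {
  is.na(x) | as.character(x) == ""
}

# Example: find the positions of dates that are not in YYYY-MM-DD form
dates <- c("2014-02-15", "15/02/2014")
bad_rows <- which(!matches_pattern(dates, "^[0-9]{4}-[0-9]{2}-[0-9]{2}$"))
```

Returning per-element logicals rather than a single pass/fail keeps the row positions available for a later fixing step.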

Question: Would it be worth implementing a set of matching functions to fix the issues as well? With unit testing of code, one can only identify problems and point out where fixes need to occur. Here we can actually go through and clean everything up.

sckott commented 10 years ago

WRT your question: seems like yes to me.

I wonder if you could have the function that detects the errors collect them (e.g., in an attributes element of an S3 object). The function that fixes the errors then simply pulls in that metadata of what errors to fix, fixes them, and returns the fixed dataset.
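
A minimal sketch of that detect-then-fix idea, assuming only whitespace issues are tracked. The names `detect_spaces`, `fix_issues`, the `"issues"` attribute, and the `"dat_issues"` class are all hypothetical, chosen for illustration rather than taken from testdat.

```r
# Hypothetical names throughout; a sketch of the idea, not testdat's API.

# Detect cells with leading/trailing whitespace and record them as an attribute.
detect_spaces <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  found <- lapply(names(df)[is_chr], function(col) {
    rows <- which(grepl("^[[:space:]]|[[:space:]]$", df[[col]]))
    data.frame(column = rep(col, length(rows)),
               row = rows,
               issue = rep("extra_space", length(rows)),
               stringsAsFactors = FALSE)
  })
  attr(df, "issues") <- do.call(rbind, found)
  class(df) <- c("dat_issues", class(df))
  df
}

# Generic plus method: read the recorded issues, fix them, return plain data.
fix_issues <- function(x) UseMethod("fix_issues")

fix_issues.dat_issues <- function(x) {
  issues <- attr(x, "issues")
  for (i in seq_len(NROW(issues))) {
    col <- issues$column[i]
    row <- issues$row[i]
    if (issues$issue[i] == "extra_space") {
      x[[col]][row] <- trimws(x[[col]][row])
    }
  }
  attr(x, "issues") <- NULL
  class(x) <- setdiff(class(x), "dat_issues")
  x
}

raw <- data.frame(name = c(" a", "b ", "c"), n = 1:3, stringsAsFactors = FALSE)
checked <- detect_spaces(raw)  # data plus an "issues" attribute
fixed <- fix_issues(checked)   # cleaned data, attribute removed
```

Because the issue table records column, row, and issue type, it could also double as a log of what was changed.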

sckott commented 10 years ago

WRT checking for outliers: A simple wrapper function around GGally::ggpairs might be useful for visually looking for outliers across any set of columns.
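
Such a wrapper could be as small as the sketch below. It assumes the GGally package is installed, and the names `numeric_columns` and `plot_outlier_pairs` are hypothetical. By default it restricts the pairs plot to numeric columns.

```r
# Hypothetical wrapper; assumes the GGally package is installed.

# Names of the numeric columns in a data frame
numeric_columns <- function(df) {
  names(df)[vapply(df, is.numeric, logical(1))]
}

# Pairs plot of the chosen columns for eyeballing outliers
plot_outlier_pairs <- function(df, columns = numeric_columns(df)) {
  if (!requireNamespace("GGally", quietly = TRUE)) {
    stop("Package 'GGally' is required for this plot.")
  }
  GGally::ggpairs(df, columns = columns)
}
```

Calling `plot_outlier_pairs(iris)` would plot the four numeric columns of the built-in `iris` data and skip `Species`.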

karthik commented 10 years ago

Sounds like a great idea @sckott. Feel free to add stuff to the package if you have any interest in working on it.

Interesting idea re: the S3 object. I'll think on it some more, but it seems like it could be useful in a provenance context. Something like:

original_data <- read.csv(...)
issues <- test_dat("Testing data for following issues", {
  ...
})
clean_data <- fix_dat(original_data, issues)

But for a first pass it might be simpler just to have a few function calls to fix issues.

karthik commented 10 years ago

Just realized that testdat would become a programmatic equivalent of Google Refine (now OpenRefine).

sckott commented 10 years ago

good analogy

emhart commented 10 years ago

WRT outliers, a useful addendum to @sckott's suggestion of a plot would be to add a set of criteria to establish outliers. These could be percentiles, or maybe standard deviations if it's normalish data. Then plot them with different colors, perhaps even with numbers by the points so people can quickly identify the row number. I can try and whip something up. Here's a definition from NIST.
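
The two criteria mentioned here could be sketched in base R as follows. This is not the code from the gist. `flag_outliers` and `plot_outliers` are hypothetical names, and the default cutoffs (2.5%/97.5% percentiles, 2 standard deviations) are illustrative assumptions rather than recommendations.

```r
# Hypothetical helpers; cutoffs are illustrative defaults, not recommendations.

# Flag values outside percentile limits, or more than n_sd SDs from the mean
flag_outliers <- function(x, method = c("percentile", "sd"),
                          probs = c(0.025, 0.975), n_sd = 2) {
  method <- match.arg(method)
  if (method == "percentile") {
    limits <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
    x < limits[1] | x > limits[2]
  } else {
    abs(x - mean(x, na.rm = TRUE)) > n_sd * sd(x, na.rm = TRUE)
  }
}

# Plot values by row number, coloring and labeling only the flagged points
plot_outliers <- function(x, flagged) {
  plot(seq_along(x), x, pch = 19,
       col = ifelse(flagged, "red", "black"),
       xlab = "Row number", ylab = "Value")
  idx <- which(flagged)
  if (length(idx) > 0) {
    text(idx, x[idx], labels = idx, pos = 3)
  }
}

x <- c(1.5, 1.6, 1.98, 1.7, 1.55, 1.8, 1.65, 1.9, 1.75, 17)
flagged <- flag_outliers(x, method = "sd")
```

Percentile cutoffs will typically flag the most extreme values in any dataset, so they are better read as a screening aid than as proof of an error. The standard deviation rule needs a reasonable number of observations before a single typo can exceed the threshold.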

emhart commented 10 years ago

Looks like putting numbers near the points might be pretty cramped. Here's a gist that will create a plot. Is this what you had in mind (before I incorporate it into the package)? https://gist.github.com/emhart/9025719

sckott commented 10 years ago

Looks good to me @emhart. But you could just number the points that are outliers, right? Like the 5% outliers, or 2.5%, or whatever.

emhart commented 10 years ago

Yeah, it looks like just labeling the outliers would still be tight.

sckott commented 10 years ago

We could try out an interactive shiny/rcharts version where you can hover over points to get their metadata?

karthik commented 10 years ago

Sounds great. I've jotted all of these down for implementation.

