ropensci / unconf15

rOpenSci's San Francisco hackathon/unconf 2015
http://unconf.ropensci.org

Classes/methods for evidence-based data analytic pipelines #18

Open rdpeng opened 9 years ago

rdpeng commented 9 years ago

My motivation for this issue comes from the following basic problem that I encounter every day. A scientist collects data for a given question in a given subject area and wants to be able to apply the latest and best statistical approaches to analyze the data. They come to me and ask, "what do I do?" I say there are 4-5 different packages that implement some version of the latest approach. In particular, each package implements a series of functions, with help pages, but perhaps not much else in the way of guidance on how to apply each of the functions. Occasionally there is a vignette, but not always. They say, "okay, but what do I do?"

To me, the problem is that we don't have a general framework for curating the best statistical methods out there to create a sound and evidence-based data analytic pipeline for common problems in many scientific areas. In my area of environmental health, we have some well-studied problems, but no established pipelines for analyzing the data from these problems. A few areas are a bit further ahead than we are, but I think that is the exception.

Some initial thoughts on this idea can be found in this PNAS paper (http://arxiv.org/abs/1502.03169) and in this blog post by Jeff Leek (http://simplystatistics.org/2014/12/04/repost-a-deterministic-statistical-machine/).

The goal would be to establish

  • A way by which statistical methods could be plugged and pasted together to create an analytic pipeline
  • A mechanism for benchmarking these methods (using perhaps benchmark datasets) so that better methods can be swapped in and older methods swapped out
  • A report generating mechanism that shows all the analyses conducted, to allow others to see what was done and to limit selective inference
  • A library of these pipelines so that users could pick and choose amongst them given the type of data/problem they have

There is obviously other software that implements some of this functionality (Galaxy comes to mind), and I think it would be good to steal what we can from them.

Okay, this obviously cannot be done in 2 days, but maybe some pieces of it could be started....

stephaniehicks commented 9 years ago

I love this idea and want to work on it. Making a modular pipeline to compare different statistical methods and tools to combine the output from the various methods would be incredibly useful.

I think we could make progress on the first point using ideas similar to @dgrtwo's broom and biobroom packages. (Side note: I would love to work on expanding biobroom beyond ExpressionSets to MethylSets and other Bioconductor packages for differential methylation.)
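
For readers less familiar with that pattern, here is a small illustration of what broom's tidy() produces, using a built-in dataset rather than an ExpressionSet; the common data-frame output is what would let results from different methods be combined downstream:

```r
library(broom)

# fit an ordinary linear model on a built-in dataset
fit <- lm(mpg ~ wt + hp, data = mtcars)

# tidy() returns the fitted model as a data frame: one row per term, with
# columns term, estimate, std.error, statistic, and p.value
tidy(fit)
```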

srvanderplas commented 9 years ago

I would agree with this. I've been struggling with similar problems in bioinformatics... there are a ton of packages, but it's not always clear how to go from "data I have" to "required input for methods implemented in this package" and documentation can be fairly sparse. The CRAN task views are nice for getting started on broad problems, but they too lack additional documentation, benchmarking, etc.

One (partial) solution to the lack of vignettes illustrating the entire analysis pipeline would be to create a repository where users can post reproducible analyses (maybe in an IPython-notebook-like setting?) that can be tagged as relevant to specific steps in some topical pipeline/flowchart schematic. I'm not sure how this would best be implemented, and it would require a certain "critical mass" to be effective, but I could see a resource like that being incredibly useful.


dgrtwo commented 9 years ago

This is an excellent idea. I'd be very interested in working on expanding broom and especially biobroom during the UnConf, but also on these other ideas.

> A way by which statistical methods could be plugged and pasted together to create an analytic pipeline

I think that dplyr lays the perfect groundwork for plugging-pasting analytic methods into a pipeline. broom also helps by turning model outputs tidy so the pipeline can continue, but I think there's more to be done. For starters, tools often have different inputs (different preprocessing, argument names, formats, etc). In the subSeq package, for instance, I wanted to compare multiple RNA-Seq analysis methods and therefore created handlers for each that took the same input and gave consistent output (always containing columns for coefficient and p.value).
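
As a toy illustration of that handler idea (this is not the subSeq code; the names here are made up to match the convention described above), here are two unrelated methods wrapped so they take the same input and return the same columns:

```r
# handler for a linear model: report the slope and its p-value
handler_lm <- function(data) {
  coefs <- summary(lm(y ~ x, data = data))$coefficients
  data.frame(term = "x",
             coefficient = coefs["x", "Estimate"],
             p.value = coefs["x", "Pr(>|t|)"])
}

# handler for a rank-based alternative, reporting the same columns
handler_spearman <- function(data) {
  ct <- cor.test(data$x, data$y, method = "spearman")
  data.frame(term = "x",
             coefficient = unname(ct$estimate),
             p.value = ct$p.value)
}

# either handler can be dropped into the same pipeline step
d <- data.frame(x = rnorm(50))
d$y <- 0.3 * d$x + rnorm(50)
rbind(cbind(method = "lm", handler_lm(d)),
      cbind(method = "spearman", handler_spearman(d)))
```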

> A mechanism for benchmarking these methods (using perhaps benchmark datasets) so that better methods can be swapped in and older methods swapped out

Great; this also relies on having output-tidy methods (so they can be combined, compared, and replaced). I've done some work on benchmarking and comparisons with both dplyr and purrr, and I think it's a rich area.
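
A rough sketch of what that benchmarking step could look like, reusing the hypothetical handlers from the sketch above on simulated data where the true effect is known:

```r
library(dplyr)

simulate_data <- function(effect, n = 50) {
  x <- rnorm(n)
  data.frame(x = x, y = effect * x + rnorm(n))
}

handlers <- list(lm = handler_lm, spearman = handler_spearman)

set.seed(1)
results <- bind_rows(lapply(1:200, function(i) {
  d <- simulate_data(effect = 0.4)
  bind_rows(lapply(names(handlers), function(m) {
    cbind(method = m, replicate = i, handlers[[m]](d))
  }))
}))

# because every handler reports the same columns, the comparison is one verb away
results %>%
  group_by(method) %>%
  summarize(power = mean(p.value < 0.05))
```

Swapping a newer method in (or an older one out) is then just a matter of adding or removing a handler from the list.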

> A report generating mechanism that shows all the analyses conducted, to allow others to see what was done and to limit selective inference

Perhaps one could register a session (using addTaskCallback) and then every call to a modeling function, along with its results, could be recorded. Then a generate_report function could generate a knitr document describing them. The more consistent the analysis is (in some kind of "modeling grammar", the way that dplyr provides a data manipulation grammar) the better this could work.
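
Here is a minimal sketch of that recording idea using base R's addTaskCallback(); which functions count as "modeling" functions, and the report format, are assumptions made for illustration:

```r
.analysis_log <- new.env()
.analysis_log$calls <- character(0)

record_models <- function(expr, value, ok, visible) {
  # the set of "modeling" functions to watch is an assumption for this sketch
  model_funs <- c("lm", "glm", "t.test", "wilcox.test")
  if (is.call(expr) && as.character(expr[[1]])[1] %in% model_funs) {
    .analysis_log$calls <- c(.analysis_log$calls,
                             paste(deparse(expr), collapse = "\n"))
  }
  TRUE  # returning TRUE keeps the callback registered after each top-level task
}

start_recording <- function() {
  invisible(addTaskCallback(record_models, name = "model-log"))
}

generate_report <- function(file = "analysis-log.Rmd") {
  fence <- paste(rep("`", 3), collapse = "")
  chunks <- vapply(.analysis_log$calls,
                   function(cl) paste0(fence, "{r}\n", cl, "\n", fence),
                   character(1))
  writeLines(c("# Analyses conducted this session", "", chunks), file)
}
```

generate_report here just writes each recorded call into its own knitr chunk; a fuller version would also capture results, package versions, and session info.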

> A library of these pipelines so that users could pick and choose amongst them given the type of data/problem they have

Is anyone interested in building on @jennybc's R Graph Catalog? I think this kind of gallery/library, not just for graphs but for any kind of example analysis, could be invaluable (its use of tagging and searching is great). It should be made easy to submit and curate an example (probably in the form of knitr documents).

kellieotto commented 9 years ago

I love the idea of building a sort of R Graph Catalog but for analyses/models. It would be nice to have short vignettes along with the code, describing the type of data inputs that are appropriate and how to interpret the output.