ropensci / rrrpkg

Use of an R package to facilitate reproducible research
255 stars 25 forks source link

rrrpkg

Use of an R package to facilitate reproducible research

What is a research compendium?

We introduce the concept of a compendium as both a container for the different elements that make up the document and its computations (i.e. text, code, data,...), and as a means for distributing, managing and updating the collection. - Gentleman, R. and Temple Lang, D. (2004)

The goal of a research compendium is to provide a standard and easily recognisable way for organising a reproducible research project with R. A research compendium is ideal for projects that result in the publication of a paper because then readers of the paper can access the code and data that generated the results in the paper. A research compendium is a convention for how you organise your research artefacts into directories. The guiding principle in creating a research compendium is to organise your files following conventions that many people use. Following these conventions will help other people instantly familiarise themselves with the structure of your project, and also support tool building which takes advantage of the shared structure.

Some of the earliest examples of this approach can be found in Robert Gentleman and Duncan Temple Lang's 2004 paper "Statistical Analyses and Reproducible Research" Bioconductor Project Working Papers and Gentleman's 2005 article "Reproducible Research: A Bioinformatics Case Study" in Statistical Applications in Genetics and Molecular Biology. Since then there has been a substantial increase in the use of R as a research tool in many fields, and numerous improvements in the ease of making R packages. This means that making a research compendium based on an R package is now a practical solution to the challenges of organising and communicating research results for many scientists.

Why create a research compendium?

Using research compendia simplifies file management and streamlines analytical workflows, making your research more efficient. A compendium makes it easier to communicate your work with other researchers (and your future self), to demonstrate the correctness of your results. This can lead to higher visibility of your work, receiving credit for code as well as the paper, a boost in citations, and allows others to more easily build on your work.

How to make a research compendium

At its simplest, a research compendium might consist of a single file of R code with inline comments documenting the workflow. A slightly more complex approach might be a R markdown file with text and code in the same source document. In many cases these simple approaches will be ideal, and more elaborate organisation would add unnecessary complexity and points of failure. But many projects will require some additional organisation to make them easier to work with. An ideal organisation for a more complex project would look like this:

If you’re familiar with R packages, you’ll notice many similarities with these conventions. But there are some differences:

More complex research compendia include other package elements such as a licence, tests, continuous integration, and dependencies external to R, such as a dockerfile to replicate the computational environment that the analyses were originally conducted in.

How to share a research compendium

You should prepare your compendium using a version control system such as git. Then when you are ready to share it, the best way is to archive a specific commit of your compendium at a repository that issues permanent URLs such as figshare or zenodo which give DOIs for archived files. Then you can circulate the version of your compendium that is the version that generated the published results. This means you have a publicly available snapshot of the code that matches the paper. Code development can continue after the paper is published, but with a DOI that links to a specific commit, other users of the code can be confident that they have the version that matches the paper. A DOI also simplifies citation of the compendium, so you can cite it in your paper (and others can cite it in their work) using a persistent URL.

Putting your compendium on dropbox or google drive is another way to make the compendium easily available.

Getting started with a research compendium

project
|- DESCRIPTION          # project metadata and dependencies 
|- README.md            # top-level description of content and guide to users
|
|- data/                # raw data, not changed once created
|  +- my_data.csv       # data files in open formats such as TXT, CSV, TSV, etc.
|
|- analysis/            # any programmatic code 
|  +- my_scripts.R      # R code used to analyse and visualise data 

A real-world example of this simple research compendium format is online here: https://github.com/duffymeg/BroodParasiteDescription

project
|- DESCRIPTION          # project metadata and dependencies 
|- README.md            # top-level description of content and guide to users
|- NAMESPACE            # exports R functions in the package for repeated use
|- LICENSE              # specify the conditions of use and reuse of the code, data & text
|
|- data/                # raw data, not changed once created
|  +- my_data.csv       # data files in open formats such as TXT, CSV, TSV, etc.
|
|- analysis/            # any programmatic code 
|  +- my_report.Rmd     # R markdown file with R code and narrative text interwoven
|
|- R/                   #  
|  +- my_functions.R    # custom R functions that are used more than once in the project
|
|- man/
|  +- my_functions.Rd   # documentation for the R functions (auto-generated when using devtools)

This intermediate example includes the R/ and man/ directories. These contain custom functions that are used repeatedly throughout the project. The man/ directory contains the manual, or documentation on the use of the functions. The NAMESPACE and LICENSE files are also typical features of R packages.

For example, https://github.com/USEPA/LakeTrophicModelling has much of the repeatable code in R/ and the remainder of the code and text in vignettes/manuscript.Rmd.

project
|- DESCRIPTION          # project metadata and dependencies 
|- README.md            # top-level description of content and guide to users
|- NAMESPACE            # exports R functions in the package for repeated use
|- LICENSE              # specify the conditions of use and reuse of the code, data & text
|- .travis.yml          # continuous integration service hook for auto-testing at each commit
|- dockerfile           # makes a custom isolated computational environment for the project
|
|- data/                # raw data, not changed once created
|  +- my_data.csv       # data files in open formats such as TXT, CSV, TSV, etc.
|
|- analysis/            # any programmatic code
|  +- my_report.Rmd     # R markdown file with narrative text interwoven with code chunks 
|  +- makefile          # builds a PDF/HTML/DOCX file from the Rmd, code, and data files
|  +- scripts/          # code files (R, shell, etc.) used for data cleaning, analysis and visualisation 
|
|- R/                     
|  +- my_functions.R    # custom R functions that are used more than once throughout the project
|
|- man/
|  +- my_functions.Rd   # documentation for the R functions (auto-generated when using devtools)
|
|- tests/
|  +- testthat.R        # unit tests of R functions to ensure they perform as expected

Real-world examples that are similar to this more complex research compendium format are online here:

Note that although these real-world examples have a common basic R package structure, they show quite a bit of variation in the location of things like the dockerfile, and the use of package features like the inst/ and vignettes/ directories. This kind of variation does not affect the function of the compendium as a package, and largely reflects personal choices about what kind of file organisation makes the most sense to each researcher.

Useful tools and templates for making research compendia

These templates are empty packages that show various ways of organising an analysis as an R package (eg. where the manuscript is the package vignette, or similarly bundled with the package)

Challenges for future work

Further reading

rOpenSci Guide to Reproducible Research

Gandrud C 2013 Reproducible Research with R and RStudio. CRC Press Florida

Gentleman, R. and Temple Lang, D. (2007). Statistical analyses and reproducible research. Journal of Computational and Graphical Statistics 16, 1–23

Gentleman, R. and Temple Lang, D. "Statistical Analyses and Reproducible Research" (May 2004). Bioconductor Project Working Papers.

Stodden, V and Miguez, S 2014. Best Practices for Computational Science: Software Infrastructure and Environments for Reproducible and Extensible Research. Journal of Open Research Software 2(1):e21, DOI: http://dx.doi.org/10.5334/jors.ay

Wickham, Hadley, R Packages: Organise, test, document, and share your code. O’Reilly.

Colophon

This document was the result of discussions at the 2015 rOpenSci unconference (cf. https://github.com/ropensci/unconf/issues/11 and https://github.com/ropensci/unconf/issues/31). Contributors to the discussion include... [if you were in the rOpenSci unconf breakout on this topic please add your name via a Pull Request]. This document was initially drafted by Hadley Wickham, with later contributions from Ben Marwick. Additional contributions are welcome! Please post an issue to ask questions and discuss suggestions.