egouldo opened this issue 6 years ago
Great write-up @egouldo, and thanks for the links @njtierney! Can't wait to have a play with liftr this weekend. I wonder how extendable it is for an analysis that isn't conducted within an .Rmd.
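From a skim of the liftr docs, the basic workflow looks to be: add a `liftr` block to the Rmd's YAML header, then call `lift()` and `render_docker()`. A rough sketch (the file name is just a placeholder):

```r
library("liftr")

# lift() reads the liftr block in the Rmd's YAML header and
# writes a Dockerfile next to the document
lift("analysis.Rmd")

# render_docker() builds the image and renders the document
# inside the resulting container
render_docker("analysis.Rmd")
```

So it does look fairly Rmd-centric out of the box, hence my question above.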
This sounds very cool! You might also want to take a look at https://github.com/o2r-project/containerit, another R package with somewhat similar aspirations.
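For a flavour of it, containerit can generate a Dockerfile straight from your current session. A minimal sketch based on its README (the output file name is a placeholder):

```r
library("containerit")

# dockerfile() captures the current session (R version, attached
# packages) as a Dockerfile object
df <- dockerfile(from = sessionInfo())

# write() serialises the Dockerfile object to disk
write(df, file = "Dockerfile")
```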
Another related thing might be to create a Dockerfile + Binder badge/link, allowing users to run the code on Binder without ever having to install Docker or anything else locally (see the Binder badge in this example, https://github.com/cboettig/noise-phenomena, which launches the Dockerfile in the repo on Binder). Related efforts include emerging platforms like https://wholetale.org/ (which also uses Rocker images) and https://codeocean.com/
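For anyone who hasn't made one before, the badge itself is just a markdown image link pointing at mybinder.org; with a hypothetical `<user>/<repo>`, it looks like:

```markdown
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/<user>/<repo>/master)
```

As I understand it, if a Dockerfile sits in the repo root, Binder builds the environment from that rather than from its default image.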
I was just showing @egouldo and @stevekambouris noise-phenomena earlier today. Great example!
@nuest we are going to hack away at some of the open issues on containerit! Heads-up =)
Our fork is at ropenscilabs/containerit
Problem
During a computational replication, many sources of error can cause the replication to fail. One critical component is the computing environment: if any dependencies, such as R packages, are no longer available, anyone wishing to reproduce your R analyses will be unable to do so. Tools like Docker and the Rocker Project provide fully containerised environments, including all dependencies, for reproducing R analyses and projects.
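For illustration, a minimal Rocker-based Dockerfile might look something like the sketch below (the image tag, package, and file names are placeholders, not a prescription):

```dockerfile
# rocker/r-ver pins an exact R version; package installs resolve
# against a date-fixed snapshot matching that version
FROM rocker/r-ver:3.5.0

# install2.r (from littler, shipped with Rocker images) installs
# the analysis' package dependencies into the image
RUN install2.r --error tidyverse

# bundle the code and data so the container is self-contained
COPY . /home/analysis

# re-run the analysis when the container starts
CMD ["Rscript", "/home/analysis/run.R"]
```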
Unfortunately, this model of facilitating computational reproducibility across machines and analysts is difficult for the regular R user wishing to time-capsule their work: getting Docker up and running takes specialised knowledge and a good deal of time. Some folk might not even know that Docker exists!
Consequently, one of the most common models of open science involves authors submitting data and code to repositories like Dryad and then providing the link inside their journal article. Whilst this ticks the transparency box of open science, it certainly does not guarantee reproducibility, for the reasons outlined above.
Proposed solution
The fundamental objective is to create some sort of time capsule:
The goal of the package (and the Shiny app, if we get there) is to create a "Docker-like" system where the user can:
a) match the environment, such that you can at least get the code to run
b) run the code, in a make-like manner
c) access the computing environment, such that you can engage with raw, intermediate, and output objects in the data analysis pipeline of a scientific study, to check the validity of the coding implementation of its analyses.
It should make the process of going from code, data, packages, and some set of assembly instructions --> Docker EASY! The ultimate aim of making this process easy is to generate more reproducible scientific outputs, such that independent analysts can 1. obtain and 2. re-run scientific analyses, and, hopefully, reproduce them!
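To make that concrete, here is a purely hypothetical sketch of what the user-facing workflow could look like -- none of these functions exist yet, the names are invented for illustration only:

```r
# Hypothetical API sketch -- illustrative only; nothing here exists yet.

# a) snapshot the environment: record R version, packages, code, and
#    data, and write a Dockerfile plus assembly instructions
time_capsule("my-analysis/")

# b) re-run the full analysis pipeline, make-like, inside the capsule
capsule_run("my-analysis/")

# c) open the capsule's environment to inspect raw, intermediate,
#    and output objects from the pipeline
capsule_inspect("my-analysis/")
```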
Thanks
Thank you to @smwindecker and @stevekambouris for the initial ideas and impromptu workshopping today!