ropensci / unconf14

Repo to brainstorm ideas (unconference style) for the rOpenSci hackathon.

Reproducibility #22

Open karthik opened 10 years ago

karthik commented 10 years ago

I think one thing we should really tackle, if possible, is the issue of reproducibility. Outside of our expert/super-user bubble, regular scientists rarely use the suite of tools that we rely on every day. Hardly any papers are published as plain .Rmds that reproduce the entire paper (sans the journal's own styling).

What are those roadblocks, and which parts of that pipeline can we streamline with the kind of higher-level tools that Hadley is known to write?

Moving from #18

karthik commented 10 years ago

Comments from earlier thread:

@jhollist says:

The Reproducibility problem strikes home for me.

I am a recent convert to the R Markdown/knitr/pandoc/makefile tool set and am quite enamored of it; however, many of my colleagues often point out that I am the unusual one and in spite of my proselytizing they are very unlikely to switch away from using Word. We could certainly make some progress by continuing to encourage others to try R Markdown/knitr/... and incorporating the same into undergraduate and graduate education, but that means we are at least a generation away from seeing significant changes.

I wonder if this group could make some progress towards making the existing tool set used by most scientists more reproducible. It seems that tackling reproducibility from the MS Word side could have the greatest impact. This would be similar to the way the DataUP project approached the problem of trying to get more scientists to better manage data and submit it to DataONE. They worked with Microsoft Research to develop DataUp to work directly with Excel, and have since moved most of that functionality to the cloud. I am not sure I am suggesting that route, just using it as a somewhat relevant example.

Given that I am most certainly on the extreme novice side of development (even more so in this group!), I have no ideas on how we might develop something for Word that could make it part of a reproducible workflow, or even whether that is possible. But seeing that Word, and Office in general, are moving to the cloud, it seems like incorporating reproducible analysis via something like OpenCPU is more feasible than it has ever been (forgive me if I am talking nonsense here).

In any event, having been reminded multiple times by several of my co-workers that they just aren't going to spend the time hacking that I do, it seems that to really increase the reproducibility of science we need to address the problem where much of that science is actually happening. And unfortunately, that isn't exclusively with the tools we think of as reproducible (i.e. R, Python, etc.).

karthik commented 10 years ago

From @EDiLD

I personally don't like knitr or Sweave for writing research papers. However, I can still maintain reproducible research.

Some thoughts:

I like LaTeX more than Markdown. It's a little more complicated, but gives you much, much more flexibility. Also, some journals accept LaTeX submissions, though many insist on a .doc file (grr....).

For a research paper, code changes a lot and develops over time, and much of the code is not used in the final paper. The same goes for the text. Therefore I keep the two separate.

My workflow/structure/setup is something like this.

Folder structure:

/data   -> raw data files
/cache  -> cached intermediate files (e.g. after cleaning)
/src    -> R home
/report -> LaTeX home

...and perhaps some others. For R projects I follow Rob Hyndman:

/src/load.R
/src/functions.R
/src/clean.R
/src/do.R

...and perhaps some others. All the paths are set up in load.R (and used via file.path()), as sketched below.
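As a hedged illustration (the file names here are hypothetical), every downstream script then does its I/O through those path variables rather than through absolute paths:

# e.g. in /src/clean.R, assuming datadir and cachedir were defined in load.R
dat     <- read.csv(file.path(datadir, "raw_data.csv"))
cleaned <- na.omit(dat)  # stand-in for the real cleaning steps
saveRDS(cleaned, file.path(cachedir, "cleaned.rds"))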

R code produces figures, which are stored in /report/fig and then included in the LaTeX file. If I change anything in my code, the figures are also updated in the LaTeX output.
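A minimal sketch of that step (the plot and file names are hypothetical, and prj is the project root defined in load.R):

# e.g. in /src/do.R
library(ggplot2)
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
# write into the LaTeX tree; the next LaTeX compile picks up the new
# version via \includegraphics{fig/fig1}
ggsave(file.path(prj, "report", "fig", "fig1.pdf"), p, width = 5, height = 4)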

Generally I first develop code and then write the paper. With R and LaTeX separated, I can develop the code first and, once it's finished, write the paper. I can later change some code, recompile the LaTeX doc, and have an updated version with the new results.

If I publish, I just put my folder in the supplement, and reproducers only need to change one path in load.R (this is also explained in a README file).

So this is my workflow explained in a few lines; I hope it is understandable. For research papers I would not want to miss the functionality of LaTeX. Markdown is easy, but in my opinion not flexible enough. I think that with this workflow I can also ensure reproducible research (in the sense of scientific publications) without using knitr/Sweave.

I am interested in your thoughts on my workflow and in your experience writing scientific papers with Markdown...

karthik commented 10 years ago

From @mfenner

I'm not very interested in extending Microsoft Word or Excel - only to the extent that I can import/export from the tools I use. For me, reproducibility is very much linked to automation, and I just don't see how that can be done easily in those applications. Markdown, GitHub, Pandoc, Travis, etc. might look geeky now, but I'm happy to go in that direction.

jeroen commented 10 years ago

I would like to add that reproducibility is not limited to generating reports. In all likelihood, data analysis will soon move to cloud-based infrastructures, where tools and principles from markdown/knitr will become a more accessible and natural part of the analysis process, even for users relying on a GUI.

However, much more challenging than weaving results into a document is software versioning. In OpenCPU I try to make the API reproducible by design, such that each resource stored on the system can naturally be recreated. This works, until a package is updated. We can't guarantee that results obtained with lme4_0.999999-2 last year can be reproduced using lme4_1.0-6.

My experience is that, in practice, almost no R script or document over two years old can be reproduced with current versions of R and packages. I personally think addressing this problem is at least as important as weaving tools.

cboettig commented 10 years ago

Echoing some themes here:

I believe the major obstacles to reproducibility are:

a. people not sharing any code to begin with,
b. the challenges of software versioning that @jeroenooms mentions (which include the challenge in #19), and
c. scripts that use local paths or refer to local data files.

As much as I love knitr/sweave, I don't think it solves any of these issues. It does not address (a) well because it is simply not the easiest route to code sharing for most users. It is much easier to ask folks to provide a script, preferably as a plain-text file at a permanent URL, with the minimal metadata discussed in https://github.com/mozillascience/code-research-object/issues/2.
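As a hedged sketch of how low that bar could be (the URL below is a placeholder, not a real archive):

# anyone can re-run the shared analysis straight from its permanent URL
source("https://example.org/archive/analysis-v1.R")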

The software versioning issue was one of the primary challenges the NESCent informatics team identified when trying to replicate my paper (see https://github.com/cboettig/prosecutors-fallacy/issues/1 and https://github.com/cboettig/prosecutors-fallacy/issues/2), and I believe other papers as well in their exercise. So even if knitr were widely used in publications (which admittedly would solve (a)), this problem would remain. I don't have many ideas on how to address this one effectively, but would love to hear more thoughts.

The local paths problem is really just a data archiving issue, and one I think we can address well with APIs to data publication repositories, particularly those whose APIs have good support for pre-release/private sharing of data. This should mean that a user can replace the local paths with remote paths to archived data well before publication.
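A minimal sketch of the end state (the deposit URL is a placeholder for a real repository deposit):

# the local file.path() call is swapped for a permanent, citable URL
dat <- read.csv("https://example-repository.org/deposit/1234/raw_data.csv")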

eduardszoecs commented 10 years ago

@cboettig, re a): yes, that's also my way. I simply submit the code with a README file. The README explains what each script does and where the path needs to be adapted.

b) The minimum information would be Sys.info() and sessionInfo() in the README; reproducers could then try to set up the same packages/versions on their machines. One could also submit a snapshot of the R version and the packages... but this is likely to get humongous... I think there was also an announcement on R-bloggers some months ago of a system/package doing this (I have forgotten the name :( ).
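A minimal sketch of capturing that information (the output file name is arbitrary):

# record platform details and package versions next to the code
writeLines(c(capture.output(Sys.info()),
             capture.output(sessionInfo())),
           "README_sessionInfo.txt")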

c) Yes, a fixed URL would be a solution. But I think one can expect that reproducers are able to change one path in the script, especially if it is described in a README. E.g., from one of my projects (where I cannot share the data, as it is proprietary :():

# load.R
#####
### Setup project structure
#####

## Project Path
## You have to change this!
prj <- "/home/edisz/Documents/Uni/Projects/mesocosm_methods/review/"

## Subfolder paths
datadir <- file.path(prj, "data")   # data
srcdir <- file.path(prj, "src")     # source code
cachedir <- file.path(prj, "cache") # caching objects

hilaryparker commented 10 years ago

I'm not sure if this has been touched upon recently, but one key revelation I had when making projects reproducible was that there are different "levels" of reproducibility. Here is what I mean by that:

Workflow: Raw data accessed via a separate script -> Raw dataframe in R -> Preprocessed/cleaned dataframe in R -> Statistical work/scripts in R -> R code for graphics and tables -> knit document

(These can obviously be collapsed at different points, etc.) I have a personal workflow that separates out these different steps. I can then reproduce a file from different points -- for example, I might want to play with the code for graphics, but don't want to re-pull the data every time I do it. I've played around with the idea of there being different "levels" of reproducibility (i.e. level I, level II, etc.); see the sketch below. I think this is also a concept that a lot of scientists haven't teased apart yet. Having this framework standardized could be really, really helpful.
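A minimal sketch of those checkpoints (pull_raw_data() is a hypothetical stand-in for the data-access script):

# level I: the expensive data pull, cached so later levels can skip it
if (!file.exists("cache/raw.rds")) {
  raw <- pull_raw_data()
  saveRDS(raw, "cache/raw.rds")
} else {
  raw <- readRDS("cache/raw.rds")
}
# level II: preprocessing/cleaning, also checkpointed
clean <- na.omit(raw)
saveRDS(clean, "cache/clean.rds")
# levels III and up (models, graphics, knitting) start from clean.rds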

Also, a lot of my thinking on this was shaped by using the ProjectTemplate package (https://github.com/johnmyleswhite/ProjectTemplate). Could we leverage that?

benmarwick commented 10 years ago

Great observation; we are musing on exactly these thoughts in https://github.com/ropensci/reproducibility-guide/issues/5 and https://github.com/ropensci/reproducibility-guide/issues/4.