ropensci / unconf15

rOpenSci's San Francisco hackathon/unconf 2015
http://unconf.ropensci.org

Packages as research repositories/compendia #11

Open benmarwick opened 9 years ago

benmarwick commented 9 years ago

At last year's rOpenSci event we worked on a short guide to reproducible research, under @iamciera's guidance. Some of the most interesting progress on this topic since then has been on using the R package framework as a research repository or compendium for scholarly work, cf. @rmflight's blog posts, @cboettig's template package, @pakillo's template package and @jhollist's manuscriptPackage, etc.

The concept of a research compendium has been around for a while (cf. Gentleman 2005, Gentleman & Temple Lang 2007, Stodden 2009, Leisch et al. 2011). Many of us are making custom R packages to accompany our research publications to improve reproducibility, but I think there are a bunch of open questions about the best ways to do this.

Perhaps at the unconf we can have a discussion to share some of the ways we're using R packages as research compendia, and draft a few guidelines to add to the guide. The goal would be to help domain scientists, especially those who are not primarily tool developers or already prolific package authors, get started with this. @hadley's book is of course an excellent resource on R packages generally, but using packages as research compendia raises some specialised questions that this rOpenSci group is uniquely qualified to tackle.

Some of the questions that I'd like to learn more about on this topic include:

cboettig commented 9 years ago

Not surprisingly I'm also interested in this. To add to the list:

richfitz commented 9 years ago

:+1: seems like a good idea, possibly also linking with #6 as (a) what is a manuscript if not an artefact of research? and (b) how do you store outputs with the compendium?

jhollist commented 9 years ago

I won't be at the unconf (will be following along remotely), but wanted to add my support to the idea of discussing packages as research compendia. A couple of observations:

I may sound like I am down on the idea of using packages, but I am not. I have found a lot of benefit in using the package format and specifically using the vignette as manuscript. I will use the model again and anything that comes out of this discussion would be great to include for my next manuscript.

jordansread commented 9 years ago

I think there is some good discussion to be had as to whether the goal of the reproducibility charge is the end-to-end publication target (including the issues pointed out above w/ citation management) or the generation of publication components that are data/code/methods related. This is a topic that I am very interested in. Some of the things we have been working on are geared more towards using R packaging (or something like it) in larger collaborations, both for the analytical tools and for building the product components (figs/tables) as part of a project-level CI. I think there is much to be done in terms of steering the research process towards reproducibility, and it is going to become more important as data/questions increase in complexity and the teams continue to grow and diversify.

gmbecker commented 9 years ago

@cboettig the dynamic documents vs scientific narratives question is a tough one. For my thesis I did some work on non-linear dynamic documents, where a narrative is a path through the graph of document elements. See, e.g., https://github.com/gmbecker/DynDocModel (hoping to find time to bump this back up to the /back/ burner ...). Things get very complicated very quickly, though.

Something akin to the Vistrails approach http://www.vistrails.org/index.php/Main_Page#Publishing_Reproducible_Results , with a database of code and artifacts that a dynamic/"live" paper pulls from/recomputes at compile or view time might be more useful in practice. At least in the short term.

A modification of Gavish and Donoho's proposed VCRs http://www.sciencedirect.com/science/article/pii/S1877050911001256 is another possibility, though AFAIR they call only for verification, not dynamic reproduction.
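
For readers new to the idea, "a narrative is a path through the graph of document elements" can be illustrated roughly as below. This is only a toy sketch in base R and not the DynDocModel API; the element names, the `edges` list, and both helper functions are made up for illustration.

```r
# Hypothetical document elements (text or code chunks), keyed by id.
elements <- list(
  intro   = "We model species counts.",
  model_a = "fit <- glm(count ~ site, family = poisson)",
  model_b = "fit <- glm(count ~ site + year, family = poisson)",
  results = "summary(fit)"
)

# Hypothetical directed edges: which elements may follow which.
edges <- list(
  intro   = c("model_a", "model_b"),
  model_a = "results",
  model_b = "results",
  results = character(0)
)

# A narrative is valid if every consecutive pair of elements is an edge.
is_valid_narrative <- function(path, edges) {
  if (length(path) < 2) return(all(path %in% names(edges)))
  from <- head(path, -1)
  to <- tail(path, -1)
  all(mapply(
    function(cur, nxt) cur %in% names(edges) && nxt %in% edges[[cur]],
    from, to
  ))
}

# Linearise one narrative: return its elements in path order.
render_narrative <- function(path, elements, edges) {
  stopifnot(is_valid_narrative(path, edges))
  unlist(elements[path], use.names = FALSE)
}

# Two alternative narratives drawn from the same document graph.
narrative_1 <- render_narrative(c("intro", "model_a", "results"), elements, edges)
narrative_2 <- render_narrative(c("intro", "model_b", "results"), elements, edges)
```

Even in this toy form, the number of possible paths grows quickly as elements and branches are added, which hints at the complexity mentioned above.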

stewid commented 9 years ago

Have added the method bundle_repo to git2r that might be useful in this context. It clones the package repository as a bare repo to inst/pkg.git so that when the package is installed the repo can be accessed with repo <- repository(system.file("pkg.git", package = "pkg")). I'm also planning to add the argument session (FALSE/TRUE) to the commit and tag methods to append the sessionInfo to the commit/tag message.
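
Until such an argument is available, the idea of appending the sessionInfo to a commit message can be approximated by hand. The sketch below is only an illustration: `message_with_session` is a hypothetical helper, not part of git2r, and the repository is a throwaway one in a temporary directory with placeholder author details.

```r
# Hypothetical helper (not part of git2r): append sessionInfo() output
# to a commit or tag message.
message_with_session <- function(msg) {
  session <- utils::capture.output(utils::sessionInfo())
  paste(c(msg, "", "sessionInfo:", session), collapse = "\n")
}

msg <- message_with_session("Add analysis script")

# Only touch git if git2r is installed; the repository is a throwaway.
if (requireNamespace("git2r", quietly = TRUE)) {
  path <- tempfile("compendium-")
  dir.create(path)
  repo <- git2r::init(path)
  # Placeholder identity so the commit can be created.
  git2r::config(repo, user.name = "Example Author",
                user.email = "author@example.org")
  writeLines("x <- 1", file.path(path, "analysis.R"))
  git2r::add(repo, "analysis.R")
  git2r::commit(repo, message = msg)
}
```

Keeping the session details in the commit message means the R version and loaded packages travel with the history itself, so they are still available from a bundled copy of the repository.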

tracykteal commented 9 years ago

One suggestion for tracking provenance from @metamattj is the recordr package https://github.com/NCEAS/recordr/

benmarwick commented 6 years ago

To follow up a bit on this, one of the outcomes of the 2015 unconf discussion was this essay:

https://github.com/ropensci/rrrpkg

And we expanded that into this pre-print:

https://peerj.com/preprints/3192/

Which will shortly appear in The American Statistician in a collection of papers on 'Practical Data Science for Stats'.

tracykteal commented 6 years ago

Awesome! And thanks for posting the follow up here.

tracykteal commented 6 years ago

And just to add an idea for more work :) would you be interested in writing a blog post on this to cross-post on the rOpenSci and Software/Data Carpentry blogs, or just to put on one of them? I imagine @stefaniebutland at rOpenSci and @weaverbel at SWC/DC could help.

stefaniebutland commented 6 years ago

@tracykteal Is this post on an unconf17 project relevant here? Tackling the Research Compendium at runconf17 https://ropensci.org/blog/blog/2017/06/20/checkers