Revisiting the research compendium: testing, automation, and review

noamross commented 7 years ago

At the 2015 unconf, one output was a document of best practices for research compendia: self-contained repositories for a research analysis. Many of the ideas in it, and in similar work, derive from best practices for R packages. In the past few years, there have been advances in and wider adoption of a number of R package development practices, notably package testing, automation of testing/checking/building, and best practices for code review. R-based research compendia have not yet coalesced around a similar set of practices. I would aim to build tools that help address this, guided by these questions.

What is the state of analysis workflow automation? Are there gaps, and how can they be addressed?
What should testing workflow and tooling for a research compendium look like? Should we separate testing from processing and analysis, and if so, how?
Is there a standard set of checks, a la R CMD check, that could be widely adopted?
What would an rrtools package, similar to devtools, contain to aid in creating reproducible research compendia?
How would we change or improve CI infrastructure or adopt it for research compendia?
What practices/checklists for analysis code review should we pilot or adopt?

(I have thoughts on most of these that I'll add below or in a linked document for these in a bit)

Possible outputs include

another best practices document
code review checklist
the start of an rrtools package
templates for CI or a PR to the current Travis-CI R engine
contributions to remake, tic, pkgdown, or other infrastructure packages

hadley commented 7 years ago

One simple thing would be to add some metadata to DESCRIPTION that identifies this as a compendium (not a package), and then travis could either:

call rmarkdown::render() on each .Rmd in vignettes/
if a Makefile is present, call make
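A minimal sketch of that dispatch in R, for illustration only. The function name `build_compendium()`, the injectable `render` and `run_make` arguments, and the choice to prefer the Makefile when both are present are assumptions, not an existing Travis or devtools feature.

```r
# Hypothetical helper: build a compendium as described in the comment above.
# `render` and `run_make` are injectable so the logic can be tried without
# rmarkdown or make installed; the defaults call the real tools.
build_compendium <- function(path = ".",
                             render = function(f) rmarkdown::render(f),
                             run_make = function(dir) {
                               system2("make", c("-C", shQuote(dir)))
                             }) {
  # Assumption: an explicit Makefile takes precedence when one is present.
  if (file.exists(file.path(path, "Makefile"))) {
    run_make(path)
    return(invisible("make"))
  }
  # Otherwise render every .Rmd found in vignettes/.
  rmds <- list.files(file.path(path, "vignettes"),
                     pattern = "\\.Rmd$", full.names = TRUE)
  for (f in rmds) render(f)
  invisible(basename(rmds))
}
```

On CI this could be run as the build step, for example `Rscript -e 'build_compendium(".")'` once the helper has been sourced or installed from a package.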

hadley commented 7 years ago

Also see usethis, where I've been pulling out all the use_* functions from devtools in a way that's easier to reuse.

noamross commented 7 years ago

I'm more of a "sensible defaults with config" person, but I really like the metadata in DESCRIPTION convention. e.g., Compendium: default, but also Compendium: remake.
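As a rough sketch of how tooling might read such a marker, base R's `read.dcf()` can pull a single field out of DESCRIPTION. The helper name `compendium_type()` is hypothetical, and the `Compendium` field is only the convention floated here; R itself does not recognise it.

```r
# Hypothetical helper: read a compendium marker from DESCRIPTION.
# The field name "Compendium" follows the convention suggested above;
# it is not a field that R or CRAN assigns any meaning to.
compendium_type <- function(path = ".", field = "Compendium") {
  desc <- file.path(path, "DESCRIPTION")
  if (!file.exists(desc)) return(NA_character_)
  # read.dcf() returns a character matrix; a missing field comes back as NA.
  value <- read.dcf(desc, fields = field)[1, 1]
  unname(value)
}
```

A DESCRIPTION containing `Compendium: remake` would give `"remake"`, and a project without the field gives `NA`, which a caller could treat as "not a compendium".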

I also just started wrapping @jennybc's tree implementation mostly for the purpose of showing project directory navigation in a README.

cboettig commented 7 years ago

@hadley brilliant. I love the idea of just leveraging DESCRIPTION metadata & use_travis() here. I've just added the line script: R -f test.R to .travis.yml to skip the R CMD build/CHECK stuff and run a tiny R script which calls rmarkdown on any .Rmd in the directory: https://github.com/cboettig/compendium

Yay! My students can now face the :joy: and :sob: and :fire: of travis builds on their very own homework assignments.

hadley commented 7 years ago

I think the place to start would be to try to use the existing Type field. If that doesn't break install.packages(), then you could have:

Type: compendium
Type: compendium/remake
# etc

Or maybe

Type: article

Figuring out how to cleanly parameterise travis builds in this way is useful enough that we should do it independently of the other compendium issues. We just need some simple convention + default that make this easily extensible.

Maybe we could have the value of the field be the name of the package, and then travis just installs that package then calls pkgname::build("."). What do you think @jimhester ?
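One way this could look, sketched in R under stated assumptions: the `Type` value is taken to be a package name, and that package is assumed to export a function called `build` that accepts the project directory. Neither convention exists yet, and `run_type_builder()` is a hypothetical name.

```r
# Hypothetical dispatcher: treat the DESCRIPTION `Type` value as a package
# name and call an exported function from it on the project directory.
# `entry = "build"` mirrors the pkgname::build(".") idea above.
run_type_builder <- function(path = ".", entry = "build") {
  desc <- file.path(path, "DESCRIPTION")
  type <- unname(read.dcf(desc, fields = "Type")[1, 1])
  if (is.na(type)) stop("DESCRIPTION has no Type field")
  if (!requireNamespace(type, quietly = TRUE)) {
    stop("Builder package '", type, "' is not installed")
  }
  # Look the entry point up among the package's exports, then call it.
  builder <- getExportedValue(type, entry)
  builder(path)
}
```

The install step is left out here; the sketch assumes the builder package is already available by the time it runs.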

gaborcsardi commented 7 years ago

> If that doesn't break install.packages(), then you could have:

Not only is R CMD install fine with this, R CMD check is "fine", too. Although it does nothing.

If you use TypeNote, then you can even run R CMD check on it. (I am not sure if you would always want to be able to, just an observation.)

cboettig commented 7 years ago

oh right, using a custom script: command in .travis.yml replaces R travis's call to install as well as to check. I think this is a minimal example that now both installs based on the DESCRIPTION file and knits any .Rmds: https://github.com/cboettig/compendium

Of course a utility that generated such a template .travis.yml and any additional test.R script based just on the Type: designation in DESCRIPTION would be way cool.

stephlocke commented 7 years ago

I did some work on this type of generation over the weekend (I've got a post scheduled on it). This custom travis file plus shell scripts generates each Rmd in a specific dir: https://github.com/lockedata/pres-stub

Dynamic file generation, such as generating DESCRIPTION and .travis.yml, is the next thing on the pRojects todo list for us to get to grips with, so that we can start producing files whose content is based on the user's input.

Having the use_* functions in a separate package will help, as we make extensive use of these in pRojects.

(Thanks for the kind words about pRojects @cboettig btw - very much appreciated!)

hadley commented 7 years ago

One problem with using Type is that it prevents a project from being both a package and a compendium.

@cboettig I think such a capability should be baked into travis. Auto-building a .travis.yml is going to be fragile.

jimhester commented 7 years ago

> oh right, using a custom script: command in .travis.yml replaces R travis's call to install as well as to check.

No it doesn't, you should be able to use the default install: step assuming you have a DESCRIPTION file, just override the script: step to do something other than run R CMD check, e.g. script: R -e 'pkgname::build()' in Hadley's example.

Pakillo commented 7 years ago

Hi. Independently of using vignettes or another folder to hold the Rmd reports, we have found it convenient to have two separate folders: one for preliminary stuff and one for the final report or manuscript. For example, we use an analyses folder to store all the exploratory data analyses, complete model runs with residual plots, etc., and then a manuscript folder in which the Rmd only includes the stuff that goes in the manuscript. I would find it very messy to have all the Rmds in the same folder. But that's just our experience, of course :)

For the record, here are other repos/templates for projects structured as R packages still not mentioned (I think):

hadley commented 7 years ago

I have summarised my takeaways from this discussion into a research compendia proposal (google doc). Comments welcome!

benmarwick commented 7 years ago

Some observations on directory naming practices of research compendia spotted in the wild:

| name of main analysis directory | n | sources |
|---|---|---|
| `analysis` | 4 | https://github.com/duffymeg/BroodParasiteDescription, https://github.com/cylerc/AP_SC, https://github.com/benmarwick/mjbtramp, https://github.com/benmarwick/ktc11 |
| `vignettes` | 3 | https://github.com/famuvie/ArchaeologicalFloors, https://github.com/benmarwick/1989-excavation-report-Madjebebe, https://github.com/sje30/eglen2015/ |
| `manuscript/s` | 2 | https://github.com/cboettig/nonparametric-bayes, https://github.com/benmarwick/Pleistocene-aged-stone-artefacts-from-Jerimalai--East-Timor |
| `vignettes/manuscript` | 1 | https://github.com/USEPA/LakeTrophicModelling/ |

It's a very small sample, but it seems that analysis and manuscript(s) are popular non-standard names for the core directory of the research compendium. This reflects the orientation of these compendia (and my interest) toward scholarly publication as the final product. This can be contrasted with other research contexts, such as reports for business applications, which I'm less aware of. But I guess that a manuscript directory doesn't make much sense for researchers in commercial settings.

Naming the main compendium directory analysis would seem to be a natural choice that makes sense for both academic and commercial research contexts.

hadley commented 7 years ago

Is there something more general than analysis but more specific than scripts? process/? activity/? task/?

Should it be a verb or a noun? (I think a noun because all the other directories are nouns). Should it be singular or plural? (This is harder because they're mostly singular apart from vignettes/).

OTOH scripts/ is nice because the directory name usually defines the type of its contents. OTOOH is a notebook a script? (yes?) Is a data file a script? (no)

karthik commented 7 years ago

compose?

jhollist commented 7 years ago

Thanks for the "in the wild" summary @benmarwick! "Wild" very definitely describes https://github.com/USEPA/LakeTrophicModelling/ and that ended up being a bit of a mess by the end. Wish I could say the decision to have a subfolder in vignettes was a conscious one, but it wasn't. Since that time I have used a separate manuscript folder.

cboettig commented 7 years ago

Maybe stating the obvious, but while this sample is great it's worth noting they aren't very independent. E.g. I think Megan's original layout was more representative of what I see people doing before we all, um, piled in https://github.com/duffymeg/BroodParasiteDescription/pull/1

Okay, so I'm all for establishing convention over configuration, but I'm not actually clear on why we need to specify a choice here. I see that using the existing vignettes saves config, but here it seems the same amount of config could just treat any nonstandard top level dir this way?
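A sketch of what "treat any nonstandard top level dir this way" could mean in practice. The list of standard subdirectories is the one given in Writing R Extensions; the helper name `compendium_dirs()` and the decision to skip hidden directories are assumptions for illustration.

```r
# Subdirectories that R gives meaning to in a source package
# (as listed in "Writing R Extensions").
standard_pkg_dirs <- c("R", "data", "demo", "exec", "inst", "man",
                       "po", "src", "tests", "tools", "vignettes")

# Hypothetical helper: every other visible top-level directory is
# treated as compendium content, whatever it happens to be called.
compendium_dirs <- function(path = ".") {
  dirs <- list.dirs(path, full.names = FALSE, recursive = FALSE)
  dirs <- dirs[!startsWith(dirs, ".")]  # skip .git, .Rproj.user, ...
  sort(setdiff(dirs, standard_pkg_dirs))
}
```

With this approach a project could call its directory analysis/, manuscript/, or anything else, and a build tool would pick it up without extra configuration.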

cboettig commented 7 years ago

One significant issue that isn't addressed in Hadley's excellent doc summary is publishing / sharing of output. I think this is one area that could benefit hugely from both more normative conventions and additional tooling.

Personally, I like to see the analysis/ dir (or whatever it is called) use github_document as the output format (it diffs and displays nicely on GitHub). For a final product like a manuscript Rmd, a PDF is probably more appropriate, but during development (or perhaps for purely supplementary material) it's nice to commit something more text-based and diff-able (and with no risk of code being cut off at the margin).

Even so, this approach has problems, or at least open questions about how to do it. The default of dumping output (.md, .pdf, etc.) into the same working dir as the input .Rmd does save new users lots of headaches about paths, but it also defies the convention of separating inputs and outputs, makes it harder to find relevant content (particularly with GitHub rendering .Rmd versions as well now: students coming from Jupyter click on these and then ask "but where are the figures??"), and complicates a manual version of make clean (e.g. deleting an 'output dir').
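A minimal sketch of keeping outputs apart from inputs, using the `output_format` and `output_dir` arguments of `rmarkdown::render()`. The helper names `render_to_output()` and `clean_output()` and the directory names `analysis` and `output` are assumptions for illustration only.

```r
# Hypothetical helper: render every .Rmd in `input_dir` as a github_document
# into a separate `output_dir`, so outputs never sit beside inputs.
# `render` is injectable for testing; the default is rmarkdown::render().
render_to_output <- function(input_dir = "analysis",
                             output_dir = "output",
                             render = rmarkdown::render) {
  dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
  # Use an absolute path so the result does not depend on the
  # working directory in effect while rendering.
  output_dir <- normalizePath(output_dir)
  rmds <- list.files(input_dir, pattern = "\\.Rmd$", full.names = TRUE)
  for (f in rmds) {
    render(f, output_format = "github_document", output_dir = output_dir)
  }
  invisible(output_dir)
}

# A manual "make clean" is then just removing that one directory.
clean_output <- function(output_dir = "output") {
  unlink(output_dir, recursive = TRUE)
}
```

The trade-off mentioned above still applies: relative paths inside the .Rmd files need care once outputs no longer land next to their inputs.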

(Aside to the RStudio team: it would also be easier to make github_document more normative if the option wasn't quite so buried in the RStudio menu!)

Contrast this situation to the case in R packages where we now have pkgdown as a nice way to share vignette output while keeping a clean package repo containing input and output separately.

noamross commented 7 years ago

@cboettig This, to me, gets back to the tension between "working" and "final" compendia. I like the solution of .md output files in the analysis/ directory, but more "final" documents in vignettes/ where they have other outputs, including pkgdown docs. The .md aren't quite "outputs", but working intermediates. Similarly, my working repositories usually end up having things like model objects saved as .rds files, which aren't ultimate outputs but are important to retain during the working phase so that they can be shared and inspected.

A bigger challenge in input/output is data/. The idea of data as output is not adequately addressed in a lot of workflow templates, and having that data documented and installable is great. But if you want to document and test both input and output data, that puts them in the same place, and it's not immediately obvious how to distinguish between the two.

(I think getting GitHub to render .Rmd files was a mistake, myself.)

cboettig commented 7 years ago

Just wanted to second the issue on data output. The Google doc comments reflect a lot of variation in how we view derived data:

Does it belong in data, or is that just for raw/input data?
Should every/most analyses be saving derived/tidy data, or is cleaned data just one more intermediate object to ignore?
Should we have a mechanism to save/share the (derived) data behind each figure?

Closing off my earlier comment about publishing/discovering final results, I agree with the general proposal that pkgdown provides a good solution for this; meanwhile, more intermediate / low-stakes products can be left in messier form as, say, github_document output in an analysis dir.

gshotwell commented 7 years ago

I'm a little late to the party on this one, but I wanted to sketch out the approach I took in easyMake in case it's helpful for the "working" and "final" compendia tension. easyMake reads the source files in an R project for input (like read.csv()) and output (like write.csv()) functions in order to automatically detect dependencies between the different files. It then uses this to produce a Makefile for that project.

I think this is a pretty promising approach to resolving the "working" and "final" tension that @noamross mentioned because so long as you break your analysis into scripts you should be able to generate a working Makefile from those scripts. This helps new users get started with Makefiles and also is a good way to take a look at the whole project and see how your workflow could be improved.

I'm not sure if easyMake is an optimal implementation of this approach but the idea of automatically detecting dependencies based on the IO functions might be worth incorporating into the research compendia packages.
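To make the idea concrete, here is a crude sketch of IO-based dependency detection in base R. It is not easyMake's actual implementation: the regex patterns, the helper names `extract_paths()`, `script_io()` and `make_rules()`, and the restriction to literal paths in read.csv() and write.csv() calls are all assumptions for illustration.

```r
# Crude, hypothetical patterns: a quoted path as the first argument of
# read.csv(), or as the second (optionally `file =`) argument of write.csv().
read_pattern  <- "read\\.csv\\(\\s*[\"']([^\"']+)[\"']"
write_pattern <- "write\\.csv\\([^,]+,\\s*(?:file\\s*=\\s*)?[\"']([^\"']+)[\"']"

# Return the unique captured paths for every match of `pattern` in `code`.
extract_paths <- function(code, pattern) {
  hits <- unlist(regmatches(code, gregexpr(pattern, code, perl = TRUE)))
  unique(sub(pattern, "\\1", hits, perl = TRUE))
}

# Files a script reads and writes, as far as the patterns can tell.
script_io <- function(script) {
  code <- paste(readLines(script, warn = FALSE), collapse = "\n")
  list(inputs  = extract_paths(code, read_pattern),
       outputs = extract_paths(code, write_pattern))
}

# One Makefile rule per script that writes something:
#   outputs: script inputs
#   <TAB>Rscript script
make_rules <- function(scripts) {
  rules <- vapply(scripts, function(s) {
    io <- script_io(s)
    if (length(io$outputs) == 0) return("")
    paste0(paste(io$outputs, collapse = " "), ": ",
           paste(c(s, io$inputs), collapse = " "),
           "\n\tRscript ", s)
  }, character(1), USE.NAMES = FALSE)
  rules[nzchar(rules)]
}
```

Something like `writeLines(make_rules(list.files(pattern = "\\.R$")), "Makefile")` would then write the rules out. Paths built at run time (for example with file.path() or paste()) are invisible to a scan like this, which is one reason it can only be a starting point.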

hadley commented 7 years ago

I've fleshed out some notes on what a reactive build system might look like for R: https://docs.google.com/document/d/1avYAqjTS7zSZn7JAAOZhFPkhkPvYwaPVrSpo31Cu0Yc/edit#. This started out like "make for R", but has ended up fairly far away, drawing just as much inspiration from shiny. Your comments are greatly appreciated!

noamross commented 7 years ago

Testing package development happening here: https://github.com/ropenscilabs/checkers
Review guide happening here: https://docs.google.com/document/d/1vSbT9dcGTeUYDvSHclr3U8fLAqKxQsSBk3DJdMaK5ak/edit#

benmarwick commented 7 years ago

Inspired by the discussion here, I recently worked with @danki, @MartinHinz and others at @isaakiel to make a start on an rrtools package for bootstrapping a research compendium: https://github.com/benmarwick/rrtools

We carefully reviewed the literature on best practices and tried to boil them down to a novice-friendly workflow. It's an opinionated approach, for sure, but it also gives options at key decision points that try to capture the diversity we see in the discussion above and elsewhere.

No doubt we've missed some variants, but we're happy to take suggestions to make it more broadly useful!

stefaniebutland commented 7 years ago

Blog post about this project: https://ropensci.org/blog/2017/06/20/checkers/

ropensci / unconf17

Revisiting the research compendium: testing, automation, and review #5