Data and scripts for "Root volume distribution of maturing perennial grasses revealed by correcting for minirhizotron surface effects"

This is a Stan-based analysis of the root volumes of perennial bioenergy grasses, as observed by minirhizotron imaging at the EBI Energy Farm (Urbana IL) between 2009 and 2014.

A manuscript describing this project is in review. Email chris@ckblack.org if you'd like a copy of the current draft.

Raw images and the pixel-by-pixel tracing data are not stored here -- those live on the DeLucia fileserver. The primary "raw" data in this repository are the WinRhizo datafiles, which contain total length/area/volume and average width for each image, plus summaries of each root that are currently thrown out before analysis. I will include images and raw traces in the full data+script package, which will be made available on Dryad at the time the manuscript is accepted.

The whole analysis is intended to be fully reproducible. If anything changes, whether in raw data or final figure presentation, running make in this directory should produce a fully updated version of the results.

Directory contents

data

Cleaned-up, finished, most authoritative versions of datasets. Everything here is generated by some scripted process, NEVER by hand-editing.

NOTE: Some contents of the stan/ subdirectory are not committed to Git, because the output from a full model run totals 845 MB and needs to be recreated from scratch every time the model updates.

Included in the Git repository: Run logs, CSV summaries of posterior means/CIs of model parameters.
Not included in the Git repository: 791 MB of Rdata files containing all HMC samples from each run, 51 MB of PNGs showing every diagnostic plot I could think of. I'm happy to share these with anyone who's interested -- email chris@ckblack.org if I haven't replaced this sentence with a link yet :)

figures

Graphics generated from the cleaned-up data.

Everything in the top level of figures/ is autogenerated from the authoritative dataset whenever the Makefile is run.
Everything in the figures/static/ subdirectory is a hand-generated one-off, with code used to generate it if available.

images

Static images for presentation/manuscript purposes: sample root images, screenshots, images of fieldwork, etc.

Makefile

Script for the Unix make utility, specifying how each component of the project depends on others and providing rules for how to automatically update each file when the files that it depends on have changed.

notes

Human-readable information. What I did, what I didn't do, reminders, to-do lists, etc.

operator-agreement

A sub-experiment asking "how similar are the data produced by different workers tracing the same images?" I'm now using these same images as a worker training battery.

This directory is not updated by the whole-project Make; there is a local Makefile instead. To rerun the operator agreement scripts, cd operator-agreement && make. See operator-agreement/ReadMe.md for more details.

protocols

Field maps, instructions for camera operators, tube installation schematics...

rawdata

Uncleaned data in the form it came to me: WinRhizo files, hand-compiled spreadsheets. If anything in here needs to change, it probably means we had to redo a lot of hours of work.

scripts

Tools to automate the rest of the analysis. Mostly written in R, some in bash.

stan

Scripts for hierarchical Bayesian inference on how root volume differs between crops and over time, written in the probabilistic programming language Stan. Also contains R and Bash scripts to handle the process of running the models on the IGB computing cluster or, with patience, on a sufficiently powerful laptop. TODO: Consolidate contents into scripts/?

tmp

Things I don't intend to keep but am not deleting just yet, e.g. logged debugging output. This directory is ignored by git, but needs to exist because some scripts write to it.

Installing & running

To run the analysis scripts you'll need:

A working C compiler and a UNIX toolchain with at least Bash, Make, sed, tr, and probably others. If you don't have these installed already, consult your favorite search engine or local expert.
R >=3.2, available from https://www.r-project.org

The following R packages and all their dependencies. To install them all at once, open an R session and issue these commands:

install.packages(c(
    "rstan", "dplyr", "tidyr", "forcats",
    "viridis", "plotrix", "cowplot", "devtools", "lmerTest", "csvy"))
devtools::install_github("infotroph/efrhizo", subdir="scripts/rhizoFuncs/")
devtools::install_github("infotroph/DeLuciatoR")
devtools::install_github("infotroph/ggplotTicks")

To rerun my analyses: Open a shell, cd to the root of the project directory, type make, and walk away for at least an hour, or much longer if your computer has fewer than 5 CPU cores. The whole run takes ~80 minutes, mostly CPU-bound, on my 8-core mid-2015 MacBook Pro (2.2 GHz i7).

To run individual components: See comments in scripts, usage in the Makefile, and, uh, probably ask me questions about the parts I forgot to document.

The general shape of the data cleanup pipeline is as follows:

Raw WinRhizo output lives in rawdata/ef*.txt, but beware that filename capitalization is not consistent.
frametot_collect.sh removes measurements of individual roots and gathers all whole-image totals into one file per year.
slurpcals.sh calculates pixel<->mm calibration from WinRhizo calibration files (stored as rawdata/calibs*.CAL).
estimate_offset.r estimates installation offsets (mm of tube projecting aboveground, needed to convert location-within-tube to depth-in-soil) for each day, using field measurements where available (rawdata/tube_offsets/*.csv) and the target installation offset of 22 cm where they are not.
cleanup.r strips out bad data as specified in manually-compiled censor lists (rawdata/censorframes*.csv). Most of these are images that are too low-quality to trace confidently (blurry, dark, obstructed by mud in the tube, etc).
tractorcore-cleanup.R cleans up and reshapes data from the deep-coring experiment (rawdata/Tractor-Core*.csv).
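The censoring step performed by cleanup.r can be pictured with a small shell sketch. This is not the actual cleanup.r logic: the function name drop_censored, the one-identifier-per-row censor file, and the assumption that the identifier is the first CSV column in both files are all hypothetical, and the demo rows are made up.

```shell
# Hypothetical sketch of censor-list filtering; NOT the real cleanup.r.
# drop_censored CENSOR_CSV DATA_CSV
# Prints DATA_CSV minus every row whose first field appears in the first
# column of CENSOR_CSV. The header row of DATA_CSV is always kept.
# Assumes CENSOR_CSV has at least a header line (the NR == FNR idiom
# misbehaves on a completely empty first file).
drop_censored() {
    awk -F, '
        # First file: remember each censored identifier, skipping the header.
        NR == FNR { if (FNR > 1) bad[$1] = 1; next }
        # Second file: keep the header and every row not on the censor list.
        FNR == 1 || !($1 in bad)
    ' "$1" "$2"
}

# Demo with invented identifiers and values.
demo_dir=$(mktemp -d)
printf 'img\ntube1_loc10\n' > "$demo_dir/censor.csv"
printf 'img,volume\ntube1_loc05,1.2\ntube1_loc10,0.4\ntube2_loc05,2.0\n' > "$demo_dir/frames.csv"
drop_censored "$demo_dir/censor.csv" "$demo_dir/frames.csv"
rm -r "$demo_dir"
```

The demo prints the header plus the two rows that are not on the censor list. The real censor lists in rawdata/censorframes*.csv may well use different columns, so treat the column positions here as placeholders.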

To fit Stan models to the clean rhizotron data:

"clean" data lives in data/stripped*.csv, but always gets passed to yet another cleanup script... see below. TODO: Fix this.
Each file in stan/*.stan defines one probability model for the distribution of root volume in the soil profile. See the comments at the top of each file for a more detailed description. When called with a particular dataset, Stan compiles this model into a working Hamiltonian Monte Carlo sampler and generates draws from the posterior distribution. The model presented in the manuscript is mctd_foursurf.stan.
Each *.stan should have a matching *.R that is responsible for gathering data, passing it to Stan, running the sampler, printing some summary statistics to the log, and saving the resulting samples in an Rdata file.
Every *.R script starts by calling scripts/stat-prep.R, which gathers all the clean rhizotron data into one data frame and does some minor re-shaping before passing it off to the run-specific script. This script either ought to do a lot more of what the run scripts do, or it ought to be folded into one of the upstream cleanup scripts.
All the Bash scripts in stan/*.sh are simple wrappers to loop over each sampling session calling the appropriate R script that calls the appropriate Stan model. My analyses all use mctd_foursurf.sh.
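For readers who have not opened the wrappers, the general shape of such a per-session loop might look like the sketch below. It is an illustration only, not the contents of mctd_foursurf.sh: the function names run_sessions and dry_run_fit, the session labels, and the idea that the R script takes the session as a single command-line argument are all assumptions.

```shell
# Hypothetical sketch of a per-session wrapper; NOT the real stan/*.sh.
# run_sessions RUNNER SESSION...
# Calls RUNNER once per session label and stops at the first failure.
run_sessions() {
    rs_runner=$1
    shift
    for rs_session in "$@"; do
        "$rs_runner" "$rs_session" || return 1
    done
}

# Stand-in for the real work: print the command a wrapper might run
# instead of actually launching R. The argument convention is assumed.
dry_run_fit() {
    echo "Rscript mctd_foursurf.R $1"
}

# Session labels here are placeholders.
run_sessions dry_run_fit session1 session2
```

Swapping dry_run_fit for a function that really invokes Rscript would turn the dry run into an actual run; check the real wrapper for the arguments the R scripts expect.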
After running Stan, extract samples from the saved Rdata file and plot/analyze them as desired. I use extractfits_mctd.R and plotfits_mctd.R respectively; beware that these scripts are tightly coupled to the exact output format of the mctd_foursurf model, and it will probably be a pain to use them on any of the other models in the directory.
To calculate differences between parameters from different sampling days, scripts/plot_chaindiffs.R loads up the full MC chains from all fitted models at once and computes quantiles on the elementwise differences between days for each parameter of interest. Be wary of this script for two reasons:
- It manipulates all the samples at once, so each calculation involves ~2M datapoints, and I wrote it with zero thought about efficiency. The current version is slow and memory-hungry.
- It is tightly coupled to the exact structure and naming conventions of the models saved by mctd_foursurf.R and will probably break if they change. TODO: fix this, possibly in a way that involves using the Bayesplot package.
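The arithmetic behind "quantiles on the elementwise differences" can be shown on a toy scale. The sketch below is not plot_chaindiffs.R: it assumes two plain-text files holding one posterior draw per line, aligned draw-for-draw and of equal length, and it uses a simple nearest-rank quantile. The function name diff_quantiles and the demo numbers are invented for illustration.

```shell
# Hypothetical sketch; NOT the real scripts/plot_chaindiffs.R.
# diff_quantiles FILE_A FILE_B P...
# Subtracts draw i of FILE_B from draw i of FILE_A, then prints the
# nearest-rank quantile of those differences for each probability P.
# Assumes plain decimal numbers (sort -n does not parse exponents).
diff_quantiles() {
    dq_a=$1
    dq_b=$2
    shift 2
    # Elementwise differences, sorted ascending.
    dq_sorted=$(paste -d, "$dq_a" "$dq_b" | awk -F, '{ print $1 - $2 }' | sort -n)
    for dq_p in "$@"; do
        printf '%s\n' "$dq_sorted" | awk -v p="$dq_p" '
            { v[NR] = $1 }
            END {
                k = int(p * NR)
                if (k < p * NR) k++    # ceiling of p * n
                if (k < 1) k = 1
                print p, v[k]
            }'
    done
}

# Demo with made-up draws for two sampling days.
demo_dir=$(mktemp -d)
printf '5\n7\n9\n4\n10\n' > "$demo_dir/day_a.txt"
printf '1\n2\n3\n4\n5\n' > "$demo_dir/day_b.txt"
diff_quantiles "$demo_dir/day_a.txt" "$demo_dir/day_b.txt" 0.025 0.5 0.975
rm -r "$demo_dir"
```

In R the same idea is roughly quantile(a - b, c(0.025, 0.5, 0.975)), although R's default quantile interpolation differs from the nearest-rank rule used here, so results on small samples will not match exactly.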
Be sure to use Stan >= 2.14: the model uses syntax that was not available before 2.13 (specifically vectorized calls to log_inv_logit), and 2.13 was affected by a subtle sampler bug that is fixed in 2.14.
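Before starting a long run it may be worth checking the installed Stan version against that 2.14 minimum. The sketch below shows one way to compare dotted version strings in plain shell. The helper name version_ge is made up, and the version string is hard-coded so the example runs without R; in practice it could come from rstan, as in the commented line.

```shell
# Hypothetical helper, not part of this repository.
# version_ge A B: succeed if dotted numeric version A is >= B.
# Missing fields count as 0, so 2.14 and 2.14.0 compare equal.
version_ge() {
    vg_a=$1
    vg_b=$2
    while [ -n "$vg_a" ] || [ -n "$vg_b" ]; do
        # Take the leading field of each version.
        vg_x=${vg_a%%.*}
        vg_y=${vg_b%%.*}
        [ -n "$vg_x" ] || vg_x=0
        [ -n "$vg_y" ] || vg_y=0
        if [ "$vg_x" -gt "$vg_y" ]; then return 0; fi
        if [ "$vg_x" -lt "$vg_y" ]; then return 1; fi
        # Drop the field just compared.
        case $vg_a in *.*) vg_a=${vg_a#*.} ;; *) vg_a= ;; esac
        case $vg_b in *.*) vg_b=${vg_b#*.} ;; *) vg_b= ;; esac
    done
    return 0
}

# With rstan installed, the version could be fetched like this:
#   installed=$(Rscript -e 'cat(rstan::stan_version())')
installed="2.14.0"   # hard-coded so the sketch runs without R
if version_ge "$installed" 2.14; then
    echo "Stan $installed is new enough"
else
    echo "Stan $installed is too old; need >= 2.14" >&2
fi
```

The comparison only handles purely numeric fields, so a version string with a suffix such as a release-candidate tag would need extra handling.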

I have run mctd_foursurf successfully on OS X 10.11.6 and Amazon Linux AMI release 2016.03, but have not tested it in Windows. The other models appear to run well on my machine, but I haven't tested them cross-platform and I have not validated their output carefully. Consider them work in progress!

Questions? Contact Chris Black: black11@illinois.edu, chris@ckblack.org, https://twitter.com/infotroph, or 503-929-9421.
