This method is described in the paper Enabling Inference for Context-Dependent Models of Mutation by Bounding the Propagation of Dependency, by Erick Matsen and Peter Ralph (2022) - preprint here.
Words and math describing the method are in the subdirectory writeups/
.
The method is implemented in an efficient way in pure R through the use of sparse matrices, making use of the fact that once certain structures are set up, the values of the relevant matrices can be computed using fast linear algebra. R functions implementing various aspects of the method are in three packages:
contextual
: the basic machinery for building generator matrices, transition matrices, and dealing with Tmer countssimcontext
: simulation from context-dependent modelscontextutils
: miscellaneous other utility functions used in this projectThe code structure and key functions are described in this file.
To install these, run
library(devtools)
install_github("petrelharp/context/contextual")
install_github("petrelharp/context/simcontext")
install_github("petrelharp/context/contextutils")
or else to do it from a local copy:
git clone https://github.com/petrelharp/context.git
cd context
Rscript -e 'library(devtools); install("contextual"); install("simcontext"); install("contextutils")'
Simulation requires some Bioconductor packages; see below for how to install those.
The general strategy for inference and visualization is:
json
filesRscript (scriptname) --help
for optionsThese are in the scripts/
directory.
The most useful ones are:
Computation:
sim-seq.R
- simulate from the modelcount-seq.R
- count Tmer occurrences from simulated datamake-genmat.R
- precompute generator matrices (for a large model this can be time-consuming, but the results can be reused in subsequent steps)fit-model.R
and fit-tree-model.R
- fit a model using maximum likelihoodcompute-resids.R
- compute the difference in number of Tmers seen in data to those predicted under a fitted modelmcmc-model.R
- find the posterior distribution of parameters using MCMC (note: can continue from previous MCMC runs)Visualization:
templated-Rmd.sh
- compile a Rmarkdown file into html using a particular data setsimulation.Rmd
- visualize Tmer counts in a data file and model fitcheck-sim.Rmd
- check simulation results match expected countsCompilation and comparison of different models:
gather-results.R
- compile information across many model fitting procedures (e.g., different window lengths)collect-params-results.R
- pull parameters from many json files into a tableFull analysis pipelines, from simulation to inference and visualization,
are implemented for several example models (see writeup for descriptions).
The first few are motivated by statstical physics, not DNA,
but serve as good examples.
In each directory are shell scripts (usually workflow.sh
) that demonstrate the workflow.
tasep
: "totally asymmetric simple exclusion process" - two states, which are 'empty' and 'occupied' slots for moving particlesising
: a "spin glass" model of up/down magnetic particlescpg
: a simple model of nucleotide evolution with an increased rate of CG -> TG
mutationstree-cpg
: the cpg model, but rather than inference using before-after observations, observations evolve in two sister taxashape
: a model of purifying selection on DNA shapeTo install the prerequisites separately:
install.packages(c("expm", "mcmc", "stringdist", "optparse", "jsonlite", "ape", "rmarkdown", "ggplot2", "pander"))
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("Biostrings", "IRanges", "S4Vectors"))