This package identifies outliers in gene expression data by building a consensus distribution from background datasets that are informed by an N-of-1 sample. See the Model Explanation section below, our paper in JCO, or our preprint for more information.
The model has undergone significant performance improvements and is no longer identical to the model in the paper, but it produces essentially identical results (Pearson r > 0.995).
This workflow takes gene expression data as input and outputs the following:
SAMPLE_UUID
├── _info
│   ├── _gelman-rubin.tsv
│   ├── _pearson_correlations.txt
│   ├── _pval_runs.tsv
│   └── _run_info.tsv
├── model.pkl
├── pvals.tsv
├── ranks.tsv
├── traceplot.png
├── weights.png
└── weights.tsv
model.pkl is saved only if -d is provided, and contains the model and trace. The model must be run with the --save-model flag, and the pickle can be retrieved via:
import pickle
with open(pkl_path, 'rb') as buff:
    data = pickle.load(buff)
model, trace = data['model'], data['trace']
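Once loaded, the trace can be inspected with standard PyMC3 utilities. This is a minimal sketch and assumes the pickled objects are ordinary PyMC3 model and trace objects:

import pymc3 as pm

# Per-variable posterior summaries and convergence diagnostics
print(pm.summary(trace))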
pip install gene-outlier-detection
outlier-detection --sample /data/tumor.hd5 \
--background /data/gtex.hd5 \
--name TCGA-OR-A5KV-01 \
--gene-list /data/drug-genes.txt \
--col-skip 5
This workflow has been tested on Ubuntu 18.04 and macOS, but it should also run on other Unix-based systems or under an Anaconda installation.
conda install -c anaconda hdf5
conda install theano
apt-get update && apt-get install -y libhdf5-serial-dev build-essential gcc
You may need to modify your ~/.theanorc to support a larger bracket depth for this model, although this is likely no longer necessary with the new model improvements:
[gcc]
cxxflags = -fbracket-depth=1024
In the following plate notation, G stands for Gene, and D stands for Dataset (background dataset).
The core of our method is a Bayesian statistical model for the N-of-1 sample’s gene expression. The model implicitly assumes that the sample’s gene expression can be approximated by a convex mixture of the gene expression of the background datasets. The coefficients of this mixture are shared across genes, much like a linear model in which each data point is the vector of expressions for a gene across the background datasets. In addition, we model expression for each gene from each background dataset as a random variable itself. This allows us to incorporate statistical uncertainty from the small sample sizes of certain background sets directly in the model, without drowning out any background set’s contribution through pooling.
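The following is a minimal PyMC3-style sketch of this idea. It is not the package's exact model: the variable names, priors, and placeholder data below are illustrative only.

import numpy as np
import pymc3 as pm

D, G = 3, 125                        # number of background datasets and genes
bg_means = np.random.rand(D, G)      # placeholder background expression (dataset x gene)
nof1_expr = np.random.rand(G)        # placeholder N-of-1 expression vector

with pm.Model() as model:
    # Mixture weights over background datasets, shared across all genes
    beta = pm.Dirichlet("beta", a=np.ones(D))
    # Per-dataset, per-gene expression modeled as a random variable
    x = pm.Normal("x", mu=bg_means, sigma=1.0, shape=(D, G))
    # Observation noise
    eps = pm.InverseGamma("eps", alpha=1.0, beta=1.0)
    # N-of-1 expression approximated by a convex mixture of background expression
    y = pm.Normal("y", mu=pm.math.dot(beta, x), sigma=eps, observed=nof1_expr)
    trace = pm.sample(1000, tune=1000)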
The model can be explored using Markov chain Monte Carlo (MCMC) to obtain samples for y (per gene) that approximate its posterior distribution. If we have an observed expression value for a gene of interest (from the N-of-1 cancer sample), we can compare it to the sampled values. The proportion of sampled values for y that are greater (or less) than the observed value is an estimate of the posterior predictive p-value for this expression value. The posterior predictive p-value can be seen as a measure of how much of an outlier the expression is given the expectations of the comparison set.
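As a minimal sketch of that calculation, where the names are hypothetical: y_draws stands for posterior samples of a gene's expression and y_obs for the observed N-of-1 value.

import numpy as np

def posterior_pvalue(y_draws: np.ndarray, y_obs: float) -> float:
    # Proportion of sampled expression values greater than the observed value
    return float(np.mean(y_draws > y_obs))

# Dummy example: an observation far in the right tail yields a small p-value,
# flagging the gene as a potential over-expression outlier
draws = np.random.normal(loc=5.0, scale=1.0, size=4000)
print(posterior_pvalue(draws, y_obs=8.0))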
This model has been drastically improved for performance by using vectorized random variables. The gene expression RVs are flattened so they can be passed a 1D vector of observations, then reshaped into a 2D matrix so the linear equation can be computed via a standard dot product. The expression RVs had difficulty fitting Student-T distributions due to the wide range of nu and were replaced with Normal distributions. There is also no longer any computational "hack" of pre-fitting the Student-T distributions to the expression values in the background and using those as prior values; the expression observations are now integrated directly with the rest of the model.
A graph of the new model, for 125 genes and 3 datasets with 48,500 values in the background dataset, is shown below:
The model requires two Sample by Gene matrices, one containing the N-of-1 sample and one containing samples to use as the background comparison set.
They should both contain the same set of genes, and the background dataset must contain at least one metadata column with labels that differentiate groups (e.g. tissue, subtype, experiment, etc.). Metadata columns need to be the left-most columns in the dataframe, followed by all genes. This allows a simple heuristic (--col-skip) to designate where the genes start without requiring the user to provide an additional input file.
Here, tissue is one group that is used to differentiate the samples. We would run with --col-skip=5, as there are 5 metadata columns before the start of the genes.
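As a minimal sketch of what such a background dataframe might look like (the column names, values, and HDF5 key below are illustrative, not a required schema):

import pandas as pd

# Hypothetical background matrix: 5 metadata columns first, then genes
background = pd.DataFrame({
    "id": ["GTEX-1", "GTEX-2", "GTEX-3"],
    "tissue": ["Adrenal", "Adrenal", "Liver"],
    "subtype": ["normal", "normal", "normal"],
    "sex": ["M", "F", "F"],
    "age": [54, 61, 47],
    "GENE1": [3.2, 2.9, 5.1],
    "GENE2": [0.1, 0.4, 0.2],
})

# With 5 metadata columns before the genes, this matrix would be run with
# --col-skip=5 and, for example, --group tissue
background.to_hdf("gtex.hd5", key="df", mode="w")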
A Docker container containing the program can be executed as follows:
docker run --rm -v $(pwd):/data jvivian/gene-outlier-detection \
outlier-model \
--sample /data/inputs/tumor.hd5 \
--background /data/inputs/gtex.hd5 \
--name=TCGA-OR-A5KV-01 \
--gene-list /data/inputs/drug-genes.txt \
--out-dir /data/outputs/ \
--col-skip=5
Explanation of the arguments used when running the program; an example invocation combining several of them follows this list.

--sample: Path to the N-of-1 sample matrix (e.g. /data/tumor.hd5)
--background: Path to the background matrix (e.g. /data/gtex.hd5)
--name: Name of the N-of-1 sample within the sample matrix
--out-dir: Output directory
--gene-list: Path to a text file listing the genes of interest, one per line
--col-skip: Number of metadata columns preceding the genes in the background matrix
--group: Metadata column used to differentiate the background groups (e.g. tissue)
--num-backgrounds: Maximum number of background datasets to include; backgrounds are added iteratively until the p-values converge or num-backgrounds is met.
--max-genes: Maximum number of genes from the gene list to run
--pval-convergence-cutoff: Pearson correlation cutoff between the p-values of successive runs used to declare convergence
--num-training-genes: Number of training genes used to fit the model when no --gene-list is provided
--tune: Number of tuning steps for the sampler
--disable-iter: Disables the iterative addition of backgrounds and runs once with all --num-backgrounds.
--save-model: Saves the fitted model and trace to model.pkl in the output directory
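For example, a run that also specifies the grouping column, caps the number of backgrounds, and saves the fitted model might look like the following (the flag values here are illustrative):

outlier-detection --sample /data/tumor.hd5 \
    --background /data/gtex.hd5 \
    --name TCGA-OR-A5KV-01 \
    --gene-list /data/drug-genes.txt \
    --out-dir /data/outputs/ \
    --col-skip 5 \
    --group tissue \
    --num-backgrounds 5 \
    --save-model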
A Toil version of the workflow is available here. It allows the model to be run on multiple samples at scale on a cluster or in the cloud, but it requires Python 2.7 and pip install pandas toil==3.19.0.
Toil is now Python 3 compliant, so this method of running the program isn't recommended until the workflow's Toil code has been updated.
If you leverage this method in your research, please cite our JCO paper.
Vivian, John, et al. "Bayesian Framework for Detecting
Gene Expression Outliers in Individual Samples."
JCO Clinical Cancer Informatics 4 (2020): 160-170.