
N-of-1 Gene Outlier Detection

For RNA-seq Expression Data


This package identifies outliers for gene expression data by building a consensus distribution from background datasets that are informed by an N-of-1 sample. See Model Explanation for more information, our paper in JCO, or our preprint.

The model has undergone significant performance improvements and is no longer identical to the model in the paper, but it produces essentially equivalent results (Pearson r > 0.995).

This workflow takes gene expression data as input and outputs the following:

SAMPLE_UUID
├── _info
│   ├── _gelman-rubin.tsv
│   ├── _pearson_correlations.txt
│   ├── _pval_runs.tsv
│   └── _run_info.tsv
├── model.pkl
├── pvals.tsv
├── ranks.tsv
├── traceplot.png
├── weights.png
└── weights.tsv
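
As a rough illustration of how the tabular outputs can be inspected (the exact column layout of pvals.tsv is an assumption here, not documented output), the files can be loaded with pandas:

import pandas as pd

# Per-gene posterior predictive p-values for the N-of-1 sample
pvals = pd.read_csv("SAMPLE_UUID/pvals.tsv", sep="\t", index_col=0)
print(pvals.head())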

Quickstart

  1. Install
    pip install gene-outlier-detection
  2. Download the prerequisite inputs
  3. Run the model
    outlier-detection --sample /data/tumor.hd5 \
        --background /data/gtex.hd5 \
        --name TCGA-OR-A5KV-01 \
        --gene-list /data/drug-genes.txt \
        --col-skip 5

Dependencies and Installation

This workflow has been tested on Ubuntu 18.04 and macOS, but should also run on other Unix-based systems or under an Anaconda installation.

  1. Python 3.6
  2. Docker if using the Docker version or Toil workflow version
  3. HDF5 library (if inputs are in HDF5 format)
    1. conda install -c anaconda hdf5
  4. C++ / GCC compiler for PyMC3's Theano. You have a couple of options:
    1. conda install theano
    2. apt-get update && apt-get install -y libhdf5-serial-dev build-essential gcc

You may need to modify your ~/.theanorc to support a larger bracket depth for this model, though this is likely no longer necessary with the new model improvements.

[gcc]
cxxflags = -fbracket-depth=1024

Model Explanation

In the following plate notation, G stands for Gene, and D stands for Dataset (background dataset).

The core of our method is a Bayesian statistical model for the N-of-1 sample's gene expression. The model implicitly assumes that the sample's gene expression can be approximated by a convex mixture of the gene expression of the background datasets. The coefficients of this mixture are shared across genes, much like a linear model in which each data point is the vector of expression values for a gene across the background datasets. In addition, we model the expression of each gene in each background dataset as a random variable itself. This allows us to incorporate the statistical uncertainty that comes from some background sets' small sample sizes directly in the model, without drowning out any background set's contribution through pooling.

The model can be explored using Markov chain Monte Carlo (MCMC) to obtain samples for y (per gene) that approximate its posterior distribution. If we have an observed expression value for a gene of interest (from the N-of-1 cancer sample), we can compare it to the sampled values. The proportion of sampled values for y that are greater (or lesser) than the observed value is an estimate of the posterior predictive p-value for this expression value. The posterior predictive p-value can be seen as a measure of how much of an outlier the expression is, given the expectations of the comparison set.
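
As a rough sketch of that calculation (the function name and inputs here are illustrative, not part of this package's API), the p-value can be computed directly from the MCMC samples:

import numpy as np

def posterior_pvalue(samples, observed):
    """Proportion of posterior samples for y that exceed the observed expression.

    Values near 0 or 1 flag the observation as an outlier relative to the
    consensus background distribution.
    """
    return float(np.mean(samples > observed))

# Example: posterior_pvalue(y_samples_for_gene, observed_expression)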

This model has been drastically improved for performance by using vectorized random variables. The gene expression RVs are flattened so they can be passed a 1D vector of observations, then reshaped into a 2D matrix so the linear equation can be computed via a standard dot product. The expression RVs had difficulty fitting Student's t distributions due to the wide range of nu, so they were replaced with Normal distributions. There is also no longer any computational "hack" of pre-fitting the Student's t distributions to the expression values in the background and using those as prior values; the expression observations are now directly integrated into the rest of the model.
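
For intuition, here is a minimal PyMC3 sketch of the mixture structure described above. The priors, variable names, and placeholder data are illustrative assumptions and do not reproduce the package's exact implementation:

import numpy as np
import pymc3 as pm
import theano.tensor as tt

n_datasets, n_genes = 3, 125

# Placeholder data: per-dataset, per-gene background expression and the N-of-1 sample.
# In the real model the background samples are observations that inform these RVs.
background = np.random.rand(n_datasets, n_genes)
nof1 = np.random.rand(n_genes)

with pm.Model() as model:
    # Convex mixture coefficients, shared across genes
    beta = pm.Dirichlet("beta", a=np.ones(n_datasets))

    # Expression of each gene in each background dataset as its own random variable
    x = pm.Normal("x", mu=background, sd=1.0, shape=(n_datasets, n_genes))

    # Expected N-of-1 expression per gene: convex combination of the background RVs
    mu = pm.Deterministic("mu", tt.dot(beta, x))

    # Likelihood of the N-of-1 sample's observed expression
    eps = pm.HalfNormal("eps", sd=1.0)
    y = pm.Normal("y", mu=mu, sd=eps, observed=nof1)

    trace = pm.sample(1000, tune=1000)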

A graph of the new model, for 125 genes and 3 datasets with 48,500 values in the background dataset, is shown below:

Defining Inputs

The model requires two Sample by Gene matrices, one containing the N-of-1 sample and one containing samples to use as the background comparison set. They should both contain the same set of genes and the background dataset must contain at least one metadata column with labels that differentiate groups (e.g. tissue, subtype, experiment, etc). Metadata columns need to be the left-most columns in the dataframe followed by all genes. This allows a simple heuristic (--col-skip) to designate where the genes start without needing the user to provide an additional input file.

For example, tissue might be the grouping column used to differentiate the samples. With 5 metadata columns before the start of the genes, we would run with --col-skip=5, as in the sketch below.
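
As a hypothetical illustration (the column names and values here are invented), a background dataframe with 5 metadata columns followed by gene columns could be constructed and saved like this:

import pandas as pd

# Five metadata columns on the left, gene columns afterwards, hence --col-skip 5
background = pd.DataFrame(
    {
        "id": ["GTEX-1", "GTEX-2"],
        "tissue": ["Adrenal", "Adrenal"],
        "sex": ["M", "F"],
        "age": [55, 61],
        "dataset": ["gtex", "gtex"],
        "TP53": [4.2, 3.9],   # first gene column (index 5)
        "EGFR": [7.1, 6.8],
    }
)

# Writing to HDF5 requires PyTables; the key name here is an assumption.
background.to_hdf("gtex.hd5", key="data")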

Docker Container

A Docker container containing the program can be executed as follows:

docker run --rm -v $(pwd):/data jvivian/gene-outlier-detection \
        outlier-model \
        --sample /data/inputs/tumor.hd5 \
        --background /data/inputs/gtex.hd5 \
        --name=TCGA-OR-A5KV-01 \
        --gene-list /data/inputs/drug-genes.txt \
        --out-dir /data/outputs/ \
        --col-skip=5

Arguments

Explanation of arguments used when running the program.

Toil-version of Workflow

A Toil version of the workflow is available here. This allows the model to be run at scale on multiple samples on a compute cluster or in the cloud, but it requires Python 2.7 and pip install pandas toil==3.19.0.

Toil is now Python 3 compliant, so this method of running the program isn't recommended until the workflow's Toil code has been updated.

Citation

If you leverage this method in your research, please cite our JCO paper.

Vivian, John, et al. "Bayesian Framework for Detecting Gene Expression Outliers in Individual Samples." JCO Clinical Cancer Informatics 4 (2020): 160-170.