morrislab / pairtree

Pairtree is a method for reconstructing cancer evolutionary history in individual patients, and analyzing intratumor genetic heterogeneity. Pairtree focuses on scaling to many more cancer samples and cancer cell subpopulations than other algorithms, and on producing concise and informative interactive characterizations of posterior uncertainty.
MIT License
37 stars 11 forks source link

Pairtree

Pairtree infers the phylogeny underlying a cancer using genomic mutation data. Pairtree is particularly suited to settings with multiple tissue samples from each cancer, providing separate estimates of subclone frequency from each sample that constrain the set of consistent phylogenies. The algorithm consists of two phases:

  1. Compute pairwise relation tensor over all mutations (or clusters of mutations). This provides the probability over each of four possible evolutionary relationships between each pair of mutations A and B.

  2. Use the pairwise relation tensor to sample possible phylogenies, assigning a likelihood to each. As each phylogeny is sampled, Pairtree computes the subclone frequency of each mutation (or cluster of mutations) within the tree, balancing the need to fit the observed mutation data while still obeying tree constraints.

    The algorithm is described in Reconstructing complex cancer evolutionary histories from multiple bulk DNA samples using Pairtree (Wintersinger et al. (2022)).

    The practical aspects of using Pairtree are described in our STAR protocol paper Reconstructing cancer phylogenies using Pairtree, a clone tree reconstruction algorithm (Kulman et al. (2022)).

DOI

Table of Contents

  1. Installation
  2. Examples
  3. Pairtree Executables
  4. Pairtree Inputs
  5. Clustering mutations
  6. Pairtree outputs

Installing Pairtree

For more details on installation, please see here.

  1. Install dependencies. Installation is usually easiest if you use Mambaforge, which includes a recent version of Python alongside the Mamba package manager. As Mamba is a faster reimplementation of Conda, you can alternatively use Conda or Miniconda.

    Refer to requirements.txt for the full list of dependencies. These can be installed in a new environment with the following commands. If you're using Anaconda instead of Mamba, replace mamba with conda below.

    To install all requirements using mamba (or conda), first clone the Pairtree repository and then create a new virtual environment with the dependencies.

    git clone https://github.com/jwintersinger/pairtree
    mamba create --name pairtree --file requirements.txt

    Or install the dependencies in your current environment:

    git clone https://github.com/jwintersinger/pairtree   
    conda create -n pairtree --file requirements.txt --yes
    conda activate pairtree

    Pairtree has only been tested on Linux systems, but should work on any UNIX-like OS (including macOS).

  2. Download and build the C code required to fit subclone frequencies to the tree. This algorithm was published in Jia et al., and uses the authors' implementation with minor modifications.

    git clone https://github.com/jwintersinger/pairtree
    cd pairtree/lib
    git clone https://github.com/ethanumn/projectppm
    cd projectppm
    bash make.sh
  3. Download plotting requirements. Unfortunately, not all plotting requirements Pairtree uses are available via conda. Instead, we recommend downloading them via pip. (optional)

    pip3 install -r plot_requirements.txt

Examples

After installing Pairtree, you can test your installation using provided example data.

    PTDIR=$HOME/path/to/pairtree
    cd $PTDIR/example
    mkdir results && cd results
    # Run Pairtree.
    $PTDIR/bin/pairtree --params $PTDIR/example/example.params.json $PTDIR/example/example.ssm example.results.npz
    # Plot results in an HTML file.
    $PTDIR/bin/plottree --runid example $PTDIR/example/example.ssm $PTDIR/example/example.params.json example.results.npz example.results.html
    # View the HTML file.
    open example.results.html

For a more detailed example of how to run Pairtree and interpret the results, please see our STAR Protocols paper Reconstructing cancer phylogenies using Pairtree, a clone tree reconstruction algorithm.

Pairtree executables

Pairtree provides several executable files to be used at different stages of your clone-tree-building pipeline. You can run each executable with the --help flag to get full usage instructions.

Input files

Pairtree requires two input files. See here for more details.

SSM file

Pairtree needs a .ssm (i.e., simple somatic mutation) file, which provides integer counts of the number of variant and total reads at each mutation locus in each tissue sample. The .ssm file is tab-delimited, with the first line serving as a header, and each subsequent line providing data for one mutation. You may wish to consult an example .ssm file.

The .ssm file contains the following fields:

Parameters file

Pairtree also needs a .params.json file, which provides parameters for the Pairtree run. You may wish to consult an example .params.json file.

The .params.json file denotes three objects:

Imputing read counts for mutations missing in some samples

When some of your mutations appear in only some of your samples, you must impute the total read count for those mutations in the samples where they're not present. While the variant read count should clearly be zero in such cases, the value the total read count should take is unclear. This task is important, however, because it informs Pairtree how confident you are that the variant is indeed not present in the sample -- an instance where you have zero variant reads amongst ten total reads indicates considerably less confidence than if you have zero variant reads for 1000 total reads.

To impute missing read counts, the best option is to refer to the original sequencing data (e.g., the BAM file) to determine how many reads were present at the locus in the given sample. As this task can be arduous, you may alternatively wish to take the mean read count across samples for that genomic locus, or the mean read count for all loci in that sample. More complex strategies that try to correct for variable depth across loci (e.g., because of GC bias) or differing depths across samples (e.g., because some samples were more deeply sequenced than others) are also possible.

Clustering mutations

Pairtree requires mutations be clustered into subclones. For an example of how to use Pairtree's clustering utilities, please see here. These clusters should be provided to Pairtree in the .params.json file using the clusters array, described above. Pairtree works best with thirty or fewer subclones, but can build (less accurate) clone trees for 100 subclones or more.

To build clusters, you have three options.

  1. Don't actually cluster mutations into subclones. That is, instead of building a clone tree, you can build a mutation tree instead in which each cluster contains only a single variant. You must still specify the clusters in your .params.json file, but you simply list a separate cluster for each variant. This works best for a small number of mutations (30 or fewer), and so is most suitable for sequencing data obtained from WES or targetted sequencing rather than WGS.

  2. Use one of many published clustering methods, such as PyClone-VI. You must translate these methods' outputs into Pairtree's format, but this is typically easy.

  3. Use Pairtree to cluster the variants.

Pairtree provides two clustering models. As Pairtree's focus is not on variant clustering, its clustering algorithms are less well-validated than its tree building, and so you may obtain better results with other algorithms. Nevertheless, Pairtree's authors have had reasonably good success with its cluster-building methods. Pairtree's two clustering models both use a Dirichlet process mixture model (DPMM), but differ in how they use the model.

You can use Pairtree to cluster variants through the bin/clustervars executable. Run with --help to obtain full usage instructions. Several options are of particular interest.

Clustering results are sensitive to the model and parameters used. The best choice of parameters depends on the nature of your data, including the number of variants, read depth, and other characteristics. It is often prudent to try a range of different parameters, plot the results for each using the bin/plotvars executable, and select the clustering that best represents your data. The best clustering can depend on your application -- for some uses, you may want more granular clusters that produce a detailed tree with many subclones, while in other uses, you may want to collapse together similar subclones to create a simpler tree in which it is easier to discern major structural differences.

Interpreting and manipulating Pairtree output

See here for further details on interpretting Pairtree's outputs.

Description of the .results.npz file format

Here we provide a brief description of the output file (.npz) generated by Pairtree. The .npz file type is a zipped archive consisting of a set of key, value pairs. The numpy.load function can be used to load the contents of an .npz file (see the numpy.load docs).

When the pairtree program completes its execution it saves a .npz file with details of the trees it found, it's parameter setup, and the results for some of the important computations it performed. We detail below some of the most important data Pairtree stores in the .npz file:

It's important to note that all of these lists are sorted in descending order based on tree log-likelihood. Therefore, the data at index 0 in each of these lists corresponds to the data for the best fitting tree.

Understanding the html file format outputted by bin/plottree

The bin/plottree script uses the .npz file outputted by bin/pairtree to produce an html of plots and figures describing the results.

The title and description for each plot/figure are provided below.

Exporting Pairtree output to other formats

Fixing incorrect ploidy

There are cases of incorrect ploidy that are unable to be detected using the pairwise framework. This can result in an implied subclonal frequency being greater than 1. This is often due to missed copy number calls, such as an uncalled loss of heterozygosity (LOH) event which removes the wildtype allele in a lineage and leaves only the variant allele. For an example of how to use the util/fix_bad_var_read_prob.py script, please see here.

In order to find these mutations such that they can be corrected, we provide the script util/fix_bad_var_read_prob.py. There are several options for this script which are of particular interest.

The terminal output of this script will be in the following format:

num_bad_ssms: Number of variants which have an incorrect variant read probability estimate

bad_ssms: List of variant names which were found to have an incorrect variant read probability estimate

bad_samp_prop: Percent of variants out of all samples which were found to have an incorrect variant read probability estimate

bad_ssm_prop: Percent of variants which were found to have an incorrect variant read probability estimate in at least one sample


Removing garbage mutations

Pairtree relies on the infinite sites assumption (ISA) when converting mutation allele frequencies to subclonal frequencies, and building clone trees. Pairtree is also sensitive to other factors such as missed CNA calls, or technical noise which can skew the subclonal frequency conversion. Therefore, it is necessary to detect and remove such mutations before building the clone tree. Pairwise relationships between mutations reveal both ISA violations and other types of erroneous mutations, which we refer to collectively as garbage mutations.

In order to find these mutations such that they can be added to our list of garbage mutations, we provide the script bin/removegarbage. There are several options for this script which are of particular interest.

Tweaking Pairtree options

Tweaking some of Pairtree's options can improve the quality of your results or reduce the method's runtime. The most relevant options are discussed below. For a full listing of tweakable Pairtree options, including algorithm hyperparameters, run pairtree --help.

Finding better trees by running multiple MCMC chains

Pairtree samples trees using MCMC chains. Like any optimization algorithm working with a non-convex objective, Pairtree's chains can become stuck in local optima. Running multiple MCMC chains helps avoid this, since each chain starts from a different point in space, and takes a different path through that space.

By default, Pairtree will run a separate MCMC chain for each CPU core (logical or physical) present. As these chains are run in parallel, using multiple chains does not increase runtime. For more details, please refer to the "Running computations in parallel to reduce runtime" section below.

Changing number of samples, burn-in, and thinning

Three options control the behaviour of each MCMC chain used to sample trees.

Running computations in parallel to reduce runtime

Pairtree can leverage multiple CPUs both when computing the pairwise relations tensor and when sampling trees. Exploiting parallelism when computing the tensor can drastically reduce Pairtree's runtime, since every pair can be computed independently of every other. Conversely, exploiting parallelism when sampling trees won't necessarily improve runtime, since trees are sampled using MCMC, which is an inherently serial process, While no single MCMC chain can be run in parallel, however, you can run multiple independent chains in parallel, as discussed in the previous section. This will help Pairtree avoid local optima and produce better-quality results.

By default, Pairtree will run in parallel on all present CPUs. (On hyperthreaded systems, this number will usually be inferred to be twice the number of physical CPU cores.) You can change this behaviour by specifying the --parallel=N option, causing Pairtree to use N parallel processes instead. When computing the pairwise tensor, Pairtree will compute all pairs in parallel, such that computing the full tensor using N parallel processes should only take 1/N as much time as computing with a single process. Later, when sampling trees, Pairtree will by default run as many independent MCMC chains as there are parallel processes. In this scenario, if you have N CPUs, running only a single chain will be no faster than running N chains, since each chain executes concurrently. However, in the N-chain case, you will obtain N times as many samples, improving your result quality.

You can explicitly specify --tree-chains=M to run M chains instead of the default, which is to run a separate chain for each parallel process. If M is greater than the number of parallel processes N, the chains will execute serially, increasing runtime. (Given the serial execution, you should expect that tree sampling with M > N will take ceiling(M / N) times as long as

Fitting subclone frequencies to trees

Critical to the Pairtree algorithm is the step of fitting subclone frequencies to trees. Beyond the user's interpretation of subclone frequencies for downstream analysis of Pairtree's results, the subclone frequencies will also affect tree structure -- the data likelihood for a tree is determined by how well the tree-constrained subclone frequencies fit the observed VAF data. We provide several different algorithms for computing these frequencies that strike different balances between computational speed and result accuracy.

Given the tradeoff between speed and accuracy, rprop is recommended for small trees (e.g., ten subclones or fewer), while projection is more suitable for larger trees. If Pairtree seems to be producing poor results with projection, try with rprop instead, as optimizing the exact objective rather than an approximation may produce better trees.

When running with --phi-fitter=rprop or --phi-fitter=proj_rprop, you can adjust the number of iterations the rprop algorithm uses by specifying the --phi-iterations=N parameter. Runtime of rprop will be proportional to this parameter, so smaller values will decrease runtime, at the potential cost of accuracy of subclone frequencies.

While other options are provided (--phi-fitter=graddesc_old and --phi-fitter=rprop_old), these remain only as artifacts and will be removed in future verisons, and so should not be used.