Pairtree infers the phylogeny underlying a cancer using genomic mutation data. Pairtree is particularly suited to settings with multiple tissue samples from each cancer, providing separate estimates of subclone frequency from each sample that constrain the set of consistent phylogenies. The algorithm consists of two phases:
Compute pairwise relation tensor over all mutations (or clusters of
mutations). This provides the probability over each of four possible
evolutionary relationships between each pair of mutations A
and B
.
Use the pairwise relation tensor to sample possible phylogenies, assigning a likelihood to each. As each phylogeny is sampled, Pairtree computes the subclone frequency of each mutation (or cluster of mutations) within the tree, balancing the need to fit the observed mutation data while still obeying tree constraints.
The algorithm is described in Reconstructing complex cancer evolutionary histories from multiple bulk DNA samples using Pairtree (Wintersinger et al. (2022)).
The practical aspects of using Pairtree are described in our STAR protocol paper Reconstructing cancer phylogenies using Pairtree, a clone tree reconstruction algorithm (Kulman et al. (2022)).
For more details on installation, please see here.
Install dependencies. Installation is usually easiest if you use Mambaforge, which includes a recent version of Python alongside the Mamba package manager. As Mamba is a faster reimplementation of Conda, you can alternatively use Conda or Miniconda.
Refer to requirements.txt for the full list of
dependencies. These can be installed in a new environment with the following
commands. If you're using Anaconda instead of Mamba, replace mamba
with
conda
below.
To install all requirements using mamba
(or conda
), first clone
the Pairtree repository and then create a new virtual environment with the
dependencies.
git clone https://github.com/jwintersinger/pairtree
mamba create --name pairtree --file requirements.txt
Or install the dependencies in your current environment:
git clone https://github.com/jwintersinger/pairtree
conda create -n pairtree --file requirements.txt --yes
conda activate pairtree
Pairtree has only been tested on Linux systems, but should work on any UNIX-like OS (including macOS).
Download and build the C code required to fit subclone frequencies to the tree. This algorithm was published in Jia et al., and uses the authors' implementation with minor modifications.
git clone https://github.com/jwintersinger/pairtree
cd pairtree/lib
git clone https://github.com/ethanumn/projectppm
cd projectppm
bash make.sh
Download plotting requirements. Unfortunately, not all plotting requirements Pairtree uses are available via conda. Instead, we recommend downloading them via pip. (optional)
pip3 install -r plot_requirements.txt
After installing Pairtree, you can test your installation using provided example data.
PTDIR=$HOME/path/to/pairtree
cd $PTDIR/example
mkdir results && cd results
# Run Pairtree.
$PTDIR/bin/pairtree --params $PTDIR/example/example.params.json $PTDIR/example/example.ssm example.results.npz
# Plot results in an HTML file.
$PTDIR/bin/plottree --runid example $PTDIR/example/example.ssm $PTDIR/example/example.params.json example.results.npz example.results.html
# View the HTML file.
open example.results.html
For a more detailed example of how to run Pairtree and interpret the results, please see our STAR Protocols paper Reconstructing cancer phylogenies using Pairtree, a clone tree reconstruction algorithm.
Pairtree provides several executable files to be used at different stages of
your clone-tree-building pipeline. You can run each executable with the
--help
flag to get full usage instructions.
bin/clustervars
: cluster variants into subclones suitable for building
clone trees with bin/pairtree
. Please see the Clustering
mutations section for full instructions.
bin/plotvars
: plot information about your variant clusters before
proceeding with your Pairtree run. This step is optional. If
--plot-relations
is specified, the pairwise relations between subclones
will be computed and plotted, but at a considerable computational cost.
bin/pairtree
: the main Pairtree executable, used for building clone trees.
See the Input files and Tweaking Pairtree
options sections for more information specific
to this step.
bin/plottree
: plot results of your Pairtree run. This will plot a single
clone tree, which by default will be the one with highest likelihood. See
Interpreting Pairtree output for details.
bin/summposterior
: summarize the posterior distribution over clone trees
from your Pairtree run by representing this distribution as a graph, and by
showing the best individual tree samples. See Interpreting Pairtree
output for details.
Pairtree requires two input files. See here for more details.
Pairtree needs a .ssm
(i.e., simple somatic mutation) file, which provides
integer counts of the number of variant and total reads at each mutation locus
in each tissue sample. The .ssm
file is tab-delimited, with the first line
serving as a header, and each subsequent line providing data for one mutation.
You may wish to consult an example .ssm file.
The .ssm
file contains the following fields:
id
: a unique ID used to identify the variant. The first mutation should
be assigned ID s0
, with subsequent variants assigned s1, s2, ...
.
Generally, these should be contiguous, but this is not enforced, so you
should be able to delete variants from this file without adjusting
subsequent IDs. The s
prefix indicates that the ID corresponds to an
SSM.
name
: a text string used to label variants in Pairtree's output. This
can be a gene name, genome position (e.g., 1_12345
may be used to
represent that the variant occurred on chr1
at position 12345
), or any
other string. Spaces may be used, and uniqueness of name
s is not
enforced (though duplicate names may lead to confusing output).
var_reads
: a comma-separated vector of how many whole-number-valued
genomic reads at this mutation's locus corresponded to the variant allele
in each tissue sample. Tissue samples may be provided in any order, so long
as this order is consistent for all mutations. The names of the associated
tissue samples are given in the .params.json
file, detailed below. If
running with only a single tissue sample, this field contains only a single
scalar value for each mutation. If a variant has been called in only a
subset of your tissue samples, consult the Imputing read
counts
section.
total_reads
: a comma-separated vector of the total number of
whole-number-valued genomic reads observed at this mutation's locus in each
tissue sample. Thus, the variant allele frequency (VAF) is given by
var_reads / total_reads
. If a variant has been called in only a subset of
your tissue samples, consult the Imputing read
counts
section.
var_read_prob
: a comma-separted vector of probabilities \omega_{j1}, \omega_{j2}, ..., \omega_{js}, ..., \omega_{jS}
, where S
is the total
number of tissue samples. The value \omega_{js}
indicates the probabilty
of observing a read from the variant allele for mutation j
in a cell
bearing the mutation. Thus, if mutation j
occurred at a diploid locus,
we should have \omega_{js} = 0.5
. In a haploid cell (e.g., male sex
chromosome), we should have \omega_{js} = 1.0
. If a copy-number
aberration (CNA) duplicated the reference allele in the lineage bearing
mutation j
prior to j
occurring, there will be two reference alleles
and a single variant allele in all cells bearing j
, such that
\omega_{js} = 0.3333
. Thus, these values give Pairtree the ability to
correct for the effect CNAs have on the relationship between VAF (i.e.,
the proportion of alleles bearing the mutation) and subclonal frequency
(i.e., the proportion of cells bearing the mutation).
Pairtree also needs a .params.json
file, which provides parameters for the Pairtree run.
You may wish to consult an example .params.json file.
The .params.json
file denotes three objects:
samples
(required): list of sample names, provided in the same order as
entries in the var_reads
and total_reads
vectors in the .ssm
file. Each
sample name can be an arbitrary string.
clusters
(required): list-of-lists denoting variant clusters, each of which
corresponds to a subclone. Each sub-list contains strings corresponding to
the IDs of all variants that belong to a given cluster. All variants
belonging to a cluster are assumed to share the same subclonal frequency in a
given tissue sample, with the VAFs of each variant serving as a noisy
estimate of this subclonal frequency after using var_read_prob
to correct
for ploidy. See the section on clustering mutations
for instructions on how to use Pairtree to build clusters. Pairtree assumes
that all variants within a cluster share the same subclonal frequencies
across all cancer samples. Ploidy-corrected estimates of these frequencies
can be obtained by dividing a variant's VAF by its variant read probability
`omega.
garbage
(optional): list of IDs in the .ssm
file that you do not want
Pairtree to use in building your clone tree. Specifying variants as garbage
can be useful when you want to exclude variants without having to remove them
from the .ssm
file, allowing you to easily re-introduce them to your clone
tree in the future. Typically, you might remove a variant because you've
determined that it violates the infinite sites assumption (ISA); because you
think the associated read count data is likely erroneous; or because the
variant was subject to a complex copy-number configuration that skews the
relationship between VAF and subclonal frequency (e.g., the variant appears
in different copy-number states in different subclones).
When some of your mutations appear in only some of your samples, you must impute the total read count for those mutations in the samples where they're not present. While the variant read count should clearly be zero in such cases, the value the total read count should take is unclear. This task is important, however, because it informs Pairtree how confident you are that the variant is indeed not present in the sample -- an instance where you have zero variant reads amongst ten total reads indicates considerably less confidence than if you have zero variant reads for 1000 total reads.
To impute missing read counts, the best option is to refer to the original sequencing data (e.g., the BAM file) to determine how many reads were present at the locus in the given sample. As this task can be arduous, you may alternatively wish to take the mean read count across samples for that genomic locus, or the mean read count for all loci in that sample. More complex strategies that try to correct for variable depth across loci (e.g., because of GC bias) or differing depths across samples (e.g., because some samples were more deeply sequenced than others) are also possible.
Pairtree requires mutations be clustered into subclones. For an example of how to use
Pairtree's clustering utilities, please see here. These clusters should
be provided to Pairtree in the .params.json
file using the clusters
array,
described above. Pairtree works best with thirty or fewer subclones, but can
build (less accurate) clone trees for 100 subclones or more.
To build clusters, you have three options.
Don't actually cluster mutations into subclones. That is, instead of
building a clone tree, you can build a mutation tree instead in which
each cluster contains only a single variant. You must still specify the
clusters in your .params.json
file, but you simply list a separate cluster
for each variant. This works best for a small number of mutations (30 or
fewer), and so is most suitable for sequencing data obtained from WES or
targetted sequencing rather than WGS.
Use one of many published clustering methods, such as PyClone-VI. You must translate these methods' outputs into Pairtree's format, but this is typically easy.
Use Pairtree to cluster the variants.
Pairtree provides two clustering models. As Pairtree's focus is not on variant clustering, its clustering algorithms are less well-validated than its tree building, and so you may obtain better results with other algorithms. Nevertheless, Pairtree's authors have had reasonably good success with its cluster-building methods. Pairtree's two clustering models both use a Dirichlet process mixture model (DPMM), but differ in how they use the model.
You can use Pairtree to cluster variants through the bin/clustervars
executable. Run with --help
to obtain full usage instructions. Several
options are of particular interest.
ssm_fn
: Input SSM file with mutations.
in_params_fn
: Input params file listing sample names and any existing garbage mutations.
out_params_fn
: Output params file with cluster assignments for each mutation (excluding those listed in garbage mutations).
--model linfreq
: This uses the default "lineage frequency" clustering model.
This is the most computationally efficient choice, and is the most
straightforwrd application of the DPMM to the clustering problem.
--model pairwise
: This uses the alternative "pairwise relationships"
clustering model. In some instances, it may produce better results than
linfreq
. This model is typically slower to use, as it requires computing
pairwise relationships between mutation pairs. For M
mutations, there are
(M choose 2)
such pairs, and so this may become frustratingly slow for more
than 1000 mutations.
--prior <value>
. This is used only with --model pairwise
. This option
specifies the prior probability for the coclustering relationship. The
default of 0.25
corresponds to a uniform relationship over the four
relationship types (i.e., coclustering, A
is ancestral to B
, B
is
ancestral to A
, and A
and B
are on different branches). Higher values
typically result in fewer, larger clusters. Valid values range between 0 and 1.
--concentraton <value>
. This is the log_10{alpha}
used in the DPMM.
Larger values typically result in more, smaller clusters.
Clustering results are sensitive to the model and parameters used. The best
choice of parameters depends on the nature of your data, including the number
of variants, read depth, and other characteristics. It is often prudent to try
a range of different parameters, plot the results for each using the
bin/plotvars
executable, and select the clustering that best represents your
data. The best clustering can depend on your application -- for some uses, you
may want more granular clusters that produce a detailed tree with many
subclones, while in other uses, you may want to collapse together similar
subclones to create a simpler tree in which it is easier to discern major
structural differences.
See here for further details on interpretting Pairtree's outputs.
Here we provide a brief description of the output file (.npz
) generated by Pairtree. The .npz
file type is a zipped archive consisting of a set of key, value pairs. The numpy.load
function
can be used to load the contents of an .npz
file (see the numpy.load docs).
When the pairtree
program completes its execution it saves a .npz
file with details
of the trees it found, it's parameter setup, and the results for some of the important computations it performed.
We detail below some of the most important data Pairtree stores in the .npz
file:
struct
: a list of parents
vectors sorted by the log-likelihood of how well the tree described by the parents
vector fits the data. See the following source code to see how to generate an equivalent adjacency matrix from a parents
vector.
count
: a list containing the number of times a tree (or equivalently a parents
vector) was found during tree search. The number at index i in the count
list corresponds to the number of times the tree at index i in the struct
list was found during tree search.
phi
: a list of matrices, where the matrix at index i contains the tree constrained subclonal frequency fit to the tree at index i in the struct
list. Each row of the phi matrix corresponds to a cluster, and each column corresponds to its tree constrained subclonal frequencing in a particular sample.
llh
: a list containing the tree log-likelihoods. The value at index i in the llh
list corresponds to the log-likelihood of the tree at index i in the struct
list.
prob
: a list containing the tree posterior probabilities. The value at index i in the prob
list corresponds to the posterior probability of the tree at index i in the struct
list.
It's important to note that all of these lists are sorted in descending order based on tree log-likelihood. Therefore, the data at index 0 in each of these lists corresponds to the data for the best fitting tree.
bin/plottree
The bin/plottree
script uses the .npz file outputted by bin/pairtree
to produce an html of plots and figures describing the results.
The title and description for each plot/figure are provided below.
Tree
: shows a graphical representation of the clonal evolution tree based on the
variant clusters provided to bin/pairtree
in the parameter file. These clusters are each labeled as their own
subclone population (S0
, S1
, ...). There are three different additional values provided with the tree.
tidx
is the tree index of which tree from the .npz file is being displayed.
nlglh
is the negative log-likelihood of the tree.
prob
is posterior softmax probability of the tree.
Population frequencies
: shows the ratio of optimal subclone population frequencies when constrained by the proposed tree for each
sample specified in the .params.json
file provided to bin/pairtree
.
There are two additional values for each sample: Clone diversity index
and Clone and mutation diversity index
.
The Clone diversity index
is the entropy of subclone population frequency for a given sample.
This is a value from 0 to 1, where 1 means a sample has many different subclone populations (and as a result, a higher entropy), and 0 means a sample has no diversity (or a single subclone, and a lower entropy). The Clone and mutation diversity index
is the joint entropy of the subclone population frequency and the number of mutations present in a clone.
Tree-constrained subclonal frequencies
: shows the optimal subclone population frequency when constrained by the proposed tree.
Data-implied subclonal frequencies
: shows the implied subclone population frequency per sample based on the variant data provided.
Interleaved subclonal frequencies
: shows the tree-constrained subclonal frequencies
, as well as the error between the tree-constrained subclonal frequencies
and the data-implied subclonal frequencies
for each subclone in a given sample. The Total Error
value shown is the sum of all errors across subclones and samples.
Diversity Indices
: shows the Clone diversity index
, Clone and mutation diversity index
,
and Shannon diversity index
in bits for each sample. The Clone diversity index
, and
Clone and mutation diversity index
are described above. The Shannon diversity index
is
the entropy of the proportion of mutations in each cluster.
VAFs (corrected)
: shows a table of variant allele frequencies (VAFs) per sample for the following: tree-constrained subclones (Φ
), Supervariants
, Cluster members
, and Garbage
. The current selections for the table are highlighted in blue. The Φ
values are the optimal tree-constrained subclones VAFs (or pseudo-variants denoted by P1, ... ,P_n). The Supervariants
are the supervariants created for each cluster defined in the .params.json
file provided to Pairtree. The Cluster members
are the individual variants in each supervariant or cluster. The Garbage
variants are those listed as garbage in the .params.json
file provided to Pairtree. There is also a search bar which allows you to list comma delimited ID values that specify which IDs should be shown in the VAF table (e.g. s0,s1
will show only these two IDs if there is a match).
ML relations
: shows a matrix of the pairwise relationships between all of the different subclone populations. Each relationship is with respect to the subclone on the left hand side of the matrix, to the subclone listed on top of the matrix.
garbage
: shows a pairwise relationship matrix of which subclone populations have a garbage relationships.
cocluster
: shows a pairwise relationship matrix of which subclone populations are clustered together (or have a cocluster relationship).
A_B
: shows a pairwise relationship matrix of which subclone populations have an ancestral relationship to other subclone populations.
B_A
: shows a pairwise relationship matrix of which subclone populations have a descendant relationship to other subclone populations.
diff_branches
: shows a pairwise relationship matrix of which subclone populations are not on the same branch (i.e. are not ancestors or descendants of each other).
Cluster stats
: shows a table of statistics for each cluster. The columns are Cluster
, Members
, and Deviation
. The Cluster
column lists the ID number for the cluster or subclone. Members
lists the number of variants in a cluster. Deviation
lists the median absolute difference between the subclone frequency of the supervariant, and its related cluster of variants.
Use bin/summposterior
to summarize the posterior distribution over trees
and bin/plottree
to plot all details corresponding to an individual tree.
See the Pairtree executables section for details.
To export Pairtree's visualizations from bin/plottree
for use in
publications, try SVG Crowbar. All
figures Pairtree creates embedded in HTML web pages using the SVG figure
format, which is suitable for use at arbitrarily high resolutions (including
in print) because of its vector-based nature. Some of these figures are
rendered with the Plotly plotting library, though most use custom code that
ultimately invokes d3.js.
Alternatively, to export the subset of Pairtree's visualizations that use Plotly without having to open the
HTML plots in a web browser, try
Kaleido, which is a Plotly-related
library for exporting plots without using a browser (accomplished using an
embedded version of the Chromium web browser). Most Plotly-related calls in
Pairtree call plotly.io.to_html(fig, ...)
, which can be replaced by
fig.write_image
.
Unfortunately, for the other plots that don't use Plotly (such as the tree structure and subclonal frequency plots), there is no straightforward means of rendering them without using a web browser. If SVG Crowbar fails or is otherwise not a viable option, try printing the web page containing the plots to a PDF file, then opening the PDF in Inkscape. The SVG plts on the web page should be stored using a vector representation in the PDF, allowing you to copy them from the PDF into a discrete SVG file where they can be manipulated as vector images.
To export the data that bin/plottree
visualizes for your own analysis, pass
the --tree-json <filename>.json
option to bin/plottree
. This will export
the data for a given tree to the <filename>.json
file. For a tree with K
nodes (i.e., the root node representing normal tissue and K-1
cancerous
subclones) constructed from S
cancer samples, some of the data included
will be amongst the following.
phi
: tree-constrained subclonal frequencies as KxS
matrix in row-major formatphi_hat
: data-implied subclonal frequencies as KxS
matrix in row-major formateta
: population frequencies as KxS
matrix in row-major formatllh
: tree log likelihood in base enlglh
: tree log likelihood normalized to number of subclones and cancer samples in base 2, allowing comparison of trees across datasets to determine how many bits you're paying per subclone per cancer sampleprob
: posterior probability of treecount
: how many times this tree was sampled in Metropolis-Hastingsparents
: K-1
-length vector, where parents[i]
provides the parent of node i+1
samples
: names of cancer samples given in input (providing the names associated with each column of the phi
, phi_hat
, and eta
matrices)cdi
, cmdi
, sdi
: S
-length vectors providing the clonal dviersity index, clone and mutation diversity index, and Shannon diversity index for each sample. Refer to the Pairtree paper for definitions.If stderr
is redirected to a file (e.g., via bin/pairtree ... 2> stderr.log
), instead of rendering a progress bar, Pairtree will report its
progress by writing to stderr
a series of JSON objects, with one per line.
An example line follows:
{"desc": "Sampling trees", "count": 11457, "total": 12000, "unit": "tree", "started_at": "2021-11-01 02:45:37.226015", "timestamp": "2021-11-01 02:46:12.625482"}
There are cases of incorrect ploidy that are unable to be detected using the pairwise framework.
This can result in an implied subclonal frequency being greater than 1. This is often due to missed
copy number calls, such as an uncalled loss of heterozygosity (LOH) event which removes the wildtype
allele in a lineage and leaves only the variant allele. For an example of how to use the
util/fix_bad_var_read_prob.py
script, please see here.
In order to find these mutations such that they can be corrected, we provide the script util/fix_bad_var_read_prob.py
. There are several options for this script which are of particular interest.
--logbf-threshold
: Logarithm of Bayes factor threshold at which the haploid model is accepted as more likely model than the model using the provided var_read_prob.--ignore-existing-garbage
: Ignore any existing garbage variants listed in in_params_fn
and test all variants. If not specified, any existing garbage variants will be kept as garbage and not tested again.--action (add_to_garbage, modify_var_read_prob)
: Action to take after script has completed. add_to_garbage
will add the resulting 'bad_ssms' to the list of garbage samples in the params.json file provided. modify_var_read_prob
will overwrite the var_read_prob in the .ssm file with the value provided by the --var-read-prob-alt
option of this script.--var-read-prob-alt
: Value that will be used to overwrite the var_read_prob of any garbage variants found if --action
is set to modify_var_read_prob
.in_ssm_fn
: Input SSM file with mutations.in_params_fn
: Input params file listing sample names and any existing garbage mutations.out_ssm_fn
: Output SSM file with modified list of LOH mutations.out_params_fn
: Output params file with modified list of garbage mutations.The terminal output of this script will be in the following format:
num_bad_ssms
: Number of variants which have an incorrect variant read probability estimate
bad_ssms
: List of variant names which were found to have an incorrect variant read probability estimate
bad_samp_prop
: Percent of variants out of all samples which were found to have an incorrect variant read probability estimate
bad_ssm_prop
: Percent of variants which were found to have an incorrect variant read probability estimate in at least one sample
Pairtree relies on the infinite sites assumption (ISA) when converting mutation allele frequencies to subclonal frequencies, and building clone trees. Pairtree is also sensitive to other factors such as missed CNA calls, or technical noise which can skew the subclonal frequency conversion. Therefore, it is necessary to detect and remove such mutations before building the clone tree. Pairwise relationships between mutations reveal both ISA violations and other types of erroneous mutations, which we refer to collectively as garbage mutations.
In order to find these mutations such that they can be added to our list of garbage mutations, we provide the script bin/removegarbage
. There are several options for this script which are of particular interest.
--prior
: Pairwise garbage prior probability. The default value of 0.2 represents a uniform prior over pairwise relationship types.--max-garb-prob
: Maximum probability of a garbage relationship to permit for any pair in order for the algorithm to terminate.--ignore-existing-garbage
: Ignore any existing garbage variants listed in in_params_fn
and test all variants. If not specified, any existing garbage variants will be kept as garbage and not tested again.--pairwise-results
: Filename to store pairwise evidence in, which allows you to try garbage removal with different parameters without having to repeat the pairwise computations.in_ssm_fn
: Input SSM file with mutations.in_params_fn
: Input params file listing sample names and any existing garbage mutations.out_params_fn
: Output params file with modified list of garbage mutations.Tweaking some of Pairtree's options can improve the quality of your results or
reduce the method's runtime. The most relevant options are discussed below.
For a full listing of tweakable Pairtree options, including algorithm
hyperparameters, run pairtree --help
.
Pairtree samples trees using MCMC chains. Like any optimization algorithm working with a non-convex objective, Pairtree's chains can become stuck in local optima. Running multiple MCMC chains helps avoid this, since each chain starts from a different point in space, and takes a different path through that space.
By default, Pairtree will run a separate MCMC chain for each CPU core (logical or physical) present. As these chains are run in parallel, using multiple chains does not increase runtime. For more details, please refer to the "Running computations in parallel to reduce runtime" section below.
Three options control the behaviour of each MCMC chain used to sample trees.
Number of MCMC samples: by changing the --trees-per-chain
option, you can
make each chain sample more or fewer trees. The more samples each chain
takes, the more likely it is that those samples will be from a good
approximation of the true posterior distribution over trees permitted by your
data. By default, Pairtree will take 3000 total samples per chain.
Burn-in: each chain will take some number of samples to reach high-density
regions of tree space, such that those trees come from a good approximation
of the true posterior. The --burnin
parameter controls what proportion of
trees Pairtree will discard from the beginning of each chain, so as to avoid
including poor-quality trees in your results. By default, Pairtree discards
the first one-third of samples from each chain.
Thinning: MCMC samples inherently exhibit auto-correlation, since each
successive sample is based on a perturbation to the preceding one. We can
reduce this effect by taking multiple samples for each one we record. You can
change the --thinned-frac
option to control this behaviour. By default,
--thinned-frac=1
, such that every sample taken is recorded. By changing
this, for instance, to --thinned-frac=0.1
, Pairtree will only record every
tenth sample it takes. By recording only a subset of samples taken, the
computational burden associated with processing the results (e.g., by
computing summary statistics over the distribution of recorded trees) is
reduced, alongside the storage burden of writing many samples to disk.
Pairtree can leverage multiple CPUs both when computing the pairwise relations tensor and when sampling trees. Exploiting parallelism when computing the tensor can drastically reduce Pairtree's runtime, since every pair can be computed independently of every other. Conversely, exploiting parallelism when sampling trees won't necessarily improve runtime, since trees are sampled using MCMC, which is an inherently serial process, While no single MCMC chain can be run in parallel, however, you can run multiple independent chains in parallel, as discussed in the previous section. This will help Pairtree avoid local optima and produce better-quality results.
By default, Pairtree will run in parallel on all present CPUs. (On
hyperthreaded systems, this number will usually be inferred to be twice the
number of physical CPU cores.) You can change this behaviour by specifying the
--parallel=N
option, causing Pairtree to use N
parallel processes instead.
When computing the pairwise tensor, Pairtree will compute all pairs in
parallel, such that computing the full tensor using N
parallel processes
should only take 1/N
as much time as computing with a single process. Later,
when sampling trees, Pairtree will by default run as many independent MCMC
chains as there are parallel processes. In this scenario, if you have N
CPUs,
running only a single chain will be no faster than running N
chains, since
each chain executes concurrently. However, in the N
-chain case, you will
obtain N
times as many samples, improving your result quality.
You can explicitly specify --tree-chains=M
to run M
chains instead of the
default, which is to run a separate chain for each parallel process. If M
is
greater than the number of parallel processes N
, the chains will execute
serially, increasing runtime. (Given the serial execution, you should expect
that tree sampling with M > N
will take ceiling(M / N)
times as long as
Critical to the Pairtree algorithm is the step of fitting subclone frequencies to trees. Beyond the user's interpretation of subclone frequencies for downstream analysis of Pairtree's results, the subclone frequencies will also affect tree structure -- the data likelihood for a tree is determined by how well the tree-constrained subclone frequencies fit the observed VAF data. We provide several different algorithms for computing these frequencies that strike different balances between computational speed and result accuracy.
--phi-fitter=projection
: By default, Pairtree uses the "Efficient
Projection onto the Perfect Phylogeny Model" algorithm developed in Jia et
al. This uses a Gaussian approximation of
the binomial likelihood we're trying to maximize.
--phi-fitter=rprop
: Uses the rprop variant of gradient
descent. Separate step sizes
are maintained for each frequency scalar in each tissue sample. Those step
sizes are increased when the direction of a step is consistent with the
previous step taken, and reduced otherwise. Generally, this algorithm will
produce more accurate results than projection
(i.e., higher data
likelihoods for the same tree), since it is directly optimizing the binomial
objective rather than a Gaussian approximation, but at a higher computational
cost. The extra computational burden becomes greater with more subclones
(e.g., 30 or more), while being insignificant for small trees (e.g., with ten
subclones or fewer).
--phi-fitter=proj_rprop
: First run the default projection
algorithm, then refine its
results using rprop
. While this can produce better subclone frequency values
at the cost of more computation, it has in testing demonstrated little
benefit to accuracy.
Given the tradeoff between speed and accuracy, rprop
is recommended for small
trees (e.g., ten subclones or fewer), while projection
is more suitable for
larger trees. If Pairtree seems to be producing poor results with projection
,
try with rprop
instead, as optimizing the exact objective rather than an
approximation may produce better trees.
When running with --phi-fitter=rprop
or --phi-fitter=proj_rprop
, you can
adjust the number of iterations the rprop
algorithm uses by specifying the
--phi-iterations=N
parameter. Runtime of rprop
will be proportional to this
parameter, so smaller values will decrease runtime, at the potential cost of
accuracy of subclone frequencies.
While other options are provided (--phi-fitter=graddesc_old
and
--phi-fitter=rprop_old
), these remain only as artifacts and will be removed
in future verisons, and so should not be used.