Pairtree is a method for reconstructing cancer evolutionary history in individual patients, and analyzing intratumor genetic heterogeneity. Pairtree focuses on scaling to many more cancer samples and cancer cell subpopulations than other algorithms, and on producing concise and informative interactive characterizations of posterior uncertainty.
You have a good question! Ideally, SNV calls should be phased so that you can know whether they lie on the duplicated allele or not. In the absence of phasing information, you might be able to get a hint of whether the SNV lies on the duplicated allele by examining its VAF. If you assume that the CNA and SNV are both clonal, and that your tumour purity is p, then, if the SNV's VAF is equal to about (1/3)p, you know that var_read_prob should be 0.33, and that the SNV happened after the duplication. If the SNV's VAF is (2/3)p, you know the SNV happened before the duplication, and it was on the allele that was duplicated, so that var_read_prob should be 0.66. Even if the SNV is subclonal, if its VAF exceeds (1/3)p by a substantial amount, you should be able to conclude that var_read_prob should be 0.66.
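As a rough illustration of the rule of thumb above, here is a minimal sketch in Python. The function name `guess_var_read_prob` and the `tolerance` parameter are hypothetical and not part of Pairtree. The sketch assumes a single-copy duplication (three total copies at the locus), a clonal CNA, and a purity estimate you trust.

```python
def guess_var_read_prob(vaf, purity, tolerance=0.05):
    """Heuristic guess at var_read_prob for an SNV in a clonally duplicated region.

    Hypothetical helper, not part of Pairtree. Assumes three total copies at
    the locus and that `purity` is a reliable estimate of tumour purity.
    Returns 1/3, 2/3, or None when the VAF alone cannot settle the phasing.
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    if not 0 <= vaf <= 1:
        raise ValueError("vaf must be in [0, 1]")

    one_copy_vaf = purity / 3  # expected VAF for a clonal SNV on one of three copies

    # A VAF well above (1/3)p cannot come from a single variant copy,
    # even for a subclonal SNV, so the SNV must sit on the duplicated allele.
    if vaf > one_copy_vaf + tolerance:
        return 2 / 3

    # A VAF close to (1/3)p matches a clonal SNV that arose after the duplication.
    if abs(vaf - one_copy_vaf) <= tolerance:
        return 1 / 3

    # A lower VAF is ambiguous: the SNV is subclonal and either phasing fits.
    return None
```

For example, with a purity of 0.6, a VAF near 0.2 suggests `var_read_prob` of 1/3, a VAF near 0.4 suggests 2/3, and a VAF of 0.05 leaves the question open. The tolerance is arbitrary and should really depend on your read depth.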
If the CNA is subclonal, or if the SNV is subclonal and belongs to a clone with a small subclonal frequency, then the problem becomes more difficult. The correct solution would be for Pairtree to sum over the possible phasings of the SNV, but we don't currently do that.
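To make "sum over the possible phasings" concrete, here is a minimal sketch of what such a marginal likelihood could look like. This is not something Pairtree implements. It assumes a binomial read-count model in which the probability of a variant read is `var_read_prob` times the subclonal frequency `phi`, considers only the two phasings discussed above (1/3 and 2/3), and weights them equally by default. All function and parameter names are hypothetical.

```python
from math import comb


def binom_pmf(k, n, prob):
    """Binomial probability of k successes in n trials with success probability prob."""
    return comb(n, k) * prob**k * (1 - prob) ** (n - k)


def marginal_phasing_likelihood(var_reads, total_reads, phi,
                                omegas=(1 / 3, 2 / 3), priors=(0.5, 0.5)):
    """Likelihood of the read counts, summed over candidate phasings.

    Hypothetical sketch, not Pairtree code. Each entry of `omegas` is a
    candidate var_read_prob, and `priors` gives the prior weight of each.
    Assumes var_reads ~ Binomial(total_reads, omega * phi).
    """
    if len(omegas) != len(priors):
        raise ValueError("omegas and priors must have the same length")
    if abs(sum(priors) - 1.0) > 1e-9:
        raise ValueError("priors must sum to 1")
    return sum(
        weight * binom_pmf(var_reads, total_reads, omega * phi)
        for omega, weight in zip(omegas, priors)
    )
```

With a clonal SNV and high purity, one phasing will usually dominate the sum. With a small `phi`, both terms contribute, which is exactly the ambiguous case described above.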
In the Rehman et al. paper you mentioned, we deliberately filtered out variants lying in CNA-affected regions, and took only the diploid variants. That's the same strategy we used in Dobson et al. (https://cancerdiscovery.aacrjournals.org/content/10/4/568.abstract). In general, I think CNA calling in bulk sequencing data is a horrific mess, and so I try to avoid dealing with it if I can. :) Given uncertainties about phasing, I've resigned myself to just trying to work with diploid regions for most questions concerning cancer evolution.
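If you want to take the same diploid-only approach, the filtering step could look something like the sketch below. This is not the code used in either paper, and the record layout (`chrom`, `pos`, `start`, `end` keys) is a hypothetical one chosen for illustration, not Pairtree's input format. It assumes your CNA caller gives you a list of segments it considers non-diploid.

```python
def overlaps_cna(variant, cna_segments):
    """Return True if the variant's position falls inside any CNA segment.

    Hypothetical record layout: each variant is a dict with 'chrom' and 'pos',
    and each segment is a dict with 'chrom', 'start', and 'end' (inclusive).
    """
    return any(
        seg["chrom"] == variant["chrom"] and seg["start"] <= variant["pos"] <= seg["end"]
        for seg in cna_segments
    )


def keep_diploid_variants(variants, cna_segments):
    """Drop every variant lying in a CNA-affected region, keeping the rest in order."""
    return [v for v in variants if not overlaps_cna(v, cna_segments)]


# Usage with made-up coordinates:
variants = [
    {"id": "s0", "chrom": "1", "pos": 1_500},
    {"id": "s1", "chrom": "1", "pos": 9_000},
    {"id": "s2", "chrom": "2", "pos": 1_500},
]
cna_segments = [{"chrom": "1", "start": 1_000, "end": 2_000}]
diploid = keep_diploid_variants(variants, cna_segments)
```

The linear scan over segments is fine for a handful of CNA calls. For genome-wide segment lists you would probably want an interval tree or a sorted lookup instead.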
I hope this helps! Please let me know if you have any other questions.
Originally posted by @jwintersinger in https://github.com/morrislab/pairtree/issues/3#issuecomment-778545577