Hi, Aaron! - Githubissues

Hi, Aaron!

You have a good question! Ideally, SNV calls should be phased, so that you know whether each one lies on the duplicated allele. In the absence of phasing information, you may be able to get a hint from the SNV's VAF. Assume that the CNA and SNV are both clonal, and that your tumour purity is p. If the SNV's VAF is about (1/3)p, then the SNV is present on only one of the three copies, so var_read_prob should be 0.33; the SNV either happened after the duplication or lies on the allele that was not duplicated. If the SNV's VAF is about (2/3)p, then the SNV happened before the duplication, on the allele that was duplicated, so var_read_prob should be 0.66. Even if the SNV is subclonal, a VAF that exceeds (1/3)p by a substantial amount lets you conclude that var_read_prob should be 0.66, since an SNV on a single copy cannot produce a VAF that high.
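As a rough illustration, the VAF heuristic above could be sketched as follows. This is not part of Pairtree: the function name `guess_var_read_prob` and the `margin` parameter are hypothetical, and the value chosen for "a substantial amount" is an arbitrary assumption. It uses the same approximation as the text, namely VAF ≈ var_read_prob × p for a clonal SNV in a region with a clonal single-allele duplication.

```python
def guess_var_read_prob(vaf, purity, margin=0.05):
    """Guess var_read_prob for an SNV in a region where one allele is duplicated.

    Assumes the CNA is clonal and that VAF is roughly var_read_prob * purity.
    `margin` is a hypothetical threshold for "a substantial amount".
    Returns 2/3, 1/3, or None when the VAF alone is not informative.
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    if not 0 <= vaf <= 1:
        raise ValueError("vaf must be in [0, 1]")

    single_copy_vaf = purity / 3  # expected VAF of a clonal SNV on one of three copies

    if vaf > single_copy_vaf + margin:
        # Too high for a single-copy SNV, even a clonal one, so the SNV
        # must sit on the duplicated allele (clonal or subclonal).
        return 2 / 3
    if abs(vaf - single_copy_vaf) <= margin:
        # Consistent with a clonal single-copy SNV. This is only valid
        # if you are willing to assume the SNV is clonal.
        return 1 / 3
    # Low VAF: could be a subclonal SNV with either phasing.
    return None


if __name__ == "__main__":
    for vaf in (0.6, 0.3, 0.1):
        print(vaf, guess_var_read_prob(vaf, purity=0.9))
```

A return value of None corresponds to the harder case discussed next, where the VAF cannot distinguish the two phasings.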

If the CNA is subclonal, or if the SNV is subclonal and belongs to a clone with a small subclonal frequency, then the problem becomes more difficult. The correct solution would be for Pairtree to sum over the possible phasings of the SNV, but we don't currently do that.
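To make "summing over the possible phasings" concrete, here is a minimal sketch of what such a marginalization could look like. Again, Pairtree does not currently do this, and the sketch rests on assumptions that are mine rather than Pairtree's code: a binomial read-count model whose success probability is var_read_prob times the subclonal frequency, a clonal single-allele duplication, and equal prior weight on the two phasings. The names `binom_pmf` and `phasing_marginal_likelihood` are hypothetical.

```python
from math import comb


def binom_pmf(k, n, q):
    """Binomial probability of k successes in n trials with success probability q."""
    return comb(n, k) * q**k * (1 - q) ** (n - k)


def phasing_marginal_likelihood(
    var_reads,
    total_reads,
    phi,
    phasings=((1 / 3, 0.5), (2 / 3, 0.5)),
):
    """Likelihood of the read counts, averaged over possible phasings.

    phi is the subclonal frequency of the population carrying the SNV.
    phasings is a sequence of (var_read_prob, prior_weight) pairs; the
    equal weights used by default are an assumption, not an estimate.
    """
    return sum(
        weight * binom_pmf(var_reads, total_reads, omega * phi)
        for omega, weight in phasings
    )


if __name__ == "__main__":
    # Hypothetical counts: 40 variant reads out of 100, SNV population at 90%.
    print(phasing_marginal_likelihood(40, 100, phi=0.9))
```

The point of the mixture is that neither var_read_prob value has to be fixed in advance; each phasing contributes in proportion to how well it explains the observed reads.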

In the Rehman et al. paper you mentioned, we deliberately filtered out variants lying in CNA-affected regions, and took only the diploid variants. That's the same strategy we used in Dobson et al. (https://cancerdiscovery.aacrjournals.org/content/10/4/568.abstract). In general, I think CNA calling in bulk sequencing data is a horrific mess, and so I try to avoid dealing with it if I can. :) Given uncertainties about phasing, I've resigned myself to just trying to work with diploid regions for most questions concerning cancer evolution.

I hope this helps! Please let me know if you have any other questions.

Originally posted by @jwintersinger in https://github.com/morrislab/pairtree/issues/3#issuecomment-778545577

Hi, Aaron! #44