calculating var_read_prob column?

ahgillmo commented 3 years ago

Hello, I am interested in trying Pairtree and I have some questions regarding SSM input file generation.

My questions concern the var_read_prob column of the input SSM file. I understand that this is the probability of observing a read from the variant allele. In the documentation example, the reference allele had a copy number of 2 and the alternate allele had a copy number of 1, making var_read_prob = 1/(2+1) = 0.33. I am unsure how to determine which allele is modified in the event of a CNA.

I have two questions:

1) Could you suggest which copy number callers to use, and/or how to infer which allele is copied in the event of a copy number aberration?

2) In the supplementary file of "Colorectal Cancer Cells Enter a Diapause-like DTP State to Survive Chemotherapy" (Supp file: Genetic Intratumoral Heterogeneity Analysis final.pdf), the var.read.probs are always set to 0.5, which represents the diploid regions mentioned in the paper. In the absence of allele-aware copy number data, is picking copy-number-neutral variants the correct way to run Pairtree?

Thank you for your time, Aaron

jwintersinger commented 3 years ago

Hi, Aaron!

You have a good question! Ideally, SNV calls should be phased so that you can know whether they lie on the duplicated allele or not. In the absence of phasing information, you might be able to get a hint of whether the SNV lies on the duplicated allele by examining its VAF. If you assume that the CNA and SNV are both clonal, and that your tumour purity is p, then, if the SNV's VAF is equal to about (1/3)p, you know that var_read_prob should be 0.33, and that the SNV happened after the duplication. If the SNV's VAF is (2/3)p, you know the SNV happened before the duplication, and it was on the allele that was duplicated, so that var_read_prob should be 0.66. Even if the SNV is subclonal, if its VAF exceeds (1/3)p by a substantial amount, you should be able to conclude that var_read_prob should be 0.66.
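The VAF heuristic above can be sketched in a few lines of Python. This is only an illustration under the same assumptions (clonal SNV, clonal single-copy gain giving three tumour copies, known purity p), and the function names `expected_vaf` and `guess_var_read_prob` are hypothetical helpers, not part of Pairtree. The sketch also counts the two copies contributed by normal cells, so its expected VAFs agree with the (1/3)p and (2/3)p figures only when p is close to 1.

```python
def expected_vaf(variant_copies, tumour_total_cn, purity, normal_cn=2):
    """Expected VAF of a clonal SNV in a mix of tumour and normal cells.

    Assumption: every tumour cell carries `variant_copies` mutant copies out of
    `tumour_total_cn` total copies, and normal cells carry `normal_cn`
    reference copies and no mutant copies.
    """
    variant = variant_copies * purity
    total = tumour_total_cn * purity + normal_cn * (1.0 - purity)
    return variant / total


def guess_var_read_prob(observed_vaf, purity, tumour_total_cn=3):
    """Pick the phasing whose expected VAF lies closest to the observed VAF.

    Returns variant_copies / tumour_total_cn, i.e. 1/3 if the SNV looks like it
    sits on a single copy (SNV after the duplication) and 2/3 if it looks like
    it sits on the duplicated allele (SNV before the duplication).
    """
    candidates = {
        variant_copies: expected_vaf(variant_copies, tumour_total_cn, purity)
        for variant_copies in (1, 2)
    }
    # Choose the copy count whose predicted VAF is nearest to what was observed.
    best = min(candidates, key=lambda m: abs(candidates[m] - observed_vaf))
    return best / tumour_total_cn


# Example: 90% purity, observed VAFs near each of the two hypotheses.
low = guess_var_read_prob(observed_vaf=0.30, purity=0.9)
high = guess_var_read_prob(observed_vaf=0.60, purity=0.9)
```

A nearest-hypothesis rule like this says nothing about how confident the call is; a VAF sitting midway between the two predictions is better treated as ambiguous, which is the situation the next paragraph describes.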

If the CNA is subclonal, or if the SNV is subclonal and belongs to a clone with a small subclonal frequency, then the problem becomes more difficult. The correct solution would be for Pairtree to sum over the possible phasings of the SNV, but we don't currently do that.

In the Rehman et al. paper you mentioned, we deliberately filtered out variants lying in CNA-affected regions, and took only the diploid variants. That's the same strategy we used in Dobson et al. (https://cancerdiscovery.aacrjournals.org/content/10/4/568.abstract). In general, I think CNA calling in bulk sequencing data is a horrific mess, and so I try to avoid dealing with it if I can. :) Given uncertainties about phasing, I've resigned myself to just trying to work with diploid regions for most questions concerning cancer evolution.
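For anyone wanting to apply the diploid-only strategy, here is a minimal sketch of the filtering step. It assumes you already have allele-specific copy number segments from some caller; the `Segment` and `Variant` classes and both function names are hypothetical and are not Pairtree's API. It also assumes autosomal variants, where a normal 1+1 state corresponds to var_read_prob = 0.5.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    chrom: str
    start: int      # inclusive
    end: int        # exclusive
    major_cn: int
    minor_cn: int


@dataclass
class Variant:
    name: str
    chrom: str
    pos: int


def is_diploid(variant, segments):
    """True only if a segment covers the variant and has the normal 1+1 state."""
    for seg in segments:
        if seg.chrom == variant.chrom and seg.start <= variant.pos < seg.end:
            return seg.major_cn == 1 and seg.minor_cn == 1
    # No copy number call covers this position. Dropping the variant is a
    # conservative choice made for this sketch, not something the thread states.
    return False


def diploid_var_read_probs(variants, segments):
    """Keep variants in diploid regions and assign each var_read_prob = 0.5."""
    return {v.name: 0.5 for v in variants if is_diploid(v, segments)}


segments = [
    Segment("1", 0, 1000, 1, 1),      # normal diploid
    Segment("1", 1000, 2000, 2, 1),   # single-copy gain
]
variants = [
    Variant("s0", "1", 500),
    Variant("s1", "1", 1500),
    Variant("s2", "2", 10),           # no segment covers chromosome 2
]
kept = diploid_var_read_probs(variants, segments)
```

Requiring major_cn == 1 and minor_cn == 1, instead of a total copy number of 2, also excludes copy-neutral loss of heterozygosity (2+0), where 0.5 would not be the right value.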

I hope this helps! Please let me know if you have any other questions.

THT-sleepy commented 9 months ago

Hello, Jeff! May I ask how you deal with potential driver mutations that lie in non-diploid regions (assuming that we include only mutations in diploid regions when clustering)? Is a post hoc assignment possible?

morrislab / pairtree

calculating var_read_prob column? #3