math behind var_read_prob and other questions

THT-sleepy commented 1 year ago

Hi Ethan

I have 2 questions:

1) In the reply to the issue "math behind var_read_prob" you said: "In the case of a normal copy number K = 2, a single variant allele M = 1, and a correct VAF, then var_read_prob = 1/2 which should provide a good estimate of CF regardless of purity." I don't understand this very well. Since M is the average number of copies of variant j in a cell, why is M not affected by purity? For example, assume that the multiplicity of variant j is 1, there is no CNV, and purity = 0.5; then shouldn't M be 0.5*1/2 = 0.25?

2) I have some paired WGS samples (60x) and I want to reconstruct a clone tree for each tumor. I'd like to use Pairtree, but it seems that it can't cluster the variants very well, and other clustering tools like PyClone seem to be designed for high-depth data. Which tool should I choose for the clustering? Or should I choose PhyloWGS, which seems better suited to WGS data?

thanks

Best Huatao

ethanumn commented 1 year ago

Hi Huatao,

I can provide some brief responses right now that can hopefully clear up some of the confusion.

1. I think you reposted an issue where Jeff responded to a similar question. One approach is to only use data from diploid regions that haven't been impacted by copy number aberrations.

Alternatively, we can use the population average copy number of variant j (M) and the population average copy number of the locus containing j (N) to compute the variant read probability as var_read_prob = M/N. This was previously discussed in #17.

First, we assume that variant j only appears in the cancerous cell population and that it originally occurred in a diploid region. Let p be the sample purity, and let k be the number of loci that could contain variant j in the cancerous cell population. In a healthy cell population, the number of loci that could contain variant j is 2. The population average copy number of the locus is therefore N = 2(1-p) + kp. This is equivalent to N = 2 + (k-2)p, which is the expression from #17 (expand 2(1-p) to 2 - 2p and collect the terms in p). M is not impacted by purity because it is, by definition, an average copy number over only the cancerous cell population.

If M and N can be computed with confidence, then you should be able to obtain reasonable CCF estimates using the variant allele frequency data for variant j.
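As a rough illustration of the relationship described in point 1, here is a minimal Python sketch. The function names `locus_copy_number` and `var_read_prob` are hypothetical helpers written for this example and are not part of Pairtree; the only formulas used are N = 2(1-p) + kp and var_read_prob = M/N from the text above.

```python
def locus_copy_number(purity: float, k: float) -> float:
    """Population average copy number of the locus: N = 2(1 - p) + k*p.

    purity (p) is the fraction of cancerous cells in the sample and
    k is the copy number of the locus in the cancerous cells.
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be in [0, 1]")
    return 2.0 * (1.0 - purity) + k * purity


def var_read_prob(M: float, purity: float, k: float) -> float:
    """Variant read probability M / N, with M as defined in the text above."""
    return M / locus_copy_number(purity, k)


if __name__ == "__main__":
    # For a diploid locus (k = 2), N stays at 2 whatever the purity,
    # so M = 1 gives var_read_prob = 0.5 at every purity value.
    for p in (0.25, 0.5, 1.0):
        print(p, locus_copy_number(p, 2), var_read_prob(1, p, 2))
```

With k = 2 the purity terms cancel in N, which is why a var_read_prob of 1/2 can be used for a diploid locus regardless of purity.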

2. Since the clustering software that Pairtree has is easy to integrate with running Pairtree, I think it's worth trying. It's also fairly easy to convert the input files for Pairtree to be run with PyClone-VI, and then these clusters can be converted back to a format usable by Pairtree. We are currently working on tools to help with these types of reconstruction problems, some of which will be released very soon.

THT-sleepy commented 1 year ago

Thank you very much! I got it: M is the multiplicity of variant j. So according to the formula CCFsnv = VAF * [2(1-p) + p(2 + (k-2) * CCFcnv)] / (p * M), CCFsnv actually depends on M, k, and CCFcnv (assuming p and VAF are already known). Am I right? By the way, I really appreciate your work and I am looking forward to your new tools!
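To make the dependence on M, k, and CCFcnv concrete, the formula above can be sketched in Python. This is only an illustration of the expression as written in this comment, not code from Pairtree; the function name `ccf_snv` is hypothetical, and it assumes the trailing "/pM" means division by the product p * M.

```python
def ccf_snv(vaf: float, purity: float, M: float, k: float, ccf_cnv: float) -> float:
    """CCF of an SNV from the formula quoted in this thread.

    CCFsnv = VAF * [2(1 - p) + p * (2 + (k - 2) * CCFcnv)] / (p * M)

    Assumption: the denominator is the product p * M.
    """
    if purity <= 0.0 or M <= 0.0:
        raise ValueError("purity and M must be positive")
    # Average copy number of the locus across normal and cancerous cells,
    # where only a fraction ccf_cnv of the cancerous cells carry the CNV.
    n_avg = 2.0 * (1.0 - purity) + purity * (2.0 + (k - 2.0) * ccf_cnv)
    return vaf * n_avg / (purity * M)


if __name__ == "__main__":
    # No CNV (k = 2), purity 0.5, multiplicity 1, VAF 0.25
    print(ccf_snv(vaf=0.25, purity=0.5, M=1, k=2, ccf_cnv=0.0))
```

When k = 2 the CCFcnv term drops out, so in copy-neutral regions the estimate depends only on VAF, p, and M.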

morrislab / pairtree

math behind var_read_prob and other questions #43