Which mutations does "node0" have??

oghzzang commented 6 months ago

Dear pairtree team, and Dr. @ethanumn.

I really appreciate this wonderful program.

However, I have several questions.

Q1. The pairtree output has one more node than the number of clusters in my input, because the output includes "node0". How can I tell which mutations "node0" contains?

Q2. Can I think about the mutations in each node like this?
node0 = ?
node1 = mutations that are in cluster1
node2 = mutations that are in cluster2

Q3. I couldn't understand the "Population frequencies" of pairtree (star-methods).

Q3-1. In these results, does Pop.0 mean the population frequency of node0?
Q3-2. Why does "sample3" have such a small portion of Pop.0?

Q4. In your paper, I saw this note: "Properly setting the var_read_prob is critical because Pairtree uses it to calculate the data-implied subclonal frequency as (var_reads / total_reads) / var_read_prob, representing the estimated proportion of cells bearing the mutation."

Then, can I estimate "var_read_prob" like this? var_read_prob = vaf / ccf

ccf: cancer cell fraction
vaf: variant allele fraction

Many thanks.

ethanumn commented 6 months ago

Hi there -

I'll answer your questions in the order you asked them.

1. Node 0 represents the germline clone. It does not contain any mutations from your analysis. All cancerous clones descend from some healthy germline cell, and that is what node 0 represents.
2. Yes, the nodes (node 1, node 2, etc.) correspond to the order of the clusters defined in your params file.
3. Yes, Pop 0 refers to node 0, Pop 1 refers to node 1, etc. Since each sample is a mixture of different clones (clusters), the population frequencies in each sample must sum to 1. If none of the mutations included in your analysis have a cellular prevalence of 1.0 in a sample, then it's likely the sample contains some healthy cells, and this is why you'll see Node 0 (healthy cells) have a population frequency that is greater than 0. Equivalently, this implies the purity of the sample is not 1.0, i.e., the sample contains a mixture of healthy cells and cancerous cells.
4. If you knew the CCF, then you wouldn't need to estimate the VAF or the var_read_prob. The VAF and var_read_prob can be deduced from sequencing data, and given these values you can use the formula you wrote to estimate the CCF.
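
As a rough illustration of the population-frequency and var_read_prob answers above, here is a minimal Python sketch. The function names and all numbers are hypothetical and are not part of Pairtree; the first function only encodes the data-implied frequency formula quoted from the paper, and the second only encodes the statement that the population frequencies in a sample sum to 1, with Pop 0 being the healthy cells.

```python
def data_implied_frequency(var_reads, total_reads, var_read_prob):
    """Data-implied subclonal frequency: (var_reads / total_reads) / var_read_prob."""
    return (var_reads / total_reads) / var_read_prob


def purity_from_population_freqs(pop_freqs, tol=1e-6):
    """Purity of one sample, given its population frequencies.

    pop_freqs[0] is Pop 0 (healthy cells); the remaining entries are the
    cancerous populations. Frequencies within one sample must sum to 1.
    """
    if abs(sum(pop_freqs) - 1.0) > tol:
        raise ValueError("population frequencies in a sample must sum to 1")
    return 1.0 - pop_freqs[0]


# Made-up numbers for a single sample, for illustration only.
example_pop_freqs = [0.25, 0.5, 0.25]  # Pop 0, Pop 1, Pop 2
example_purity = purity_from_population_freqs(example_pop_freqs)
example_freq = data_implied_frequency(var_reads=30, total_reads=100, var_read_prob=0.5)
```

With these made-up values, a quarter of the sample is healthy cells, so the implied purity is 0.75, and 30 variant reads out of 100 with var_read_prob = 0.5 imply a subclonal frequency of about 0.6.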

Please let me know if any of this is unclear.

-Ethan

oghzzang commented 5 months ago

Thanks a lot, @ethanumn .

I totally understood!!!

Then, I have several questions related to answer 4.

Q1. To be clear, here are the definitions I am using.

CCF: proportion of cancerous cells in a tumor containing a single-nucleotide variant

cellular prevalence: the portion of cancer cells harbouring a mutation, of the input sample.

var_read_prob: probability of observing a read of the mutation’s variant allele for cells bearing the mutation in a specific tissue sample

I think "var_read_prob" refers to the probability of observing a mutation in both normal cells and cancer cells, whereas "ccf" refers to the probability of observing a mutation in cancer cells only. So I think they are different. Is that right?

var_read_prob != ccf
ccf == cellular_prevalence

Q2

"If you knew the CCF, then you wouldn't need to estimate the VAF or the var_read_prob."

As far as I know, I need to estimate "var_read_prob" to run the pairtree analysis. If I know the CCF of every variant, can I use "ccf" in the pairtree input instead of "var_read_prob"? I think they are different.

Q3

"The VAF and var_read_prob can be deduced from sequencing data,"

As I understand it from your manual and paper, to estimate "var_read_prob" I need to know the purity and the allele-specific copy number. Is that right? Or did you just consider the total copy number, like this: var_read_prob = 1/total_cn?

total cn = total copy number

Q4. If the number of variant alleles is 0, must var_read_prob be 0? I don't know whether I should follow the information for each variant or the information for the subclusters (params, as in PyClone-VI). The definition of "var_read_prob" is the probability of observing a read of the mutation’s variant allele for cells bearing the mutation in a specific tissue sample.

Many thanks!!

ethanumn commented 5 months ago

Great, I'm glad. Here are some brief responses to your new questions.

1. This is a good clarification question. I updated my previous answer slightly. There is a slight mistake in your definitions. This issue here discusses some of the math behind var_read_prob. The definitions I'm going to use are going to be referenced from this article. I'll paraphrase them to provide a succinct response.

Here are the necessary definitions:

purity (p): fraction of cells in the sample that are cancerous.

variant allele frequency (VAF): fraction of reads for a genomic locus that contain the mutated allele.

total copy number of region in cancer cell population (N): the estimated average number of alleles (mutant + reference) for the genomic locus containing the SNV in the cancer cell population

multiplicity of a mutation (m): estimated number of alleles that contain the mutated allele, m = (VAF/p)(pN + 2(1-p)). I'll detail each part of this equation for clarity. 2(1-p) is the contribution of the healthy cell population to the average number of alleles per cell for the locus, and pN is the contribution of the cancerous cell population. Together, (pN + 2(1-p)) is the estimated average number of alleles for the locus among all cells (cancerous + healthy). VAF(pN + 2(1-p)) is the estimated average number of mutant alleles per cell among all cells (cancerous + healthy). Since the mutation should have occurred only in the cancerous cell population, we divide by the purity: (VAF/p)(pN + 2(1-p)) gives us the estimated number of mutant alleles per cell in the cancerous cell population.

variant read probability (var_read_prob): the estimated fraction of alleles for the locus containing the SNV that are the mutant allele, var_read_prob = m / (pN + 2(1-p)).

cellular prevalence (CP): fraction of all cells (healthy + cancerous) from the sample that carry the mutation, CP = VAF / var_read_prob.

cancer cell fraction (CCF): fraction of cancerous cells from the sequenced sample carrying the mutation, CCF = CP / p.

Putting all of this together, having the allele specific copy number estimate, purity, and VAF, you can calculate everything. CCF != CP, unless p=1.0.

2. If you know the CCF you can calculate the var_read_prob using the formula above. Pairtree's probabilistic framework uses the raw read counts and the implied VAF to evaluate the likelihood of the different tree structures. It requires the raw read counts and var_read_prob to translate between the different quantities.
3. I believe my answer in (1) already provides a response for this question.
4. If the mutation hasn't occurred, its placement in an evolutionary tree wouldn't be informative. It might make more sense to remove it from the analysis. If it occurred in some samples but not others, then a var_read_prob=0 for the samples in which it did not occur is totally fine. I believe Pairtree can handle this input properly.
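
The definitions in answer 1 can be sketched in Python as follows. This is only an illustration of the formulas as written in this thread, not Pairtree code: the function names, the `normal_cn` parameter (the healthy-cell copy number of 2 used in the formulas), and the example numbers are all assumptions. In the example, the multiplicity m is taken as already known.

```python
def average_copies(purity, tumor_cn, normal_cn=2):
    """Average number of alleles per cell at the locus: pN + 2(1 - p)."""
    return purity * tumor_cn + normal_cn * (1.0 - purity)


def multiplicity(vaf, purity, tumor_cn):
    """m = (VAF / p)(pN + 2(1 - p)), as defined above (not rounded)."""
    return (vaf / purity) * average_copies(purity, tumor_cn)


def var_read_prob(m, purity, tumor_cn):
    """var_read_prob = m / (pN + 2(1 - p))."""
    return m / average_copies(purity, tumor_cn)


def cellular_prevalence(vaf, vrp):
    """CP = VAF / var_read_prob (fraction of all cells carrying the mutation)."""
    return vaf / vrp


def cancer_cell_fraction(cp, purity):
    """CCF = CP / p (fraction of cancerous cells carrying the mutation)."""
    return cp / purity


# Made-up example: purity 0.8, diploid tumor locus (N = 2), one mutant copy (m = 1).
p, N, m, vaf = 0.8, 2, 1, 0.2
vrp = var_read_prob(m, p, N)         # about 0.5
cp = cellular_prevalence(vaf, vrp)   # about 0.4
ccf = cancer_cell_fraction(cp, p)    # about 0.5
```

In this made-up case CP and CCF differ because p < 1; setting p = 1.0 makes them equal, matching the statement that CCF != CP unless p = 1.0.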

oghzzang commented 5 months ago

These super helpful comments saved me.

I totally understand them now. Thank you so much @ethanumn .

This question will be the final question :)

In your comment,

m = (VAF/p)(pN + 2(1-p))             (1)
var_read_prob = m / (pN + 2(1-p))    (2)

Substituting (1) into (2), var_read_prob is equal to "VAF/purity".    (3)

CP = VAF / var_read_prob             (4)

Rearranging (4), var_read_prob is equal to "VAF/CP".    (5)

var_read_prob = VAF / purity         (3)
var_read_prob = VAF / CP             (5)

Following these equations, CP == purity and CCF is 1. I think the equation for CP needs to be revised.

In summary, I agree that var_read_prob is equal to "VAF/purity", as in your comment (3). However, I think CP should be VAF * CCF / var_read_prob. (my suggestion)

Actually, I expect your response is right, but I want to know your thoughts. I'll wait for your response.

Best,

Hayley

ethanumn commented 5 months ago

Hi Hayley -

So there's a little bit of detail I left out. The multiplicity of a mutation is assumed to be an integer. One way it can be rewritten is:

m = round( ( VAF / p)(pN + 2(1-p)) )

In the supplement of one of the papers I referenced above, they do a brief example. There are also methods that try to infer m probabilistically instead of using a formula like this.

Using this updated definition of m, things will not cancel out in the definition of var_read_prob:

var_read_prob = m / (pN + 2(1-p))

This results in

CP = VAF / ( m / (pN + 2(1-p)) )

and it follows that

CCF = VAF / ( (pm) / (pN + 2(1-p)) )

So if p = 1.0, then CP == CCF.
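
To make the effect of the rounding concrete, here is a small Python sketch with made-up numbers. The function names are hypothetical, and clamping m to at least 1 is my own assumption to avoid dividing by zero; it is not something stated above. Without rounding, var_read_prob collapses to VAF / p, so CP equals the purity and CCF equals 1. With rounding, that cancellation no longer happens.

```python
def average_copies(purity, tumor_cn, normal_cn=2):
    """Average number of alleles per cell at the locus: pN + 2(1 - p)."""
    return purity * tumor_cn + normal_cn * (1.0 - purity)


def estimates(vaf, purity, tumor_cn, round_m=True):
    """Return (m, var_read_prob, CP, CCF) using the formulas in this thread."""
    copies = average_copies(purity, tumor_cn)
    m = (vaf / purity) * copies
    if round_m:
        # Assumption: clamp to at least 1 so var_read_prob is never 0 here.
        m = max(1, round(m))
    vrp = m / copies
    cp = vaf / vrp
    ccf = cp / purity
    return m, vrp, cp, ccf


# Made-up example: VAF 0.3, purity 0.8, diploid tumor locus (N = 2).
unrounded = estimates(0.3, 0.8, 2, round_m=False)
rounded = estimates(0.3, 0.8, 2, round_m=True)
```

For these values the unrounded m is about 0.75, which gives CP of about 0.8 (the purity) and CCF of about 1. Rounding m to 1 gives var_read_prob of about 0.5, CP of about 0.6, and CCF of about 0.75.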

oghzzang commented 5 months ago

I got it finally. Thank you so much!

@ethanumn

morrislab / pairtree

Which mutations does "node0" have?? #47