raphael-group / decifer

DeCiFer is an algorithm that simultaneously selects mutation multiplicities and clusters SNVs by their corresponding descendant cell fractions (DCF).
BSD 3-Clause "New" or "Revised" License

Interpreting DeCiFer output for downstream analysis #20

Open Inceid opened 2 years ago

Inceid commented 2 years ago

Hello Dr. Raphael and others!

I'm currently running DeCiFer on somatic SNV calls for a patient with a large number of samples. I wanted to ask for some guidance on how to properly interpret DeCiFer's output for downstream analysis steps, especially phylogenetic analysis of the tumor's mutation history.

I have several questions below.

  1. DCF Metrics.

     1a. In the _output.tsv file, several DCF metrics are reported, including true_cluster_DCF and point_estimate_DCF. How is point_estimate_DCF calculated for each mutation? (My guess is that it's something like the mean of the posterior DCF distribution in Equation (30), but I'm not sure.)

     1b. Which values do you recommend using as the most reliable values for mutational prevalence for downstream analysis?

  2. I suspect the cluster assignment of some of our SNVs is not reliable. This is because these mutations have a similar negative log-likelihood value for their ideal cluster compared to other clusters they could be assigned. For example, one mutation is assigned to a cluster '7' with LH of 2523.34, but could be assigned to cluster '15' with a LH of 2523.66. I assume this is calculated roughly as the negative log-likelihood of Equation (33) in the paper. My question is this: in your opinion, what is a reasonable score difference threshold for asserting that one cluster assignment is "better" than all other possible assignments?

  3. Since our patient has 18 samples, DeCiFer starts with 20 clusters (18 sample-specific clusters plus the 'truncal' and 'absent' clusters), even if I ask it to conduct model selection for k=2 to 40 clusters. The model returns 12 clusters in the final answer. However, when I look at the _model_selection.tsv output, elbow scores are only shown for k=20 to 40. Is it possible to see the elbow scores for lower values as well?

  4. I am also concerned about the above behavior, as nothing in the paper specifies how DeCiFer removes clusters below the minimum number (p+2). The model selection procedure appears to only test p+2 clusters and higher values. For me, p+2 = 20. But somehow I'm getting 12 clusters in my output. Is this intended behavior?

  5. I'm skeptical about the results that I am getting in our copy number state trees. Essentially, our tumor has several copy number aberrations, but according to DeCiFer's state trees, none of them cause SNVs to ever gain copies. In other words, for all SNV state trees, copy number state (x, y, 2) does not exist for any x, y. I'm not sure why this is the case, as we have hundreds of SNVs in copy number aberrated regions. I would expect the mutated allele to suffer a copy gain at least some of the time. Do you have any recommendations for debugging and/or testing whether I did something wrong here?

Thanks for your time! Suraj

brian-arnold commented 2 years ago

Hi Suraj, I'll chime in and see if I can answer a few of your questions. I copied your questions below and my answers follow.

DCF Metrics. 1a. In the _output.tsv file, several DCF metrics are reported, including true_cluster_DCF and point_estimate_DCF. How is point_estimate_DCF calculated for each mutation? (My guess is that it's something like the mean of the posterior DCF distribution in Equation (30), but I'm not sure.) 1b. Which values do you recommend using as the most reliable values for mutational prevalence for downstream analysis?

The most reliable value is "true_cluster_DCF". This is the best estimate of the mutation cluster center, expressed as a DCF value (or a CCF value if --ccf is used), and it's calculated via equation S24. "point_estimate_DCF" is calculated using Equation 7 in the paper, and those estimates can be quite noisy. Importantly, SNVs are NOT assigned to clusters based on their "point_estimate_DCF" values. Instead, each SNV is assigned to whichever cluster has a center most consistent with its VAF: the DCF cluster center is converted to a VAF via equation S21, and this cluster-center VAF is then used in a binomial/beta-binomial model to compute the likelihood of the SNV's VAF (the "B(...)" term in equation S23).
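As a concrete illustration of that assignment rule, here is a minimal sketch in Python. It assumes the simplest possible case of a single clonal diploid (1|1) copy-number state with mutation multiplicity 1 and a plain binomial model; the real model (equations S21/S23) handles arbitrary state trees and offers a beta-binomial option, and the function names here are hypothetical:

```python
import math

def expected_vaf(dcf, purity, total_cn=2, mult=1, normal_cn=2):
    # Expected VAF for a cluster-center DCF; hypothetical simplification of
    # eq. S21 for one clonal copy-number state.
    num = dcf * purity * mult                              # mutated copies
    den = purity * total_cn + (1 - purity) * normal_cn     # all copies at locus
    return num / den

def binom_loglik(var_reads, tot_reads, p):
    # Binomial log-likelihood of var_reads out of tot_reads with success
    # probability p; a stand-in for the B(...) term in eq. S23.
    return (math.lgamma(tot_reads + 1) - math.lgamma(var_reads + 1)
            - math.lgamma(tot_reads - var_reads + 1)
            + var_reads * math.log(p)
            + (tot_reads - var_reads) * math.log(1 - p))

# Assign an SNV to the cluster whose center best explains its read counts.
centers = [1.0, 0.45, 0.10]   # candidate cluster-center DCFs (made up)
purity = 0.8
var, tot = 38, 100            # observed variant / total read counts (made up)
logliks = [binom_loglik(var, tot, expected_vaf(c, purity)) for c in centers]
best = max(range(len(centers)), key=lambda i: logliks[i])
```

With purity 0.8, a truncal center (DCF = 1.0) implies an expected VAF of 0.4, so an SNV at 38/100 reads lands in that cluster even though its raw point-estimate DCF would be noisier.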

I suspect the cluster assignment of some of our SNVs is not reliable. This is because these mutations have a similar negative log-likelihood value for their ideal cluster compared to other clusters they could be assigned. For example, one mutation is assigned to a cluster '7' with LH of 2523.34, but could be assigned to cluster '15' with a LH of 2523.66. I assume this is calculated roughly as the negative log-likelihood of Equation (33) in the paper. My question is this: in your opinion, what is a reasonable score difference threshold for asserting that one cluster assignment is "better" than all other possible assignments?

This is indeed the negative log-likelihood of the assignment probability. If x1 is the negative log-likelihood of the best assignment and x2 that of the second best, I've been using exp(-x1)/(exp(-x1) + exp(-x2)) to quantify the quality of the assignment. A value of ~0.5 means the assignment is ambiguous. What counts as "reasonable" is a bit arbitrary and up to you, but I've been keeping SNVs with a value of 0.9 or above.
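For NLLs as large as the ones in the question, exp(-x1) underflows to zero, so in practice you'd compute this from the difference of the two values; a small sketch (function name is mine):

```python
import math

def assignment_confidence(nll_best, nll_second):
    # exp(-x1) / (exp(-x1) + exp(-x2)), rewritten as 1 / (1 + exp(-(x2 - x1)))
    # so it stays numerically stable for large negative log-likelihoods.
    d = nll_second - nll_best
    return 1.0 / (1.0 + math.exp(-d))

conf = assignment_confidence(2523.34, 2523.66)
```

For the example above (2523.34 vs. 2523.66) this comes out around 0.58, well below a 0.9 cutoff, so that SNV's assignment would be considered unreliable.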

Since our patient has 18 samples, DeCiFer starts with 20 clusters. (18 plus the 'truncal' and 'absent' cluster), even if I ask for it to conduct model selection for k=2 to 40 clusters. The model returns 12 clusters in the final answer. However, when I look at the _model_selection.tsv output, elbow scores are only shown for k=20 to 40. Is it possible for me to see the elbow scores for lower values as well?

In every set of tumor samples, DeCiFer assumes there's a truncal cluster (DCF = 1.0), an 'absent' cluster, and p sample-specific clusters, where p is the number of samples; the sample-specific clusters contain SNVs that are unique to a particular sample and absent in all others. This is why p+2 is the minimum number of clusters, and decifer then performs model selection to see how many additional clusters, beyond this minimum, might explain the data. So elbow scores for fewer clusters do not exist.

I am also concerned about the above behavior, as nothing in the paper specifies how DeCiFer removes clusters below the minimum number (p+2). The model selection procedure appears to only test p+2 clusters and higher values. For me, p+2 = 20. But somehow I'm getting 12 clusters in my output. Is this intended behavior?

I believe this may be due to some of the sample-specific clusters not getting any mutations assigned to them, so they aren't printed out. For instance, if you have many samples from a tumor and some of these samples are very similar and have NO unique mutations within them, then the sample-specific cluster decifer has for that sample gets no SNVs assigned to it. You can confirm that this is happening by looking at the "cluster_index" column of the "*_clusterCIs.tsv" file. Are these indices consecutive, or are numbers within the [0-11] range missing?
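A quick way to check for such gaps (the file and "cluster_index" column are as described above; the helper name is mine):

```python
import csv

def missing_cluster_indices(path, col="cluster_index"):
    # Report cluster indices absent from a *_clusterCIs.tsv file; gaps mean
    # sample-specific clusters that received no SNVs.
    with open(path) as fh:
        present = {int(row[col]) for row in csv.DictReader(fh, delimiter="\t")}
    return [i for i in range(min(present), max(present) + 1)
            if i not in present]
```

If the returned list is empty (as it turns out to be below), the printed clusters are simply renumbered consecutively and the empty sample-specific clusters were dropped.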

I'm skeptical about the results that I am getting in our copy number state trees. Essentially, our tumor has several copy number aberrations, but according to DeCiFer's state trees, none of them cause SNVs to ever gain copies. In other words, for all SNV state trees, copy number state (x, y, 2) does not exist for any x, y. I'm not sure why this is the case, as we have hundreds of SNVs in copy number aberrated regions. I would expect the mutated allele to suffer a copy gain at least some of the time. Do you have any recommendations for debugging and/or testing whether I did something wrong here?

Hmm, interesting. I'm not entirely sure what's happening here, but my best guess is that the alternative state trees (e.g., ones containing (x,y,0) vs. (x,y,2)) explain the data equally well, and in these situations decifer always selects the first genotype, (x,y,0). How large are your input files? Could you attach them here?

Inceid commented 2 years ago

Hi Brian,

Thanks for the reply! I wanted to address your responses. I will quote them when appropriate.

  1. DCF Metrics.

    The most reliable value is "true_cluster_DCF". This corresponds to the best estimate of the mutation cluster center, expressed as a DCF value or CCF value if --ccf is used, and it's calculated via equation S24. "point_estimate_DCF" is calculated using Equation 7 in the paper, and they can be quite noisy. Importantly, SNVs are NOT assigned to clusters based on their "point_estimate_DCF" values. Instead SNVs are assigned to whatever cluster that has a center that is most consistent with its VAF, where the DCF cluster center is converted to a VAF via equation S21 and this cluster center VAF is then used in a binomial/betabinomial model to compute the likelihood of the SNV's VAF ("B(...)" equation S23).

Just to make sure, I should use the first value, correct? Since the results for true_cluster_DCF are reported as cluster center; (lower cluster CI, upper cluster CI), I assume cluster center - the first value - is the one to use. Thanks for the additional clarification on how SNVs are assigned to clusters.

Another question - should I expect the cluster center value to be between lower cluster CI and upper cluster CI? This is not the case for several SNVs. I have values like 1.0;(0.975,0.9783), for example.
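To flag such rows programmatically, a quick parse of the reported center;(lower CI,upper CI) format (helper name is mine) could look like:

```python
def parse_dcf_field(field):
    # Parse a DeCiFer "center;(lo,hi)" string, e.g. "1.0;(0.975,0.9783)".
    center_str, ci_str = field.split(";")
    lo, hi = (float(v) for v in ci_str.strip("()").split(","))
    return float(center_str), lo, hi

center, lo, hi = parse_dcf_field("1.0;(0.975,0.9783)")
inside = lo <= center <= hi   # False for this example row
```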

  2. Log-likelihood

    This is indeed the negative log-likelihood of the assignment probability. If x1 is the best assignment (in negative log units) and x2 is the second best assignment, I've been using exp(-x1)/(exp(-x1) + exp(-x2)) to quantify the quality of the assignment. If it's ~0.5 this is bad. What is "reasonable" is a bit arbitrary and up to you, but I've been using SNVs with a value of 0.9 or above.

Understood! I'll see if I can adapt your guidelines to my SNVs.

  3. Cluster indices

    I believe this may be due to some of the sample-specific clusters not getting any mutations assigned to them, so they aren't printed out. For instance, if you have many samples from a tumor and some of these samples are very similar and have NO unique mutations within them, then the sample-specific cluster decifer has for that sample gets no SNVs assigned to it. You can confirm that this is happening by looking at "cluster_index" column of the "*_clusterCIs.tsv" file. Are these indices consecutive, or are numbers within the [0-12] range missing?

This seems like a reasonable explanation. I just checked my _clusterCIs.tsv output and the indices are all consecutive from 0 to 11. So I suppose you're right - 8 of the 20 clusters are just empty.

  4. CN States

    Hmm interesting. I'm not entirely sure what's happening here but my best guess is that each state tree (e.g. (x,y,0) and (x,y,2)) explain the data equally well, and in these situations decifer always selects the first genotype (x,y,0). How large are your input files? Could you attach them here?

Sure! I'm attaching the input to DeCiFer for this patient. All mutations are somatic. There are 692 mutations total. 111e6e61e1.purity.txt decifer.input.txt

Inceid commented 2 years ago

Also, where are equations S21-S24 in the paper? I do not see them in the main paper nor the supplemental.

Edit: I found the equations here in the biorxiv preprint from February 27, 2021, but not in the main Cell Systems manuscript or supplement. Are these the equations (S21-S24) you're referring to? Just wanted to make sure there isn't a later updated version.

brian-arnold commented 2 years ago

Hi Suraj,

Apologies for the confusion! I was indeed referring to equations in the supplement of the article posted on bioRxiv. I referred to equations S24 and S23, which correspond to equations 31 and 30, respectively, in the supplement of the published article.

Could I ask you to rerun decifer, modifying your input in two ways:

(1) This is optional and not strictly required, but in your character_index column you label mutations in a way that includes a space. Glancing at the code, I don't think this should have an effect, but it isn't the way we recommend input files be formatted.

(2) Did you use the script we provide to generate your input file? In your input file, after the "ref" and "var" columns, you're supposed to provide information on allele-specific copy numbers and their clone proportions. You do provide this, but you have duplicate entries for some copy number states: looking at the first line after the header, for a specific mutation in a specific sample, I see multiple entries for the copy number state 1|1. These should be collapsed, and all clones that share the same copy number state need to be merged into a single entry with a summed proportion, which the script we provide to generate input files does automatically.
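For what it's worth, that collapsing step could be sketched like this (a hypothetical helper, not the script's actual code):

```python
from collections import defaultdict

def collapse_cn_states(states):
    # Merge duplicate (major, minor) copy-number states for one mutation in
    # one sample by summing their clone proportions, as the provided input
    # script does automatically. `states` is a list of
    # (major, minor, proportion) tuples.
    merged = defaultdict(float)
    for major, minor, prop in states:
        merged[(major, minor)] += prop
    return [(cn[0], cn[1], p) for cn, p in sorted(merged.items())]

# Two clones share the 1|1 state; they collapse into one entry with the
# summed proportion, leaving the 2|1 entry untouched.
collapsed = collapse_cn_states([(1, 1, 0.30), (1, 1, 0.25), (2, 1, 0.45)])
```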

Brian

Inceid commented 2 years ago

Hi Brian,

Understood, thanks for clarifying!

(1) I will do so today, thanks for letting me know.

(2) Are you referring to vcf_to_decifer.py? I did not use this script - I manually generated the input. Perhaps it's better to use that script. I'll do so today.

Inceid commented 2 years ago

Hi Brian,

Mutation calling is still running - it will take a while for a patient with this many samples.

In the meantime, I manually fixed the issues that you mentioned (remove the space character, merge duplicate copy number states, et cetera). I re-ran DeCiFer and am posting the results - input and output - here. They seem to look better/more realistic, but I'm wondering what you think.

111e6e61e1.decifer.input.txt 111e6e61e1-out_clusterCIs.txt 111e6e61e1-out_model_selection.txt 111e6e61e1-out_output.txt

Thanks!

brian-arnold commented 2 years ago

They look decent! I do think there may be too many clusters, though. For instance, cluster 29 has the most SNVs of any cluster, and they all have high DCF values (0.75-0.85). I suspect these are SNVs that should have been assigned to the truncal cluster (cluster 1). To encourage that, you could try increasing the elbow parameter to something very high (e.g. 0.6 or 0.9). I also suggest using the --printallk option to have decifer print the output for all values of K it explored, to get an idea without rerunning decifer many times with different elbow parameter values.

Inceid commented 2 years ago

Hi Brian,

To follow up, I wanted to ask a couple of quick questions about your preprocessing for DeCiFer as I start adding more samples to our analysis. I'm trying to adapt vcf_2_decifer.py to our own file formats due to convenience on our end.

  1. HATCHet sometimes gives copy number calls of 0,0 for the maternal/paternal allele in some segments. Several such calls have a prevalence greater than 0.0. We use HATCHet's calls as input for DeCiFer. These clones should be removed during preprocessing, right? It wouldn't make sense to have read data for an SNV but also have 0,0 as the copy number state.

  2. Do you automatically remove copy number clones with a prevalence of 0.0 during preprocessing? Does it make a difference to DeCiFer?

  3. We were thinking of running DeCiFer in a less strict mode and re-clustering groups of mutations after the fact to avoid overclustering certain mutations that we want to validate. What elbow criterion would you recommend for a "lax" DeCiFer clustering? I was thinking of using 0.01.

brian-arnold commented 2 years ago

Hi Suraj, apologies for the delayed reply.

  1. If the (0,0) copy number state is subclonal, then presumably you could have SNVs in that region if another subclone has a copy number >= 1. But you're correct that it wouldn't make sense to have a somatic SNV in these regions, unless it's normal contamination. At the moment we don't remove (0,0) copy number states, but we probably should when the (0,0) state has a proportion equal to the tumor purity: that is either a bad copy number call or, if it is accurate, a region where we wouldn't expect SNVs, as you mention.
  2. We don't remove copy number clones with proportion 0.0, but we do combine identical copy number states at a locus, so if a particular copy number state were represented twice, once with a proportion of 0.0 for whatever reason, the duplicate would effectively get removed.
  3. Unfortunately, what a "lax" elbow criterion is may vary with the dataset, but I've used values as low as 0.002 and then used the --printallk option to see what the results would be under other elbow criteria. We are currently looking into model selection criteria other than the elbow method.