morrislab / pairtree

Pairtree is a method for reconstructing cancer evolutionary history in individual patients, and analyzing intratumor genetic heterogeneity. Pairtree focuses on scaling to many more cancer samples and cancer cell subpopulations than other algorithms, and on producing concise and informative interactive characterizations of posterior uncertainty.
MIT License

Selecting a solution based on negative log-likelihood #24

Closed ahgillmo closed 1 year ago

ahgillmo commented 2 years ago

Hello, Pairtree team. I am curious about the concentration parameter and its relationship with the number of populations and the negative log-likelihood (NLL). I ran Pairtree with different concentrations and tracked the output statistics. When I decrease the concentration, I often get fewer populations and varying NLL values, but I don't know which solution to select.

My questions are:

  1. Can I compare NLL values between runs of Pairtree on the same sample? For example, G01 is run with concentrations of -2, -3 and -4. Is the NLL of 23.34067994 (concentration -2) better than the NLLs at concentrations -3 and -4 (~27)?
  2. (a) Can I compare NLL values between Pairtree runs on different samples? For example, between G01, G02 and G03: ~24 for G01 vs. ~4.7 for G03.
     (b) Is there any recommended cut-off for the negative log-likelihood? Does an NLL of 60 mean the solution is unreliable compared with an NLL of 5?


Sample | Concentration | NumSamples | NumClones | SampleType | NumberMutations | TopTree_nll (low is better) | TopTree_SoftMax_prob (high is better)
-- | -- | -- | -- | -- | -- | -- | --
G01 | -2 | 2 | 6 | Long | 2390 | 23.34067994 | 0.146525601
G01 | -3 | 2 | 5 | Long | 2390 | 27.40725892 | 0.504265007
G01 | -4 | 2 | 5 | Long | 2390 | 27.12646534 | 0.504265007
G02 | -2 | 4 | 12 | Long | 407 | 26.55429561 | 0.517743722
G02 | -3 | 4 | 10 | Long | 407 | 24.67445361 | 0.475983227
G02 | -4 | 4 | 9 | Long | 407 | 27.76296263 | 0.475983227
G03 | -2 | 3 | 5 | Long | 404 | 4.753076532 | 0.491227906
G03 | -3 | 3 | 3 | Long | 404 | 5.133140314 | 0.983822521
G03 | -4 | 3 | 3 | Long | 404 | 5.143257042 | 0.983822521

Thank you very much for your time. Aaron

ethanumn commented 2 years ago

Hi Aaron -

I assume that you’re talking about the --concentration parameter for clustervars. Both of the models used by clustervars to group mutations into subclones are Dirichlet process mixture models (DPMMs). Both have a user-set parameter called the concentration (written as \alpha in the equations for a DPMM). When clustering mutations, we compute the probability that a mutation fits into each of the existing clusters and the probability that it should be placed into a new cluster. The probability that it is placed into a new cluster is proportional to the concentration: a larger concentration increases the probability that we place the mutation into a new cluster, and a smaller concentration decreases it.

Based on this, it makes sense you are getting a smaller number of subclones as your concentration is decreased.
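As a rough illustration of that effect, here is a minimal sketch of the Chinese restaurant process view of a DPMM prior, in which the prior probability of opening a new cluster after `n` items have been assigned is `alpha / (n + alpha)`. It is not Pairtree's code: the function names are hypothetical, it looks only at the prior and ignores the read data, and it assumes the --concentration values are log10(alpha). The negative values in the question suggest that scale, but it is worth confirming against the clustervars help text.

```python
def crp_new_cluster_prob(alpha: float, n_assigned: int) -> float:
    """Prior probability that the next item opens a new cluster (CRP)."""
    return alpha / (n_assigned + alpha)


def expected_num_clusters(alpha: float, n_items: int) -> float:
    """Expected number of clusters under the CRP prior alone.

    Item i (0-based) opens a new cluster with probability alpha / (alpha + i),
    so the expectation is the sum of those probabilities.
    """
    return sum(alpha / (alpha + i) for i in range(n_items))


if __name__ == "__main__":
    n_mutations = 2390  # number of mutations reported for G01 in the table
    for log_conc in (-2, -3, -4):
        alpha = 10.0 ** log_conc  # ASSUMPTION: --concentration is log10(alpha)
        p_new = crp_new_cluster_prob(alpha, 100)
        exp_k = expected_num_clusters(alpha, n_mutations)
        print(f"log10(alpha)={log_conc}: P(new | 100 assigned)={p_new:.2e}, "
              f"prior E[clusters]={exp_k:.4f}")
```

Under the prior alone, a smaller alpha makes new clusters less likely. The number of clusters actually reported also depends on how strongly the read counts support separating mutations, so the prior expectation printed here will not match NumClones in the table.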

Now to answer your specific questions:

  1. Yes, you can compare the negative log-likelihood of the trees constructed during different runs of Pairtree when using the same data. It’s important to note though that you are changing the clustering of mutations between runs of Pairtree, which can heavily impact the clone trees that are constructed. Although the negative log-likelihood may be higher for a tree with more subclones, it doesn’t necessarily mean that tree is worse. It may just mean that the clustering of mutations was poor, or that a particularly noisy mutation is not being placed into its own cluster after you decrease the concentration parameter. Depending on your data, it might be correct to place a mutation with noisy data into its own cluster (i.e. having more subclones makes sense).

  2. (a) Probably not, but I’ll elaborate further. Let’s say you have a single dataset with which you cluster mutations and then construct clone trees. After obtaining your results, you decide to cluster mutations and construct clone trees using a subset of the original dataset. It’s almost guaranteed that your negative log-likelihood will be lower using the subset of your original data simply because you’re using less data. This is because we compute the negative log-likelihood using the binomial probability for each subclone after adjusting its subclonal frequency to fit the tree. Each subclone has a certain number of variant and reference reads associated with it (equal to the sums of the variant and reference reads over all the mutations the subclone contains). As the total read count for a subclone increases, so does the variance of its binomial distribution over variant read counts. The probability mass is spread over more possible counts, so any given variant read count receives a lower probability (even though the likely variant read counts become more tightly concentrated, relative to the total, around the subclonal frequency). The same idea applies when trying to compare trees from two different datasets.

     (b) As I described in part (a), it might not make sense to assign a cutoff for the negative log-likelihood because it will change depending on the size of the dataset. Instead, I recommend that you generate some baseline results and then adjust the parameters of the Pairtree algorithm (e.g. increase the number of trees sampled) or adjust the clustering of the mutations. I think the approach you’ve taken makes a lot of sense, and looking at the trees constructed and inspecting the mutations in each cluster can inform how you might want to modify your inputs or parameters to improve your results.
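The read-depth effect described in (a) can be seen with a plain binomial likelihood. The sketch below is a simplification under stated assumptions, not Pairtree's actual likelihood: `binom_nll` is a hypothetical helper, the variant read probability is taken to be the frequency itself, and the observed variant count is set to exactly its expected value, which is the best case for the fit.

```python
import math


def binom_nll(var_reads: int, total_reads: int, freq: float) -> float:
    """Negative log of the binomial pmf (in nats) for var_reads out of total_reads."""
    log_coeff = (
        math.lgamma(total_reads + 1)
        - math.lgamma(var_reads + 1)
        - math.lgamma(total_reads - var_reads + 1)
    )
    log_pmf = (
        log_coeff
        + var_reads * math.log(freq)
        + (total_reads - var_reads) * math.log(1.0 - freq)
    )
    return -log_pmf


if __name__ == "__main__":
    freq = 0.25  # illustrative frequency, not taken from the issue
    for total in (100, 1_000, 10_000):
        var = int(freq * total)  # observed count equals its expectation
        print(f"total reads={total:>6}: NLL at best-fitting count = "
              f"{binom_nll(var, total, freq):.3f}")
```

Even when the observed count matches the expected count exactly, the NLL grows with the total read count. This is why a raw NLL from a dataset with many reads is not directly comparable to one from a dataset with few.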