hdng / clonevol

Inferring and visualizing clonal evolution in multi-sample cancer sequencing
GNU General Public License v3.0
142 stars 45 forks source link

How to set founding.cluster? #4

Closed crazyhottommy closed 6 years ago

crazyhottommy commented 7 years ago

Hi, I have several questions on how to set the parameters for the infer.clonal.models function. First, the help page does not explain all the arguments. e.g. how do I set the founding.cluster? in the github page example, there is a cluster.center argument, but I do not see it in the help.

How many times of bootstrap usually is enough? default is 1000, and for around 300 mutations, it is taking several hours to run, and not finish yet..

Thanks for giving more information.

Best, Tommy

hdng commented 7 years ago

Assuming this is monoclonal cancer initiation, the founding.cluster is often the cluster with the highest CCF (or VAF) in most (or in ideal situation, all) samples. 1000 bootstraps is reasonable for first run. The number of samples and the number of clusters affect the performance time more. How many clusters/samples do you have for your case? You may need some clean-up (eg. removing small or noise clusters). Also, are there many clusters with very small CCF? That may increase the number of models to consider.

crazyhottommy commented 7 years ago

Thanks! I only have two samples, but do have smaller clusters. I removed the cluster with very few mutations (what's the cut-off? , I removed clusters with only 1 mutation).

it finished much sooner, but none model were found even at p < 0.1.

what is alpha for?

x <- infer.clonal.models(variants= clonevol_input,
            cluster.col.name="cluster",
            model = "polyclonal",
            vaf.col.names=vaf.col.names,
            subclonal.test="bootstrap",
            subclonal.test.model="non-parametric",
            cluster.center="mean",
            num.boots=1000,
            founding.cluster= NULL,
            min.cluster.vaf=0.01,
            p.value.cutoff= 0.1,
            alpha= 0.5,
            random.seed=63108)

Sample 1: Pa26T1.vaf <-- Pa26T1.vaf
Sample 2: Pa26T2.vaf <-- Pa26T2.vaf
Using polyclonal model
Note: all VAFs were divided by 100 to convert from percentage to proportion.
Generating non-parametric boostrap samples...
Pa26T1.vaf : Enumerating clonal architectures...
Determining if cluster VAF is significantly positive...
Exluding clusters whose VAF < min.cluster.vaf=0.001
Non-positive VAF clusters:  
Pa26T1.vaf : 910 clonal architecture model(s) found

Pa26T2.vaf : Enumerating clonal architectures...
Determining if cluster VAF is significantly positive...
Exluding clusters whose VAF < min.cluster.vaf=0.001
Non-positive VAF clusters:  
Pa26T2.vaf : 376 clonal architecture model(s) found

Finding matched clonal architecture models across samples...
Found  0 compatible model(s)
Found 0 compatible evolution models
Pruning merged clonal evolution trees....
Number of unique pruned trees: 0 
Scoring models...
0 model(s) with p-value <= 0.1 

***WARN: Inra-tumor heterogeneity could result in a clone (eg. founding)
         that is not present (no cells) in any samples, although  detectable via
         clonal marker variants due to that its subclones are distinct across
         samples. Therefore, a model with a higher p-value for the CCF of such
         a clone can still be biologically consistent, interpretable, and
         interesting! Manual investigation of those higher p-value models
         is recommended
hdng commented 7 years ago

The cutoff depends on how confident you are on the clusters, eg. if your data are deep enough such that you can trust a cluster of >=5 variants, then use 5. However, this is a bit arbitrary though, and there is no formula for this. I would visualize the clusters first (eg. a pairwise plot of CCF of the variants grouped by clusters for the two samples), and check if the clusters look reasonable. Also, it seems that your scaled VAF is from 0-1. The default for clonevol is 0-100 (percentage). This probably does not affect the underlying inference, but it is good to use 1-100 or set vaf.in.percent=FALSE when calling infer.clonal.models. You should also set alpha and p.value.cutoff the same, and try lower value, eg. 0.01.

crazyhottommy commented 7 years ago

Thanks! I did scale variant_prevalence by *100/2 to get VAF ranging from 0-50. so alpha and p.value should be the same? what alpha is?

Even after playing with different arguments, I still do not have any matched clonal architecture models across samples. what should I do with them?

I know the paper is under review, I would like to read to understand more of the tool.

Thanks, Tommy

hdng commented 7 years ago

Visualization of the clusters would help identifying problems with the clustering, eg. is there an outlier or abnormal cluster that conflicts between samples or shows sign of polyclonal cancer initiation? Clonevol does require a reasonably good clustering to start with. Alpha is the significant threshold to identify if a clone exists in a sample. I am working on the manual and technical document (takes too long already sorry), and the manuscript should be posted on bioRxiv soon too. Stay tuned. Thanks.

crazyhottommy commented 7 years ago

looking forward to read it. at the same time, I attached a cluster VAF scatter plot. clonevol does not output any model, you see any problems with it? Thanks!

Tommy

Pa30T1_vs_Pa30T2.pdf

hdng commented 7 years ago

Looks a little odd. Is this the scaled VAF based on CCF estimated via Pyclone? Should the highest VAF be near 50%?

crazyhottommy commented 7 years ago

sorry, when plot the VAF scatter plot, I used the raw output from pyclone, I did not *100/2 to get the scaled VAF. but if I scale it, it will be the same pattern but just different scale on X and Y axis.

However, when I put the data into clonevol, I did scale to 0-50. So, the pattern of the VAF scatter plot look odd ?

Thanks very much for your help!

hdng commented 7 years ago

Sorry for slow response. I think some models can be inferred, but you need a tweak.

1) You can try plotting the models using plot.clonal.models function. If no matched model is there, clonevol will plot all models for individual samples. This helps debugging what clusters are not compatible across samples.

2) Looks like pyclone put a hard cut on the CCF at 100% due to their statistical model. It seems that tumor purity does not affect their model, so I think if you divide your original VAF by 2 to fake that the tumor purity is always < 50%, thus posterior CCF estimated by pyclone is maxed at 50% (equivalent to 100%), while still allowing some to go over 50%, this would make the scaled VAF looks reasonable, and you can multiply scaled VAF by 2 to get back to the original purity, and run clonevol. Don't forget to plot CCF or VAF like you did.

Are their many copy altered variants? If not, you can try to cluster by SciClone as well.

crazyhottommy commented 7 years ago

Thank you for your answer.

  1. those two samples are from the same tumor of the same patient , so what's wrong if no compatible models are inferred? I plot the models individually. please see attachment. how to explain it?

  2. you are saying I need to divide the VAF (calculated from samtools mpileup) by 2 before feeding into pyclone? then multiply pyclone output CCF by 2 to run clonevol? what's the advantage to do this?

I also attached the scaled VAF (pyclone CCF *100/2) for you to test if you can.

Thanks again!

model-Pa30T2.vaf.pdf model-Pa30T1.vaf.pdf clonevol_input.txt

hdng commented 7 years ago
  1. Assuming that the two samples have the same origin, they must have the same clonal structure (tree).
  2. Imagine a variant with VAF=60% (no CN) will be estimated to have CCF=100% by pyclone, thus when converting back to VAF giving you scaled VAF=50%. If you divide VAF by 2 prior to pyclone analysis (instead of dividing after pyclone analysis), then use CCF from pyclone as VAF, you'll get scaled VAF = 60%. Let me know if this works. I am happy to dig deeper to find the best strategy for analyzing pyclone results.
crazyhottommy commented 7 years ago

My data have a lot of copy number changes, so I did not use sciclone. For pyclone, one has to provide both the tumor purity (inferred from other tools such as sequenza) and the VAF to the tool. so the end results of CFF always ranges from 0 to 1.

hdng commented 7 years ago

What CCF range you'll get if you specify tumor purity of 100% and also divide VAF by 2?

crazyhottommy commented 7 years ago

CCF will be always from 0 to 1. pyclone uses the tumor purity and copynumber information in addition to VAF to identify clusters.

hdng commented 7 years ago

Yes it will be from 0-1, but do you see a smooth range when plotting or a hard cut at 1? If it is smooth, then you can use clonevol.