too many models across samples

ysbioinfo commented 7 years ago

Hi, I am using ClonEvol to analyze the output from PyClone. The data are WES and the read depth is not very deep, so I filtered out all clusters with fewer than 10 mutations. However, when I run ClonEvol, it infers too many models across samples and I do not know how to choose among them. It reports 3123 models; is it normal for so many models to be inferred for your data? And what do the 5 unique trees mean? I just want to plot the most likely phylogenetic tree. My pairwise figure is attached and my output is below. Thanks!

Sample 1: HCC772_1_1 <-- HCC772_1_1
Sample 2: HCC772_1_3 <-- HCC772_1_3
Sample 3: HCC772_2_1 <-- HCC772_2_1
Sample 4: HCC772_2_2 <-- HCC772_2_2
Sample 5: HCC772_2_3 <-- HCC772_2_3
Sample 6: HCC772_3_1 <-- HCC772_3_1
Sample 7: HCC772_3_2 <-- HCC772_3_2
Using monoclonal model
Note: all VAFs were divided by 100 to convert from percentage to proportion.
Generating non-parametric boostrap samples...
HCC772_1_1 : Enumerating clonal architectures...
Determining if cluster VAF is significantly positive...
Exluding clusters whose VAF < min.cluster.vaf=0.05
Non-positive VAF clusters: 9,10,5,4,6,7
HCC772_1_1 : 26 clonal architecture model(s) found

HCC772_1_3 : Enumerating clonal architectures...
Determining if cluster VAF is significantly positive...
Exluding clusters whose VAF < min.cluster.vaf=0.05
Non-positive VAF clusters: 9,8,5,6,4,7
HCC772_1_3 : 27 clonal architecture model(s) found

HCC772_2_1 : Enumerating clonal architectures...
Determining if cluster VAF is significantly positive...
Exluding clusters whose VAF < min.cluster.vaf=0.05
Non-positive VAF clusters: 7,6,4,5,10,8
HCC772_2_1 : 23 clonal architecture model(s) found

HCC772_2_2 : Enumerating clonal architectures...
Determining if cluster VAF is significantly positive...
Exluding clusters whose VAF < min.cluster.vaf=0.05
Non-positive VAF clusters: 6,8,10,5,7,9
HCC772_2_2 : 31 clonal architecture model(s) found

HCC772_2_3 : Enumerating clonal architectures...
Determining if cluster VAF is significantly positive...
Exluding clusters whose VAF < min.cluster.vaf=0.05
Non-positive VAF clusters: 8,6,10,4,5,9
HCC772_2_3 : 26 clonal architecture model(s) found

HCC772_3_1 : Enumerating clonal architectures...
Determining if cluster VAF is significantly positive...
Exluding clusters whose VAF < min.cluster.vaf=0.05
Non-positive VAF clusters: 7,4,10,8,5,9
HCC772_3_1 : 28 clonal architecture model(s) found

HCC772_3_2 : Enumerating clonal architectures...
Determining if cluster VAF is significantly positive...
Exluding clusters whose VAF < min.cluster.vaf=0.05
Non-positive VAF clusters: 7,4,10,9,6,8
HCC772_3_2 : 17 clonal architecture model(s) found

Finding matched clonal architecture models across samples...
Found 3123 compatible model(s)
Merging clonal evolution trees across samples...
Found 3123 compatible evolution models
Pruning merged clonal evolution trees....
Number of unique pruned trees: 5
Scoring models...
2823 model(s) with p-value <= 0.01

variants.pairwise.plot.scatter.1-page.pdf

ysbioinfo commented 7 years ago

I am also attaching a flow figure of the VAFs; I hope it helps you figure out the problem in my data. As you can see, the heterogeneity in my data is extremely high: almost every sample has a cluster of mutations that is unique to it. How should I deal with this kind of data? Thanks!

flow.pdf

hdng commented 7 years ago

This is due to the flexibility in placing the low-frequency subclones private to a sample when ClonEvol performs clonal ordering. Those private subclone placements (often) do not have to agree across samples, since each is found in only one sample. Say each sample has two models that differ only in the private subclone placement: you'll then have 2^N models (with N = 7 samples, 2^7 = 128). You can see why the number of models becomes enormous, and they are all consistent with the data.
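To make the counting argument concrete, here is a minimal Python sketch. It is not ClonEvol code; `count_combinations` is a hypothetical helper that simply multiplies per-sample model counts, on the assumption that private placements can be chosen independently in each sample. The second call uses the per-sample counts from the log above.

```python
from math import prod


def count_combinations(models_per_sample):
    """Multiply per-sample model counts (hypothetical helper, not ClonEvol API).

    Assumes each sample's model can be chosen independently of the others,
    which is the situation described for private subclone placements.
    """
    return prod(models_per_sample)


# Illustrative case from the comment: two models per sample, N = 7 samples.
print(count_combinations([2] * 7))  # 128

# Per-sample model counts reported in the log above.
log_counts = [26, 27, 23, 31, 26, 28, 17]
print(count_combinations(log_counts))  # 6194509776
```

The second number is only the count of all raw combinations; the log reports 3123 of them as compatible across samples, which then collapse to 5 unique pruned trees.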

The good thing is that the large number of models you found represents only 5 pruned trees (which leave the private subclones out), so you can interpret the inter-sample relationships without worrying about private events.

One caveat: this could also be due to noise or errors in your data, so I would try refining the clustering further to see if it cleans up. The variant.box.plot function is extremely useful for digging deeper into the clustering.

ysbioinfo commented 7 years ago

Thanks for your fast reply! To be honest, I don't know how to refine the clusters further besides removing the small ones; could you give some advice? For example, ClonEvol cannot find a model across samples when I input the attached data. I plotted its box plot and flow figure, but I do not know how to choose clusters based on these figures. Would you be so kind as to check these two figures and tell me what can be done to get a reasonable model across samples?

1239_box_plot.pdf 1239_flow.pdf HCC1239.sorted.filtered.vaf.clonevol.txt

hdng commented 7 years ago

Your clusters look okay, although they still need clean-up. I didn't notice that you use PyClone. Did you take the PyClone CCF and divide it by two to get the equivalent CN-scaled VAF? There is a trick you need to do with PyClone to get the best results with ClonEvol. Also see https://github.com/hdng/clonevol/issues/4.
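A rough Python sketch of the divide-by-two step, for illustration only. It assumes the PyClone CCF is a fraction in [0, 1] and that ClonEvol is being given VAFs as percentages (the log in this thread notes that VAFs were divided by 100 to convert from percentage to proportion). The function name and the example values are hypothetical, and the linked issue may describe further steps that this sketch does not cover.

```python
def ccf_to_vaf_percent(ccf):
    """Convert a CCF in [0, 1] to an equivalent CN-scaled VAF in percent.

    Hypothetical helper: halves the CCF, as suggested in the comment above,
    then scales to a percentage.
    """
    if not 0.0 <= ccf <= 1.0:
        raise ValueError(f"CCF must be within [0, 1], got {ccf}")
    return ccf / 2.0 * 100.0


# Made-up per-sample CCFs for one cluster, purely for illustration.
example_ccfs = {"sample_A": 1.0, "sample_B": 0.4, "sample_C": 0.0}
example_vafs = {name: ccf_to_vaf_percent(ccf) for name, ccf in example_ccfs.items()}
print(example_vafs)
```

A CCF of 1.0 (present in every tumor cell) maps to a VAF of 50%, which is what a clonal heterozygous variant in a copy-neutral region would be expected to show.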

ysbioinfo commented 7 years ago

I just used the original VAF calculated by samtools. I will try this trick to see if it works. Thanks!

hdng / clonevol

too many models across samples #6