Open amcpherson opened 7 years ago
Hi, thank you for your question.
As a preprocessing step, THetA clusters intervals together into "meta-intervals" that are estimated to have the same copy number profile. It does this iteratively, first clustering the intervals within each chromosome, then clustering across the entire genome. The situation you're running into indicates that all of the input intervals were assigned to a single cluster. So in your case, either your data doesn't have any large intervals with CNAs, or we are overclustering and merging intervals that should be distinct (or something else is going wrong).
Without the allele counts we do not cluster, so it makes sense that you do not run into this issue when running without them. However, if there are no copy number aberrations, the resulting solution may simply be overfitting the data.
We can visually inspect what is happening in your case. The clustering outputs a file {PREFIX}_assignment.png. This is a plot of read-depth ratio (x-axis) against average allele frequency deviation, i.e. distance from 0.5 (y-axis). Each point is an interval, and its color indicates the cluster that the interval was assigned to. I attached an example below:
If it looks like there should indeed be more than one cluster, please let me know and I can look into why clustering is failing to produce reasonable results.
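If you want to sanity-check the numbers behind that plot yourself, here is a rough Python sketch of how the two axes could be approximated from per-interval counts. This is not THetA's actual code: the function names, the input layout, and the choice to normalize by total read counts and to average per-SNP deviations are all assumptions made for illustration.

```python
def read_depth_ratio(tumor_count, normal_count, tumor_total, normal_total):
    """Tumor/normal read-depth ratio for one interval.

    Assumption: each count is normalized by the total reads in its sample.
    """
    return (tumor_count / tumor_total) / (normal_count / normal_total)


def baf_deviation(allele_counts):
    """Mean |0.5 - B-allele frequency| over the SNPs in one interval.

    allele_counts is a list of (ref_count, alt_count) pairs (hypothetical
    layout). SNPs with no coverage are skipped; returns None if no SNP
    in the interval has coverage.
    """
    devs = [
        abs(0.5 - alt / (ref + alt))
        for ref, alt in allele_counts
        if ref + alt > 0
    ]
    return sum(devs) / len(devs) if devs else None


# Made-up example intervals, purely illustrative.
intervals = {
    "interval_A": {"tumor": 1000, "normal": 1000, "snps": [(50, 50), (48, 52)]},
    "interval_B": {"tumor": 1500, "normal": 1000, "snps": [(33, 67), (70, 30)]},
}
tumor_total = sum(iv["tumor"] for iv in intervals.values())
normal_total = sum(iv["normal"] for iv in intervals.values())

for name, iv in intervals.items():
    x = read_depth_ratio(iv["tumor"], iv["normal"], tumor_total, normal_total)
    y = baf_deviation(iv["snps"])
    print(name, x, y)
```

If every interval lands in roughly the same spot (ratio near 1, deviation near 0), a single cluster is the expected outcome; clearly separated groups of points would suggest the clustering is merging too much.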
On some datasets THetA appears to exclude most intervals and fails as a result. Output:
Note that this only occurs if I am providing allele counts. With just the intervals it works fine.
Any suggestions or debug output to look at would be helpful.