raphael-group / THetA

Tumor Heterogeneity Analysis (THetA) and THetA2 are algorithms that estimate the tumor purity and clonal/subclonal copy number aberrations directly from high-throughput DNA sequencing data. This repository includes the updated algorithm, called THetA2.
http://compbio.cs.brown.edu/projects/theta/
70 stars 33 forks source link

THetA cannot find many intervals for analysis #15

Open amcpherson opened 7 years ago

amcpherson commented 7 years ago

On some datasets THetA appears to exclude most intervals and fails as a result. Output:

    =================================================
    Arguments are:
            Query File: /genesis/shahlab/amcpherson/remixt_data/tmp/comparison/comparison_3/tmp/sim_id/test_proportion_subclonal_2_2/tool_name/theta/create_tool_workflow/sample_id/tumour/theta_work/theta_input.seg
            k: 3
            tau: 2
            Output Directory: /genesis/shahlab/amcpherson/remixt_data/tmp/comparison/comparison_3/tmp/sim_id/test_proportion_subclonal_2_2/tool_name/theta/create_tool_workflow/sample_id/tumour/theta_work
            Output Prefix: theta_results
            Num Processes: 1
            Graph extension: .pdf
            Force: True

    Valid sample for THetA analysis:
            Ratio Deviation: 0.1
            Min Fraction of Genome Aberrated: 0.05
            Program WILL cluster intervals.
    =================================================
    Reading in query file...
    Frac with potential copy numbers: 0.364559375662
    Reading SNP file at /genesis/shahlab/amcpherson/remixt_data/tmp/comparison/comparison_3/tmp/sim_id/test_proportion_subclonal_2_2/tool_name/theta/create_tool_workflow/sample_id/tumour/theta_work/normal_alleles.tsv
    Reading SNP file at /genesis/shahlab/amcpherson/remixt_data/tmp/comparison/comparison_3/tmp/sim_id/test_proportion_subclonal_2_2/tool_name/theta/create_tool_workflow/sample_id/tumour/theta_work/tumour_alleles.tsv
    Reading interval file at /genesis/shahlab/amcpherson/remixt_data/tmp/comparison/comparison_3/tmp/sim_id/test_proportion_subclonal_2_2/tool_name/theta/create_tool_workflow/sample_id/tumour/theta_work/theta_input.seg
    Calculating BAFs
    Determining heterozygosity.
    Calculating BAFs.
    First round of clustering...
    Begin meta clustering...
    Classifying clusters...
    Plotting classifications...
    Determining copy number bounds...
    Plotting clusters...
    Selecting meta-intervals...
            Selected 1 intervals for analysis.
    Preprocessing data...
    Writing bounds file to /genesis/shahlab/amcpherson/remixt_data/tmp/comparison/comparison_3/tmp/sim_id/test_proportion_subclonal_2_2/tool_name/theta/create_tool_workflow/sample_id/tumour/theta_work/theta_results.n2.withBounds
    Estimating time...
            Estimated Total Time: 0 second(s)
    Performing optimization...
    Writing results file to /genesis/shahlab/amcpherson/remixt_data/tmp/comparison/comparison_3/tmp/sim_id/test_proportion_subclonal_2_2/tool_name/theta/create_tool_workflow/sample_id/tumour/theta_work/theta_results.n2.results
    Plotting results as a .pdf...
    Writing script to run N=3 to  /genesis/shahlab/amcpherson/remixt_data/tmp/comparison/comparison_3/tmp/sim_id/test_proportion_subclonal_2_2/tool_name/theta/create_tool_workflow/sample_id/tumour/theta_work/theta_results.RunN3.bash
    Frac with potential copy numbers: 0.364559375662
    Reading SNP file at /genesis/shahlab/amcpherson/remixt_data/tmp/comparison/comparison_3/tmp/sim_id/test_proportion_subclonal_2_2/tool_name/theta/create_tool_workflow/sample_id/tumour/theta_work/normal_alleles.tsv
    Reading SNP file at /genesis/shahlab/amcpherson/remixt_data/tmp/comparison/comparison_3/tmp/sim_id/test_proportion_subclonal_2_2/tool_name/theta/create_tool_workflow/sample_id/tumour/theta_work/tumour_alleles.tsv
    Reading interval file at /genesis/shahlab/amcpherson/remixt_data/tmp/comparison/comparison_3/tmp/sim_id/test_proportion_subclonal_2_2/tool_name/theta/create_tool_workflow/sample_id/tumour/theta_work/theta_input.seg
    Calculating BAFs
    Determining heterozygosity.
    Calculating BAFs.
    First round of clustering...
    Begin meta clustering...
    Classifying clusters...
    Plotting classifications...
    Determining copy number bounds...
    Plotting clusters...
    Selecting meta-intervals...
            Selected 0 intervals for analysis.
    --- stderr ---
    Traceback (most recent call last):
      File "/genesis/shahlab/amcpherson/anaconda/envs/remixtman/src/theta/bin/../python/RunTHetA.py", line 504, in <module>
        main()
      File "/genesis/shahlab/amcpherson/anaconda/envs/remixtman/src/theta/bin/../python/RunTHetA.py", line 284, in main
        resultsfile3, boundsfile3 = run_fixed_N(3, args, intervals, resultsfile2)
      File "/genesis/shahlab/amcpherson/anaconda/envs/remixtman/src/theta/bin/../python/RunTHetA.py", line 366, in run_fixed_N
        select_meta_intervals_n3(lengths, tumorCounts, normCounts, m, k, force, num_intervals, cluster_scores, lower_bounds, upper_bounds)
      File "/shahlab/amcpherson/anaconda/pkgs/theta-0.7-1/src/theta/python/SelectIntervals.py", line 195, in select_meta_intervals_n3
        return [column(topLines, i) for i in range(len(topLines[0])-1)]
    IndexError: list index out of range

Note that this only occurs if I am providing allele counts. With just the intervals it works fine.

Any suggestions or debug output to look at would be helpful.

gsatas commented 7 years ago

Hi, thank you for your question.

As a preprocessing step, THetA clusters intervals together into "meta-intervals" that are estimated to have the same copy number profile. It does this iteratively, first by clustering intervals for each chromosome, then across the entire genome. The situation that you're running into indicates that all the input intervals had only one cluster. Thus, in your case, either your data doesn't have any large intervals with CNAs, or we are overclustering and merging intervals that should be distinct (or something else is going wrong).

Without the allele counts, we do not cluster so it would make sense that you do not run into this issue running without them, but if there were no copy number abberations, then the resulting solution may simply overfit the data.

We can visually inspect to see what is happening in your case. The clustering outputs a file {PREFIX}_assignment.png. This is a plot of read-depth ratio (x-axis) and average allele frequency deviation (distance from 0.5) on the y-axis. Each point is an interval. The color indicates the cluster that each mutation was assigned to. I attached an example below:

example_assignment

If it looks like there should indeed be more than one cluster, please let me know and I can look into why clustering is failing to produce reasonable results.