raphael-group / THetA

Tumor Heterogeneity Analysis (THetA) and THetA2 are algorithms that estimate the tumor purity and clonal/subclonal copy number aberrations directly from high-throughput DNA sequencing data. This repository includes the updated algorithm, called THetA2.
http://compbio.cs.brown.edu/projects/theta/
71 stars 33 forks source link

AssertionError #7

Open SNRNS opened 8 years ago

SNRNS commented 8 years ago

I successfully analysed the example files with RunTheta, however when I tried to analyse my sample it produces an error. I have the normal and primary tumour whole exome sequencing bam files. I created the Theta input file and the snp.withCounts files, one for normal and one for tumour files following the instructions. However when I run in the THetA-master directory:

bin/RunTHetA normal_primary.input --TUMOR_FILE BAF_OUTPUT/primary_snp.withCounts --NORMAL_FILE BAF_OUTPUT/normal_snp.withCounts

I get:

=================================================
Arguments are:
       Query File: normal_primary.input
       k: 3
       tau: 2
       Output Directory: ./
       Output Prefix: normal_primary
       Num Processes: 1
       Graph extension: .pdf

Valid sample for THetA analysis:
       Ratio Deviation: 0.1
       Min Fraction of Genome Aberrated: 0.05
       Program WILL cluster intervals.
=================================================
Reading in query file...
Frac with potential copy numbers: 0.274523486464
Reading SNP file at BAF_OUTPUT/primary_snp.withCounts
Reading SNP file at BAF_OUTPUT/normal_snp.withCounts
Reading interval file at normal_primary.input
Calculating BAFs
Determining heterozygosity.
Calculating BAFs.
First round of clustering...
Traceback (most recent call last):
 File "/PATH/TO/.../THetA-master/bin/../python/RunTHetA.py", line 504, in <module>

   main()

 File "/PATH/TO/.../THetA-master/bin/../python/RunTHetA.py", line 282, in main

   resultsfile2, boundsfile2 = run_fixed_N(2, args, intervals)

 File "/PATH/TO/.../THetA-master/bin/../python/RunTHetA.py", line 319, in run_fixed_N

   lengths, tumorCounts, normalCounts, m, upper_bounds, lower_bounds, clusterAssignments, numClusters, clusterMeans, normalInd = clustering_BAF(n, intervals=intervals, missingData=missingData, prefix=prefix, outdir=directory, numProcesses=num_processes)

 File "/PATH/TO/.../THetA-master/python/ClusteringBAF.py", line 90, in clustering_BAF

   metaData = generate_meta_data(intervals, byChrm, numProcesses, sampleName, generateData, outdir)

 File "/PATH/TO/.../THetA-master/python/ClusteringBAF.py", line 143, in generate_meta_data

   results = p.map(cluster_wrapper, zip(intervals, linearizedSampleName, linearizedChrm, linearizedGenerateData))

 File "/PATH/TO/miniconda2/envs/theta2/lib/python2.7/multiprocessing/pool.py", line 251, in map

   return self.map_async(func, iterable, chunksize).get()

 File "/PATH/TO/miniconda2/envs/theta2/lib/python2.7/multiprocessing/pool.py", line 567, in get

   raise self._value

AssertionError

The normal_snp.withCounts file looks like this:

head -20 BAF_OUTPUT/normal_snp.withCounts

#Chrm   pos     A       C       G       T       total   refCount        mutCount
1       564621  0       0       0       0       0       0       0
1       564773  0       0       0       0       0       0       0
1       567753  0       0       0       0       0       0       0
1       721290  0       0       3       0       3       3       0
1       740857  0       0       0       0       0       0       0
1       752566  0       0       0       0       0       0       0
1       761732  0       0       0       0       0       0       0
1       765269  0       0       0       0       0       0       0
1       777122  0       0       0       0       0       0       0
1       785989  0       0       0       0       0       0       0
1       792480  0       0       0       0       0       0       0
1       798959  0       0       0       0       0       0       0
1       799463  0       0       0       0       0       0       0
1       888659  0       0       0       47      47      47      0
1       918573  0       0       0       0       0       0       0
1       926431  0       0       0       4       4       0       4
1       947034  0       0       0       0       0       0       0
1       949608  83      0       0       0       83      0       83

The primary_snp.withCounts looks like this:

head -20 BAF_OUTPUT/primary_snp.withCounts

#Chrm   pos     A       C       G       T       total   refCount        mutCount
1       564621  0       0       0       0       0       0       0
1       564773  0       0       0       0       0       0       0
1       567753  0       0       0       0       0       0       0
1       721290  0       0       0       0       0       0       0
1       740857  0       0       0       0       0       0       0
1       752566  0       0       0       0       0       0       0
1       761732  0       0       0       0       0       0       0
1       765269  0       0       0       0       0       0       0
1       777122  0       0       0       0       0       0       0
1       785989  0       0       0       0       0       0       0
1       792480  0       0       0       1       1       0       1
1       798959  0       0       0       0       0       0       0
1       799463  0       0       0       0       0       0       0
1       888659  0       0       0       165     165     165     0
1       918573  1       0       0       0       1       1       0
1       926431  0       0       0       0       0       0       0
1       947034  2       0       0       0       2       0       2
1       949608  191     0       0       0       191     0       191
1       990417  0       0       0       125     125     125     0

The normal_primary.input looks like this:

head normal_primary.input

#ID     chrm    start   end     tumorCount      normalCount
start_1_1:end_1_20000           1       1       20000   531     215
start_1_20001:end_1_40000       1       20001   40000   0       0
start_1_40001:end_1_60000       1       40001   60000   0       0
start_1_60001:end_1_80000       1       60001   80000   242     90
start_1_80001:end_1_100000      1       80001   100000  0       0
start_1_100001:end_1_120000     1       100001  120000  0       0
start_1_120001:end_1_140000     1       120001  140000  832     493
start_1_140001:end_1_160000     1       140001  160000  0       0
start_1_160001:end_1_180000     1       160001  180000  0       0

I would appreciate very much any help provided.

Many thanks, Alejandro

ChiragNepal commented 8 years ago

Hi Alejandro,

Can you please share how you solved the problem?

I am getting the following.

Arguments are: Query File: Test2Segment k: 3 tau: 2 Output Directory: ./ Output Prefix: Test2Segment Num Processes: 6 Graph extension: .pdf

Valid sample for THetA analysis: Ratio Deviation: 0.1 Min Fraction of Genome Aberrated: 0.05

Reading in query file... Traceback (most recent call last): File "/usr/local/src/THetA/python/RunTHetA.py", line 504, in main() File "/usr/local/src/THetA/python/RunTHetA.py", line 275, in main intervals = read_interval_file(args[0]) File "/usr/local/src/THetA/python/FileIO.py", line 400, in read_interval_file tumor_counts.append(int(line[4])) ValueError: invalid literal for int() with base 10: '38.3'

Thanks !

SNRNS commented 8 years ago

Hi @ChiragNepal,

I couldn't solve the problem. I used other tools in the end. For relative copy number alterations I used CopywriteR, for tumour purity and absolute copy numbers I used ABSOLUTE, and finally to get the cellular prevalence of the mutations I used PyClone.

I hope that helps.

Best, Alejandro

dpmccabe commented 8 years ago

I'm also having this problem. Removing multiprocessing and changing p.map to map in ClusteringBAF.py results in this error instead:

First round of clustering...
Traceback (most recent call last):
  File "/xchip/scarter/dmccabe/THetA/bin/../python/RunTHetA.py", line 504, in <module>
    main()
  File "/xchip/scarter/dmccabe/THetA/bin/../python/RunTHetA.py", line 282, in main
    resultsfile2, boundsfile2 = run_fixed_N(2, args, intervals)
  File "/xchip/scarter/dmccabe/THetA/bin/../python/RunTHetA.py", line 319, in run_fixed_N
    lengths, tumorCounts, normalCounts, m, upper_bounds, lower_bounds, clusterAssignments, numClusters, clusterMeans, normalInd = clustering_BAF(n, intervals=intervals, missingData=missingData, prefix=prefix, outdir=directory, numProcesses=num_processes)
  File "/xchip/scarter/dmccabe/THetA/python/ClusteringBAF.py", line 90, in clustering_BAF
    metaData = generate_meta_data(intervals, byChrm, numProcesses, sampleName, generateData, outdir)
  File "/xchip/scarter/dmccabe/THetA/python/ClusteringBAF.py", line 146, in generate_meta_data
    results = map(cluster_wrapper, zip(intervals, linearizedSampleName, linearizedChrm, linearizedGenerateData))
  File "/xchip/scarter/dmccabe/THetA/python/ClusteringBAF.py", line 203, in cluster_wrapper
    mus, sigmas, clusterAssignments, numPoints, numClusters = cluster(binnedChrm, sampleName, chrm=chrm)
  File "/xchip/scarter/dmccabe/THetA/python/ClusteringBAF.py", line 259, in cluster
    Data = format_data(data, sampleName, chrm)
  File "/xchip/scarter/dmccabe/THetA/python/ClusteringBAF.py", line 302, in format_data
    Data = bnpy.data.XData(X=npArray)
  File "/xchip/scarter/dmccabe/bnpy-dev/bnpy/data/XData.py", line 97, in __init__
    self._check_dims()
  File "/xchip/scarter/dmccabe/bnpy-dev/bnpy/data/XData.py", line 121, in _check_dims
    assert self.X.flags.owndata
AssertionError

I don't have any problems running the example, for some reason. Just my own data.

BaptisteAmeline commented 6 years ago

It has been suggested in another forum that at least 1 SNP has to be present in every single interval_count. I assume that the example works perfectly because the number of interval_count is very limited compared to the number of SNP. Can someone confirm this hypothesis ? Is THetA still supported by the developper ?

If it is confirmed, as a consequence, if you want to be more resolutive in terms of interval_count (eg : in case of WGS presenting chromothripsis/chromoplexy events) it will be almost impossible to have 1 SNP per interval.

simozacca commented 6 years ago

Dear @BaptisteAmeline,

As you can read in the manual (https://github.com/raphael-group/THetA/blob/master/doc/MANUAL.txt) the usage of SNPs is part of a recommended but optional step. As far as I am aware of, there is no requirement about having a SNP per each interval_count.

However, to clarify your point (since almost all the allele-specific copy-number callers consider only segments having at least one heterozygous SNP), germinal SNPs are considered and their location is with respect to the reference genome which you used to align the reads. Therefore, catastrophic events as chromothripsis/chromoplexy do not affect in any way the position or presence of germinal-heterozygous SNPs in the reference genome. In the human-reference genome you may expect 1 SNP every 1k bases, and as such you should expect to have many SNPs in your intervals considering standard-bin sizes. The allele counts from these SNPs is used to infer the allele-specific copy numbers of each segment in the tumor sample due the effects of any aberration (including chromothripsis/chromoplexy). E.g. heterozygous SNPs should have a proportion of the alleles of 50%, if one of the alleles is lost, the expected proportions is 0%, assuming this occurs in all cells of the sample. You can search for B-allele frequency (BAF) to get to know more about this.