virajbdeshpande / AmpliconArchitect

AmpliconArchitect (AA) is a tool to identify one or more connected genomic regions which have simultaneous copy number amplification and elucidates the architecture of the amplicon. In the current version, AA takes as input next generation sequencing reads (paired-end Illumina reads) mapped to the hg19/GRCh37 reference sequence and one or more regions of interest. Please "watch" this repository for improvements in runtime, accuracy and annotations for GRCh38 human reference genome coming up soon.
Other
131 stars 42 forks source link

error !! #31

Open feihongloveworld opened 6 years ago

feihongloveworld commented 6 years ago

hi sir: i got a error . can you help me?

----------------------------

head WGS_A_D_5715_T.somatic.CNV.txt.bed

chr start end cn

12 66449999 66452000 8 18 19847999 20060000 13 18 20059999 20239000 11 18 20238999 20243000 8 18 20242999 20765000 8 20 33345999 33436000 6

----------------------------

python amplified_intervals.py --bed WGS_A_D_5715_T.somatic.CNV.txt.bed Traceback (most recent call last): File "amplified_intervals.py", line 58, in rdList = hg.interval_list([r for r in rdList0 if float(r.info[1]) > GAIN ]) IndexError: list index out of range

virajbdeshpande commented 6 years ago

Hello,

Thanks for reporting the issue. The amplified_intervals.py script assumes the bed file is the format generated by the tool ReadDepth where the expected copy number is reported in column number 5. The bed file you are using contains the copy number if column.

As a result, you will need to add a column to your bed file, the value in column 4 is not really used, so it does not matter what value to enter.

So you can transform your bed file to the following: 12 66449999 66452000 1 8 18 19847999 20060000 1 13 18 20059999 20239000 1 11 18 20238999 20243000 1 8 18 20242999 20765000 1 8 20 33345999 33436000 1 6

feihongloveworld commented 6 years ago

hi sir: dose the “amplified_intervals.py” script only work on hg19? so chr column of my cnv.bed do not have string of "chr" will be error ?

virajbdeshpande commented 6 years ago

The script should work without the 'chr' string, but there is a catch.

It filters out blacklisted intervals provided in the data_repo. I have not yet added the --ref option to this script, so it defaults the hg19 reference and will not be able to blacklist regions.

In your case, you are only looking at somatic amplifications which helps filter out the blacklisted regions. The example bed file you provided above does not contain any blacklisted regions, so the absence of the reference annotation will cause a problem.

ipstone commented 5 years ago

Hi @virajbdeshpande , is it possible to provide an example bed file in the repo - to demonstrate what is needed for AA to run properly? A short description of how to generate such as BED file will also be super helpful!

Another related question is for the BED file, column 5 - expected copy number: my samples don't have many high copy number regions, the highest copy number region only has 3. Will this be enough for AA tu run? - also the regions/segment are provided from facets call, which can be a long interval.

Thanks

virajbdeshpande commented 5 years ago

Hello Isaac,

That is a good suggestion. I can try to include an example as a tutorial for running AA. The bed file is essentially just a list of seed regions. Since you only have 3 regions, you may try to directly use these as the input bed file to AA. I would recommend checking that the entire chromosome or arm of the chromosome is not aneuploid. Alternatively, I have included a script amplified_intervals.py (see README and try the --help option) which can perform additional processing but requires the BED file to have the CN listed in the 5th column. Since you are targetting regions of low amplification (CN >= 3), you can run amplified_intervals.py with the option --gain 3.

I have tested AA on seed intervals with CN >= 5 and size > 100kbp up to ~10Mbp. AA can still run on CN = 3, however, the accuracy may vary from sample to sample. For long seed intervals, I would recommend downsampling the BAM file to 5x-10x coverage.