virajbdeshpande / AmpliconArchitect

AmpliconArchitect (AA) is a tool to identify one or more connected genomic regions which have simultaneous copy number amplification and elucidates the architecture of the amplicon. In the current version, AA takes as input next generation sequencing reads (paired-end Illumina reads) mapped to the hg19/GRCh37 reference sequence and one or more regions of interest. Please "watch" this repository for improvements in runtime, accuracy and annotations for GRCh38 human reference genome coming up soon.
Other
134 stars 42 forks source link

A question about running amplified_intervals.py #78

Open tanzhengtang opened 4 years ago

tanzhengtang commented 4 years ago

Dear Dr.Viraj

I am trying to use AA for following your work to produce ecDNA data.When I run the amplified_intervals.py to produce the bed file which AA need,bad things happend.

my code is: python2 ./amplified_intervals.py --bed ~/ReadDepth/output/alts.dat --out ~/ecDNA/GBM39_WGS --bam ~/ecDNA/GBM39_WGS/GBM39_WGS-sorted.bam

the running process: read length 150.20154389117843 275.8876600787537 88.10978540948672 540.2170163072138 0.961664713508 coverage stats (50.03213427015154, 54.35358796278235, 18.043614780891268, 51.56919673597126, 53.720193309366884, 20.917019500264708, 150.20154389117843, 275.8876600787537, 88.10978540948672, 11.558303850293555, 540.2170163072138, 2.161444863117179, 0.9616647135081424) 979 pair support 2.161444863117179 Traceback (most recent call last): File "./amplified_intervals.py", line 86, in rdList = hg.interval_list([r for r in rdList if float(r.info[1]) > GAIN + 2 * max(1.0, bamFileb2b.median_coverage(refi=r)[0] / bamFileb2b.median_coverage()[0]) - 2]) File "/home/tang/ecDNA/AmpliconArchitect/src/bam_to_breakpoint.py", line 244, in median_coverage ii = hg.interval(hg.chrPos(chroffset)[0], hg.chrPos(chroffset)[1], hg.chrPos(chroffset)[1] + sumchrLen) TypeError: 'NoneType' object has no attribute 'getitem'

What should I do to reslove this question?

Thank so much!

Tangzheng

jluebeck commented 4 years ago

Hi Tangzheng,

At the moment, the bed file input for amplified_intervals.py requires the following format: chr start stop arbitrary_field estimated_CN Which is the output format for readDepth. A bed file which doesn't conform to that will cause the error above. We are about to release changes that remove this issue, and instead only require that the estimated_CN be the last column. You can find those changes on my fork of AA.

Best regards, Jens

tanzhengtang commented 4 years ago

Hi Tangzheng,

At the moment, the bed file input for amplified_intervals.py requires the following format: chr start stop arbitrary_field estimated_CN Which is the output format for readDepth. A bed file which doesn't conform to that will cause the error above. We are about to release changes that remove this issue, and instead only require that the estimated_CN be the last column. You can find those changes on my fork of AA.

Best regards, Jens

Thanks! I will check the format and try again.

tanzhengtang commented 4 years ago

Hi Tangzheng, At the moment, the bed file input for amplified_intervals.py requires the following format: chr start stop arbitrary_field estimated_CN Which is the output format for readDepth. A bed file which doesn't conform to that will cause the error above. We are about to release changes that remove this issue, and instead only require that the estimated_CN be the last column. You can find those changes on my fork of AA. Best regards, Jens

Thanks! I will check the format and try again.

Sorry to bother you again.When I check my alt.dat file,I find it seems to be right.

my alt.dat:

(ecDNA) ~/GBM39-CNV/output ᐅ head alts.dat 1 1 10500 5 23.7457924856368 1 10501 66400 99 0.550113037536563 1 66401 98450 93 0.0702479531675755 1 98451 551150 220 0.877274354649975 1 551151 571550 57 3.99556631705472 1 571551 695900 120 0.936986391569142 1 724101 727000 29 28.2582457118287 1 749301 750100 5 0.397657972390613 1 814501 822800 83 3.38815529113611 1 822801 826000 32 8.63921363325168

emmmmm,it really confuse me.

virajbdeshpande commented 4 years ago

Hello @groveswallow, Does the BAM file to have chromosome names in the format "1", "2", "3", .... or "chr1", "chr2", ..? If it is the first case, then can you try running the amplified_intervals.py from the latest commit with the option "--ref GRCh37"?

tanzhengtang commented 4 years ago

Hello @groveswallow, Does the BAM file to have chromosome names in the format "1", "2", "3", .... or "chr1", "chr2", ..? If it is the first case, then can you try running the amplified_intervals.py from the latest commit with the option "--ref GRCh37"?

Hello Dr.Viraj, Thank you for your reply. I have checked my bam file,It seems to be the second case. The head of GBM39_WGS-sorted.bam:

@HD VN:1.6 SO:coordinate @SQ SN:chr1 LN:249250621 @SQ SN:chr2 LN:243199373 @SQ SN:chr3 LN:198022430 @SQ SN:chr4 LN:191154276 @SQ SN:chr5 LN:180915260 @SQ SN:chr6 LN:171115067 @SQ SN:chr7 LN:159138663 @SQ SN:chr8 LN:146364022 @SQ SN:chr9 LN:141213431 @SQ SN:chr10 LN:135534747 @SQ SN:chr11 LN:135006516 @SQ SN:chr12 LN:133851895 @SQ SN:chr13 LN:115169878 @SQ SN:chr14 LN:107349540 @SQ SN:chr15 LN:102531392 @SQ SN:chr16 LN:90354753 @SQ SN:chr17 LN:81195210 @SQ SN:chr18 LN:78077248 @SQ SN:chr19 LN:59128983 @SQ SN:chr20 LN:63025520 @SQ SN:chr21 LN:48129895 @SQ SN:chr22 LN:51304566 @SQ SN:chrX LN:155270560 @SQ SN:chrY LN:59373566 @SQ SN:chrM LN:16571 @PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -t 8 /home/tang/ecDNA/AmpliconArchitect/data_repo/hg19/hg19full.fa SRR8236745_1.fastq.gz SRR8236745_2. @PG ID:samtools PN:samtools PP:bwa VN:1.10 CL:samtools view -S -b - @PG ID:samtools.1 PN:samtools PP:samtools VN:1.10 CL:samtools sort -@ 4 -m 8G -O bam -o GBM39_WGS-sorted.bam GBM39_WGS.bam @PG ID:samtools.2 PN:samtools PP:samtools.1 VN:1.11 CL:samtools view -h GBM39_WGS-sorted.bam SRR8236745.132135329 81 chr1 9995 0 92S54M5S chr5 10058 0 CCGAACCCGAACCCGAACCCGAACCCGAACCCTTACCCTAACACGAAGCCTAACCCT SRR8236745.9336411 83 chr1 9996 0 51S100M = 10010 -86 CCGCCCCCACCCCCGAACGCGACCGGTCCCCGGAGGTCTGACGCTAGCCCTTCCGATATCCCTAA SRR8236745.15418352 83 chr1 9996 0 58S93M = 10004 -85 CCCTCACCCTAAGCCTAACGCGATCGGGAAGGGTTACTCGATTGCTGACGCGTTCTCTTCCGATA

virajbdeshpande commented 4 years ago

It seems like an issue with "chr" is removed from the chromosome names by ReadDepth. You have 2 options - change the alts.dat file to add include "chr" in the chromosome names. Use PrepareAA script with Canvas: https://github.com/jluebeck/PrepareAA

tanzhengtang commented 4 years ago

It seems like an issue with "chr" is removed from the chromosome names by ReadDepth. You have 2 options - change the alts.dat file to add include "chr" in the chromosome names. Use PrepareAA script with Canvas: https://github.com/jluebeck/PrepareAA

Thank you!I will try them soonly.

tanzhengtang commented 4 years ago

It seems like an issue with "chr" is removed from the chromosome names by ReadDepth. You have 2 options - change the alts.dat file to add include "chr" in the chromosome names. Use PrepareAA script with Canvas: https://github.com/jluebeck/PrepareAA

Hi Dr.Viraj,I use PrepareAA script to get a bed file,and then produce ecDNA data,but i still get something seem not good.

my code: python ~/ecDNA/AmpliconArchitect/src/AmpliconArchitect.py --bed ~/PrepareAA/GBM39_AA_CNV_SEEDS.bed --bam ~/ecDNA/GBM39_WGS/GBM39_WGS-sorted-adhead-markdup.bam --out gbm39-ec --downsample 0

GBM39_AA_CNV_SEEDS.bed: chr7 54829879 56119660 88.9906071218 /home/tang/PrepareAA/cnvkit_output/GBM39_WGS-sorted-adhead-markdup_CNV_GAIN.bed chr7 69671291 70041249 6.15747984319 /home/tang/PrepareAA/cnvkit_output/GBM39_WGS-sorted-adhead-markdup_CNV_GAIN.bed chr7 87260878 87640883 6.48541996688 /home/tang/PrepareAA/cnvkit_output/GBM39_WGS-sorted-adhead-markdup_CNV_GAIN.bed chr8 128091498 130236500 8.28987300028 /home/tang/PrepareAA/cnvkit_output/GBM39_WGS-sorted-adhead-markdup_CNV_GAIN.bed

running process : [root:INFO] #TIME 0.308 Loading libraries and reference annotations for: hg19 [root:INFO] #TIME 2.173 Initiating bam_to_breakpoint object for: /home/tang/ecDNA/GBM39_WGS/GBM39_WGS-sorted-adhead-markdup.bam [root:INFO] #TIME 2.173 Exploring interval: chr7 54829879 56119660 88.9906071218 /home/tang/PrepareAA/cnvkit_output/GBM39_WGS-sorted-adhead-markdup_CNV_GAIN.bed /home/tang/miniconda3/envs/AA/lib/python2.7/site-packages/numpy/core/fromnumeric.py:3367: RuntimeWarning: Degrees of freedom <= 0 for slice **kwargs) /home/tang/miniconda3/envs/AA/lib/python2.7/site-packages/numpy/core/_methods.py:132: RuntimeWarning: invalid value encountered in double_scalars ret = ret.dtype.type(ret / rcount) [root:WARNING] dnlists do not match 396 394 [root:WARNING] dnlist0: chr7:55226402-->chr7:55226496+ 6 [root:WARNING] dnlist0: chr7:55174171-->chr7:55174632+ 17

Is the process is right?If not ,what should I do?

Another one question,when I run PrepareAA script to produce the bed which AA needs,actually it did not directly give me the bed file,it just give me the GBM39_WGS-sorted-adhead-markdup_CNV_GAIN.bed,and the process shut down before use amplified_intervals.py to get GBM39_AA_CNV_SEEDS.bed.

my code: ./PrepareAA.py -s GBM39 -t 8 --cnvkit_dir ~/cnvkit/cnvkit.py --sorted_bam ~/ecDNA/GBM39_WGS/GBM39_WGS-sorted-adhead-markdup.bam --downsample 5 --ref hg19 -o ~/PrepareAA --python3_path /home/tang/miniconda3/envs/ecDNA/bin/python

Runing process: Wrote /home/tang/PrepareAA/cnvkit_output/GBM39_WGS-sorted-adhead-markdup.call.cns with 451 regions Segments do not cover all bins (563846), only 563844 of them Significant hits in 0/563844 bins (0%) Wrote /home/tang/PrepareAA/cnvkit_output/GBM39_WGS-sorted-adhead-markdup.bintest.cns with 0 regions Dropped 2053 rows with missing log2 values Segmenting with method 'cbs', significance threshold 0.0001, in 8 processes Smoothing overshot at 3 / 4882 indices: (-1.330386417954986, 3.2225760897633946) vs. original (-2.3366599999999997, 3.0866700000000002) Smoothing overshot at 1 / 5152 indices: (-10.927016779626516, 0.6788878164218103) vs. original (-25.2941, 0.5432140000000001) Wrote /home/tang/PrepareAA/cnvkit_output/GBM39_WGS-sorted-adhead-markdup.cns with 1006 regions usage: amplified_intervals.py [-h] --bed FILE [--out FILE] [--bam FILE] [--gain GAIN] [--cnsize_min CNSIZE_MIN] amplified_intervals.py: error: unrecognized arguments: --ref hg19 Running PrepareAA on sample: GBM39 /home/tang/miniconda3/envs/ecDNA/bin/python /home/tang/cnvkit/cnvkit.py batch -m wgs -y -r /home/tang/PrepareAA/data_repo/hg19/hg19_cnvkit_filtered_ref.cnn -p 8 -d /home/tang/PrepareAA/cnvkit_output/ /home/tang/ecDNA/GBM39_WGS/GBM39_WGS-sorted-adhead-markdup.bam /home/tang/miniconda3/envs/ecDNA/bin/python /home/tang/cnvkit/cnvkit.py segment /home/tang/PrepareAA/cnvkit_output/GBM39_WGS-sorted-adhead-markdup.cnr -p 8 -o /home/tang/PrepareAA/cnvkit_output/GBM39_WGS-sorted-adhead-markdup.cns Running amplified_intervals python /home/tang/ecDNA/AmpliconArchitect/src/amplified_intervals.py --ref hg19 --bed /home/tang/PrepareAA/cnvkit_output/GBM39_WGS-sorted-adhead-markdup_CNV_GAIN.bed --bam /home/tang/ecDNA/GBM39_WGS/GBM39_WGS-sorted-adhead-markdup.bam --gain 4.999999 --cnsize_min 50000 --out /home/tang/PrepareAA/GBM39_AA_CNV_SEEDS Completed

It seems have a bug when running the script beacuse this "amplified_intervals.py: error: unrecognized arguments: --ref hg19".

virajbdeshpande commented 4 years ago

Are you using the latest commit from AA? If not can you do a git pull and try again?

tanzhengtang commented 4 years ago

Are you using the latest commit from AA? If not can you do a git pull and try again?

Ok,I get it. Thanks!

tanzhengtang commented 4 years ago

Are you using the latest commit from AA? If not can you do a git pull and try again?

Sorry to bother you again,the running process after using the latest AA:

[root:INFO] Commandline: /home/tang/ecDNA/AmpliconArchitect/src/AmpliconArchitect.py --bed /home/tang/PrepareAA/GBM39_AA_CNV_SEEDS.bed --bam /home/tang/ecDNA/GBM39_WGS/GBM39_WGS-sorted-adhead-markdup.bam --out gbm39-ec --downsample 0 --ref hg19
[root:INFO] AmpliconArchitect version 1.2 [root:INFO] #TIME 7.029 Loading libraries and reference annotations for: hg19 Global ref name is hg19 [root:INFO] #TIME 14.953 Initiating bam_to_breakpoint object for: /home/tang/ecDNA/GBM39_WGS/GBM39_WGS-sorted-adhead-markdup.bam [root:INFO] #TIME 14.954 Exploring interval: chr7 54829879 56119660 [root:INFO] #TIME 174.414 Searching new neighbors for interval: chr7 54729879 56219660 [root:INFO] #TIME 176.663 Calculating coverage meanshift segmentation [root:INFO] #TIME 1926.023 Detecting breakpoint edges [root:WARNING] dnlists do not match 414 412 [root:WARNING] dnlist0: chr7:55174171-->chr7:55174632+ 17 [root:WARNING] dnlist0: chr7:55226402-->chr7:55226496+ 6

virajbdeshpande commented 4 years ago

The warnings are expected. Is the output stuck at this place?

tanzhengtang commented 4 years ago

The warnings are expected. Is the output stuck at this place?

Yes,the output stuck here for some time,eventually I got those files: gbm39-ec_amplicon1_cycles.txt gbm39-ec_chr7_45884632_45895168_cnseg.txt gbm39-ec.chr7:87260878-87640883.out
gbm39-ec_amplicon1_edges_cnseg.txt gbm39-ec_chr7_54729879_56219660_cnseg.txt gbm39-ec_chr8_127991498_130336500_cnseg.txt
gbm39-ec_amplicon1_edges.txt gbm39-ec.chr7:54829879-56119660.out gbm39-ec.chr8:128091498-130236500.out
gbm39-ec_amplicon1_graph.txt gbm39-ec_chr7_56395739_56406275_cnseg.txt gbm39-ec.log gbm39-ec_chr3_111264085_111274620_cnseg.txt gbm39-ec_chr7_87260878_87640883_cnseg.txt gbm39-ec_summary.txt

And some files have no result,such as:gbm39-ec_amplicon1_cycles.txt,gbm39-ec_amplicon1_graph.txt,gbm39-ec_chrxxxxx.out

The content of gbm39-ec_summary.txt:

Amplicons = 3


[amplicon1] AmpliconID = 1 [amplicon1] #Intervals = 2 [amplicon1] Intervals = chr3:111264085-111274620,chr8:127991498-130336500 [amplicon1] OncogenesAmplified = MYC,

Is the work successful?

jluebeck commented 4 years ago

Hi, it appears AA did not finish correctly. We would expect three amplicons of output. Could you share the gbm39-ec.log file? One suggestion, did you remove all files created by AA on previous failed runs? I.e., run again in a new directory or the old directory with all old files removed. Second, could you reset the contents of the file coverage.stats in the AA_DATA_REPO? Clear the contents so it is an empty file. Keep in mind that after reseting you may need to run chmod a+rw coverage.stats.

virajbdeshpande commented 4 years ago

@jluebeck @groveswallow. So far the output seems as expected, it is just taking a while to run. The cycles file, graph file and .out files for amplicon 1 will get populated after it is done generating the amplicon structure for amplicon 1, then AA will start reconstructing amplicon 2.

virajbdeshpande commented 4 years ago

@groveswallow, If you can share the .log file with me, I can check try to assess the cause of slowdown.

tanzhengtang commented 4 years ago

emmm,the reason of process about shut down seems to be the license of mosek.The log file has uploaded. gbm39-ec.log and the tail of nohup.txt: [root:INFO] #TIME 14725.387 Selecting neighbors [root:INFO] #TIME 14874.208 New neighbor: chr3 111264085 111274620 16 (weight=16) [root:INFO] #TIME 14875.148 Searching new neighbors for interval: chr3 111264085 111274620 16 [root:INFO] #TIME 14875.164 Calculating coverage meanshift segmentation [root:INFO] #TIME 15002.672 Detecting breakpoint edges [root:INFO] #TIME 15003.601 Selecting neighbors [root:INFO] #TIME 15003.580 Interval sets for amplicons determined: [root:INFO] [amplicon1] chr3:111264085-111274620,chr8:127991498-130336500 [root:INFO] [amplicon2] chr7:45884632-45895168,chr7:54729879-56219660,chr7:56395739-56406275 [root:INFO] [amplicon3] chr7:87260878-87640883 [root:INFO] #TIME 15003.582 Reconstructing amplicon1 [root:INFO] #TIME 15003.582 Calculating coverage meanshift segmentation [root:INFO] #TIME 15003.582 Detecting breakpoint edges [root:INFO] #TIME 15780.138 Building breakpoint graph [root:INFO] #TIME 15790.069 Optimizing graph copy number flow Traceback (most recent call last): File "/home/tang/ecDNA/AmpliconArchitect/src/AmpliconArchitect.py", line 285, in bamFileb2b.interval_filter_vertices(ilist, amplicon_name=amplicon_name, runmode=args.runmode) File "/home/tang/ecDNA/AmpliconArchitect/src/bam_to_breakpoint.py", line 2160, in interval_filter_vertices task.optimize() File "/home/tang/miniconda3/envs/AA/lib/python2.7/site-packages/mosek/init.py", line 145, in accept return fun([ t(a) for (t,a) in zip(argtlst,args) ]) File "/home/tang/miniconda3/envs/AA/lib/python2.7/site-packages/mosek/init.py", line 37, in sync return f(self,args,**kwds) File "/home/tang/miniconda3/envs/AA/lib/python2.7/site-packages/mosek/init.py", line 8015, in optimize raise Error(rescode(res),msg) mosek.Error: (1001) The license has expired.

jluebeck commented 4 years ago

Thanks, and yes, that happens periodically. You can visit the AA readme and follow the link to get another license. For the academic license, I believe you can obtain another license after the previous has expired without any issues.

tanzhengtang commented 4 years ago

Thanks, and yes, that happens periodically. You can visit the AA readme and follow the link to get another license. For the academic license, I believe you can obtain another license after the previous has expired without any issues.

Thank so much,I will run AA again with your advice.And are the 3 amplicons reported by nohup.txt right?

virajbdeshpande commented 4 years ago

The log output so far seems fine to me. As you suggested it is getting stuck due to the mosek license.. Were you able to update the mosek license and rerun?

tanzhengtang commented 4 years ago

Yes,I updated the license and ran AA again.In this time,I got 3 amplicons from gbm39-wgs data.But I don't understand the difference in amplicon and ecDNA.Is the ecDNA part of amplicon or same as amplicon?And my apmlicon_summary.txt: gbm39-ec_summary.txt

jluebeck commented 4 years ago

Hi, The amplicons reported by AA are general focal amplifications, which can be created by many mechanisms including ecDNA. If you would like to classify the status of these amplicons to produce a bioinformatic labeling of their ecDNA status (and extract cyclic regions), please check out AmpliconClassifier.

tanzhengtang commented 4 years ago

Thanks!I have used the AmpliconClassifier to produce one of my amplicon status,according to output of gbm39_ec_amplicon_calssifier_amplicon_classification_profiles.txt ,this amplicon is circular with a probability of produced by ecDNA?

jluebeck commented 4 years ago

Hi,

Yes the amplicon recieved "Positive" in the ecDNA+ column, and thus it is predicted that amplicon1 contains ecDNA. You can use --extract_genes to get a bed file specifying the cyclic regions.

Best, Jens