rogerjms / bedtools

Automatically exported from code.google.com/p/bedtools
0 stars 0 forks source link

coverageBed fails for BAM files generated by Novoalign #76

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. I run novoalign to map reads against the genome with the SAM output
2. SAM files has been converted into BAM files with samtools. 
3. The final BAM files have no leading "@" annotation tags and have some 
special comment generated by novoalign. 

What is the expected output? What do you see instead?
When I run coverageBed with the following command: 
coverageBed -abam sorted.bam -b miRNA.gff -hist 
The coverage column is 0 and there is also no histogram generated. For example, 
the first several lines of output:

2L     FlyBase miRNA   6045678 6045698 .       -       .       
ID=FBtr0304482;Name=mir-966-RA;Derives_from=FBtr0304481;Alias=CR43024-RA;Dbxref=
FlyBase_Annotation_IDs:CR43024-RA,miRBase:MIMAT0005481  0       21      21      
1.0000000
2L      FlyBase miRNA   243055  243076  .       -       .       
ID=FBtr0304344;Name=mir-965-RA;Derives_from=FBtr0304343;Alias=CR42959-RA;Dbxref=
FlyBase_Annotation_IDs:CR42959-RA,miRBase:MIMAT0005480  0       22      22      
1.0000000
2L      FlyBase miRNA   857596  857617  .       +       .       
ID=FBtr0304183;Name=mir-375-RA;Derives_from=FBtr0304182;Alias=CR42884-RA;Dbxref=
FlyBase_Annotation_IDs:CR42884-RA,miRBase:MIMAT0005472  0       22      22      
1.0000000
2L      FlyBase miRNA   1831710 1831731 .       -       .       
ID=FBtr0304393;Name=mir-2280-RB;Derives_from

What version of the product are you using? On what operating system?
I used bedtools v2.12.0 on redhat linux CentOS release 5.5 (Final). 

The first several lines of my input is like:
1456_1113_1698_F3       0       chr2L   5307    3       15M20H  *       0       
0       GTATGCGAGAGTGGT :BBB@BB@?-5?B@: PG:Z:novoalign  AS:i:0  UQ:i:0  NM:i:0  
MD:Z:15
1629_1494_717_F3        16      chr2L   5311    5       12H2S21M        *       
0       0       AGGCGAGAGTGGTGCCAACATAT B<:;BB?%7=>@A<@B@?@BBA> PG:Z:novoalign  
AS:i:32 UQ:i:32 NM:i:0  MD:Z:21
1445_824_1138_F3        0       chr2L   5320    8       23M12H  *       0       
0       GTGCCAACATATTGTGCCCTTCG BBBBBBB@A@BB:B?@AB=B<97 PG:Z:novoalign  AS:i:30 
UQ:i:30 NM:i:1  MD:Z:17T5
1630_1474_1992_F3       0       chr2L   5361    2       15M20H  *       0       
0       ATGGAGGCGGATGAA B:ABBA;BB<A4*>: PG:Z:novoalign  AS:i:1  UQ:i:1  NM:i:0  
MD:Z:15
1476_1408_580_F3        0       chr2L   5364    11      21M14H  *       0       
0       GAGGCGGATGAACGAGATGAT   @BB@B@?B697B5/;A5@<?=   PG:Z:novoalign  AS:i:0  
UQ:i:0  NM:i:0  MD:Z:21
1401_428_992_F3 0       chr2L   5364    11      21M14H  *       0       0       
GAGGCGGATGAACGAGATGAT   @>A:BBAB?69B?0<?;A9<>  
PG:Z:novoalign  AS:i:0  UQ:i:0  NM:i:0  MD:Z:21
1483_1145_1939_F3       0       chr2L   5480    25      19M16H  *       0       
0       TCGAACATAGAACATAGGC     BBBBB@BB>BB@%=ABA4=     PG:Z:novoalign  AS:i:2  
UQ:i:2  NM:i:0  MD:Z:19
1517_796_204_F3 16      chr2L   5562    38      13H1S21M        *       0       
0       GGCACGACATAGAGAGAAAGAG  ':=:B@=;5@B@;BBA>=BBBB  PG:Z:novoalign  AS:i:11 
UQ:i:11 NM:i:0  MD:Z:21
1553_1556_1839_F3       0       chr2L   5800    30      22M5S8H *       0       
0       TCGAATATATATTGCCTGCCTCGCCTT     BBB>BBBB?BBB7;BA???@>@4:?@6     
PG:Z:novoalign  AS:i:54 UQ:i:54 NM:i:0  MD:Z:22
1482_313_317_F3 0       chr2L   5800    33      22M5S8H *       0       0       
TCGAATATATATTGCCTGCCTCGCCTT     BBB:B:?B;@BB29B>9;>9:@1:A@6     PG:Z:novoalign  
AS:i:51 UQ:i:51 NM:i:0  MD:Z:22

Please provide any additional information below.

I am kind of wondering whether this issue is related with novoalign or loss of 
annotation lines in SAM/BAM files (@lines). 

Thanks and any feedback is welcome.

Original issue reported on code.google.com by PkuOnl...@gmail.com on 26 Jul 2011 at 4:46

GoogleCodeExporter commented 9 years ago
To me, it looks like your issue is that your chromosome names have the form 
"chr2L" in your BAM file, while your chromosomes have the form "2L" in your GFF 
file.  Bedtools, like all tools, expects the chromosome labels to match in 
order to track overlapping alignments.  A simple fix would be to prefix you GFF 
entries with "chr" and retry:

awk '{print "chr"$0} miRNA.gff > miRNA-withchr.gff
coverageBed -abam sorted.bam -b miRNA-withchr.gff -hist

Original comment by aaronqui...@gmail.com on 26 Jul 2011 at 4:53

GoogleCodeExporter commented 9 years ago
OMG, I should have noticed about this issue. A million thanks for the rapid 
help!

Original comment by PkuOnl...@gmail.com on 26 Jul 2011 at 5:01