uec / Issue.Tracker

Automatically exported from code.google.com/p/usc-epigenome-center
0 stars 0 forks source link

BSMAP doesn't seem to set 0x0040 and 0x0080 flags correctly #57

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
As far as I can tell from the BAM spec, flag 0x0040 is supposed to correspond 
to read1 from a read pair, while 0x0080 is supposed to correspond to read2.  
But BSMAP doesn't seem to output it this way.  I think maybe it sets 0x0040 to 
the first one in genome coordinates.  For now (in 
LocusWalkerToBisulfiteCytosineWalker), I am looking for "/1" or "/2" in the 
read name itself, but I think this is not going to be reliable long term.  We 
should check with the author of BSMAP and check the SAMTOOLS spec, and then get 
it fixed.  

I looked at a few messages from the SAM/BAM email list, I think I am 
interpreting the meaning of the 0x0040 and 0x0080 flags correctly.

Original issue reported on code.google.com by benb...@gmail.com on 19 Apr 2011 at 1:25

GoogleCodeExporter commented 8 years ago
I emailed the following to the BSMAP google code site:

----

I am using BSMAP for the first time, and I am wondering about the BAM flags 
0x0040 and 0x0080.  The way I understand the spec, these should be used to mean 
paired-end read1 and paired-end read2, respectively.  So in Illumina naming, 
the "/1" should always go to 0x0040 and the "/2" should always go to 0x0080.  
In BSMAP v1.15, it seems like this is not getting set correctly.  It seems like 
the "first" sorted read, which is forward strand relative to the genome 
assembly, is always set to 0x0040, and the second sorted read is always 0x0080. 
 For instance, I get the following output (omitting sequences):

HWI-ST550_0085:7:1101:7291:57138#0/2    0x63    chr11   363691  255     105M    
=       363705  119     NM:i:0  ZS:Z:--
HWI-ST550_0085:7:1101:7291:57138#0/1    0x93    chr11   363705  255     105M    
=       363691  119     NM:i:0  ZS:Z:-+
HWI-ST550_0085:7:1101:11239:54670#0/1   0x63    chr11   368130  255     105M    
=       368362  337     NM:i:1  ZS:Z:++
HWI-ST550_0085:7:1101:11239:54670#0/2   0x93    chr11   368362  255     105M    
=       368130  337     NM:i:0  ZS:Z:+-

It seems to me that the correct flags for these, in order, should be 
0xA3,0x53,0x63,0x93

Do you agree?  For some applications, including bisulfite, it's important to be 
able to distinguish the two ends.

Thanks very much,
Ben Berman, USC.
----

Original comment by benb...@gmail.com on 19 Apr 2011 at 1:36

GoogleCodeExporter commented 8 years ago
This issue is what was throwing exceptions in picard. I turned off all 
validation to get most things running, but some tools that have specific PE 
"modes" seem to fail. 

Picards complains:

Exception in thread "main" java.lang.RuntimeException: SAM validation error: 
WARNING: Record 19, Read name HWI-ST550_0085:7:1101:6030:54164#0/1, Paired read 
should be marked as first of pair or second of pair.

Original comment by zack...@gmail.com on 19 Apr 2011 at 7:24

GoogleCodeExporter commented 8 years ago
This may be the case for some of them, but I went back to my original bsmap sam 
file, and it does set the 0x0040 or 0x0080 flags for the correctly paired 
alignments.  But it seems to be setting them wrong. I'm not sure what the 
original flags were for one you show above -- would be good to look it up.

Original comment by benb...@gmail.com on 19 Apr 2011 at 1:31

GoogleCodeExporter commented 8 years ago
The one you show above indeed doesn't have this flag, but this is the minority. 
 They are ones that are mapped but not mapped in a correct pair.  I believe 
this is wrong as well (0x0040 and 0x0080 flags should still be used regardless 
of alignment).

samtools view -x ResultCount_B031CABXX_7_NIC1254A2.bam | grep 
0085:7:1101:6030:54164
sHWI-ST550_0085:7:1101:6030:54164#0/2   0x21 
    chr3    150025300   255 99M =   150025397   0   CCTCATCTTTTTCATCTTCTTCATCTTCCTCTTCCTTCTTTT
TCTTACTTTTTTCAAACTTAATAACCCCTTTTTTTACTACATCAAACTTTCCTTAAC   ggggggggfgggffggfgfged
dfgffeed_ddcdfgggfggggfedee\LLMML^da\_cddcf_eececdee_bb]ee^ZaXd\\cc_ddada[b_c   NM
:i:7    ZS:Z:++
HWI-ST550_0085:7:1101:6030:54164#0/1    0x11 
    chr3    150025397   255 105M    =   150025300   0   ACCCAAAATACAACAATATCCTTTTCATATTATTCTTTCAA
CTTCACGACCTTCTTTTCGTAAATCTACTTATCATCTACAACAATATTATTTCCTATCTCTCCA    e\aVeffdefbffaf
`]`ZN`ddd_cbc`cad`Nd_eadfbecdadeedfdc_ecee_feefffefe^ffdfade]fdffdffffbfdffcfcef
bccc^b_bbb  NM:i:0  ZS:Z:-+

Original comment by benb...@gmail.com on 19 Apr 2011 at 1:40

GoogleCodeExporter commented 8 years ago
---------- Forwarded message ----------
From: Ben Berman <benbfly@gmail.com>
Date: Tue, Apr 19, 2011 at 8:43 AM
Subject: Re: bsmap - Google Groups: Message Pending 
[{IISIh5fX69_5NioCeWIwAUjO-5bdPoNB0}]
To: Yuanxin Xi <yxi@bcm.edu>
Cc: bsmap <bsmap+msgappr@googlegroups.com>, "Li, Wei" <wl1@bcm.tmc.edu>, zack 
Ramjan <ramjan@usc.edu>

Thanks for the quick reply Yuanxin.  I know the spec is not very clear, but 
after looking back through some of the SAMTOOLS discussion board and code, I'm 
pretty sure what I said is correct (the 0x40 should correspond to the /1 read 
in Illumina, and 0x80 to the /2 read).  It doesn't seem like there would be 
much use for flags that only tell you which is first in mapping coordinates 
(it's redundant because you could just look at the coordinates).

Also, in some cases where there is a bad alignment, I noticed that BSMAP 
doesn't set either 0x0040 or 0x0080, and Picard will complain about this:
Exception in thread "main" java.lang.RuntimeException: SAM validation error: 
WARNING: Record 19, Read name HWI-ST550_0085:7:1101:6030:54164#0/1, Paired read 
should be marked as first of pair or second of pair.

On Tue, Apr 19, 2011 at 8:16 AM, Yuanxin Xi <yxi@bcm.edu> wrote:
Hi Ben,

According to samtools spec 1.4, flag 0x40 refers to the first segment of the 
template and 0x80 refers to the last segment of the template.  So my 
understanding is that it depends on the mapping coordinates, instead of which 
read set in pair-end seq.  For properly paired reads, BSMAP set the read with 
lower genomic coordinate with 0x40 and the other with 0x80.  

The first two reads in your example maps to the '-+' and '--' strands, which is 
the bisulfite converted Crick strand and its reverse complementary strand 
respectively.  Although the '-+' read comes from forward strand, its genomic 
location is behind the '--' read, so '--' read is set 0x40.  One difference 
between bisulfite sequencing and normal sequencing is that the bisulfite 
converted crick strand no longer reverse-complements the bisulfite Watson 
strand, so there're actually 4 distinct reference strands.
Best regards,

Yuanxin

Original comment by benb...@gmail.com on 19 Apr 2011 at 3:43

GoogleCodeExporter commented 8 years ago
In the meantime, make correction script to fix BSMAP SAM output.

Original comment by zack...@gmail.com on 19 Apr 2011 at 8:45

GoogleCodeExporter commented 8 years ago
I made a GATK script to correct BSMAP SAM output.  I tested it on a small test 
file, Zack you may have to test it on a large file.  Location:
http://uecgatk.svn.sourceforge.net/viewvc/uecgatk/trunk/src/edu/usc/epigenome/ue
cgatk/benWalkers/BsmapPairedEndFixerWalker.java?view=markup

Original comment by benb...@gmail.com on 20 Apr 2011 at 11:08

GoogleCodeExporter commented 8 years ago

Original comment by benb...@gmail.com on 21 Apr 2011 at 1:04

GoogleCodeExporter commented 8 years ago

Original comment by benb...@gmail.com on 21 Apr 2011 at 1:04

GoogleCodeExporter commented 8 years ago
--------- Forwarded message ----------
From: Yuanxin Xi <yxi@bcm.edu>
Date: Thu, Apr 21, 2011 at 8:04 AM
Subject: Re: bsmap - Google Groups: Message Pending 
[{IISIh5fX69_5NioCeWIwAUjO-5bdPoNB0}]
To: Ben Berman <benbfly@gmail.com>
Cc: bsmap <bsmap+msgappr@googlegroups.com>, "Li, Wei" <wl1@bcm.edu>, zack 
Ramjan <ramjan@usc.edu>

HI Ben,

I confirmed that you are right about the 0x40/0x80 flag.  I've made the change 
in the latest BSMAP release v1.25.  Thanks for pointing this out.

Best,

Yuanxin 

Original comment by benb...@gmail.com on 21 Apr 2011 at 4:01

GoogleCodeExporter commented 8 years ago
bsmap 1.25 deployed and in use at hpcc.

are we ready to close this ticket?

Original comment by zack...@gmail.com on 5 May 2011 at 12:34

GoogleCodeExporter commented 8 years ago
I confirmed that the 0x0040 and 0x0080 flags were set correctly in 
ResultCount_B03BUABXX_8_NIC1252A27.bam.  So that means it's working?

Original comment by benb...@gmail.com on 5 May 2011 at 12:41

GoogleCodeExporter commented 8 years ago
the flags are being set correctly. picard and gatk are handling inputs 
correctly.

Original comment by zack...@gmail.com on 12 Jul 2011 at 6:35