itmat / rum

RNA-Seq Unified Mapper
http://cbil.upenn.edu/RUM
MIT License
26 stars 4 forks source link

unmapped reads in sam file #146

Closed e-manduchi closed 11 years ago

e-manduchi commented 11 years ago

Hello, I'm a bit confused about the RUM.sam file produced by the software. From https://github.com/PGFI/rum/wiki/RUM-output-files it says that IH:i:N means the read aligns to N locations. Now, looking at one of my RUM.sam files I noticed that it contains ALL reads from the fastq file (based on the read ids). However not all these reads map. I had assumed that for the unmapped ones I would see IH:i:0, instead there is not such occurrence. Why? What's the most reliable annotation in the RUM.sam which will tell me whether a read mapped or not? Thanks.

delagoya commented 11 years ago

The full SAM specification is located here:

http://samtools.sourceforge.net/SAM1.pdf

Briefly, the first set of columns in a SAM record are required and the 3rd column (RNAME) either contains the reference sequence (e.g. "chr1") if mapped or "*" if it is unmapped. This is the best place to filter for unmapped reads.

-angel

On Wednesday, October 17, 2012 at 4:11 PM, e-manduchi wrote:

Hello, I'm a bit confused about the RUM.sam file produced by the software. From https://github.com/PGFI/rum/wiki/RUM-output-files it says that IH:i:N means the read aligns to N locations. Now, looking at one of my RUM.sam files I noticed that it contains ALL reads from the fastq file (based on the read ids). However not all these reads map. I had assumed that for the unmapped ones I would see IH:i:0, instead there is not such occurrence. Why? What's the most reliable annotation in the RUM.sam which will tell me whether a read mapped or not? Thanks.

— Reply to this email directly or view it on GitHub (https://github.com/PGFI/rum/issues/146).

e-manduchi commented 11 years ago

Thanks. Just found an older message from Greg saying that it was best to use the sixth column, the cigar string, if that's * then didn't align. Still, for these, shouldn't the IH:i be set to 0?

mdelaurentis commented 11 years ago

Actually, I would recommend using the second field (the FLAG field). According to the SAM file spec, "Bit 0x4 is the only reliable place to tell whether the segment is unmapped". So if you're using some program to process the SAM file, you should be able to do something like "$unmapped = $flag & 0x4" to find out of it is unmapped or not.

I don't remember why, but I believe there are some scenarios where the RNAME field can be something other than '*' and the read still didn't map. I think that may happen for paired reads where one maps but its mate doesn't.

On Wed, Oct 17, 2012 at 4:23 PM, e-manduchi notifications@github.comwrote:

Thanks. Just found an older message from Greg saying that it was best to use the sixth column, the cigar string, if that's * then didn't align. Still, for these, shouldn't the IH:i be set to 0?

— Reply to this email directly or view it on GitHubhttps://github.com/PGFI/rum/issues/146#issuecomment-9542357.

e-manduchi commented 11 years ago

Thanks. and I now see that unmapped reads don't have the IH:i at all!

delagoya commented 11 years ago

OK, I've updated the documentation to reference the SAM format specification and also an example of filtering using samtools and the bitwise FLAG.