keithforest / ea-utils

Automatically exported from code.google.com/p/ea-utils
0 stars 0 forks source link

sam-stats fails to properly deduce unmapped reads and snp (ins/del) rate #21

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Probably an entirely bwa specific issue.

What steps will reproduce the problem?
0. Generated a bam using BWA 0.6.2
1. Filtered for unmapped reads using samtools view -f4 <your.bam>
2. Ran sam-stats on the above generated sam
3. Noticed that there was 1 read which had a non "* CIGAR string, had a NM 
value of 3 and had a corresponding MD string.

What is the expected output? What do you see instead?
I would expect sam-stats to determine mapped or unmapped based on the bit flag. 
After filtering the unmapped reads, it could then figure out other stats like 
snp, ins and del rates.

What version of the product are you using? On what operating system?
Version: 1.34.488

Please provide any additional information below.

$ grep "NM:" sample.unmapped.sam

HWI-ST1134:50:C0EBHACXX:8:1205:20100:107686     103     GL000195.1     111     
60     101M     =     111     0     
AAGAATTCCTCGTTCACACAGTTTCTTAAGCTTCCTGGGATGCGACCTGTGATGGCTCGGCGGAGCTCGGTGGCAGTTGT
CTCCCTCATCTCCAGTGACAC     
=>;?@BB@>BA9@BAC@B@CAABBABBAAA>BAA>BB>>ABB>9A@>BCABACB>>BA:?>9>???CB<<=@<?D?<>?=
>A?>?BB@??B@>A>@B>>A?     X0:i:1     X1:i:0     
BD:Z:KKLMMMLLMLLLLLLMLLLLNMLCLLKKKLNMKLMLNMKMMNMLLMLLNLLMMNMMMLLLMLLMMNMLLLLLMMM
NMLLLMLLMJLLMMMLMNNONMMMLL     MD:Z:0T0C74C24     RG:Z:1     XG:i:0     
BI:Z:PPPPQQPPPQPPPPPQOPOPQQPKPPPPPPQPPPPQQPNPPQQPPQQQQPPQPQPPPPPPPPPPPQPPPPPPPPP
QQPQPQPPPNQPQQOOOOOQQPQQPO     AM:i:37     NM:i:3     SM:i:37     XM:i:3     
XO:i:0     
OQ:Z:CCCFFFFFHHHHHIIJJGIJJHIIIJIJJIIIJJJIIJJIIJJJJJJJJIJJJJJJJIIHGFD?BBDDD><?@BD
?9>C:@CCDDCDBCCCCCCCDDDDDC     XT:A:U

Original issue reported on code.google.com by narayana...@gmail.com on 9 Sep 2013 at 5:55

GoogleCodeExporter commented 9 years ago
Right, sam-stats currently uses the existence of a chromosome and a position to 
mean "mapped".   

The reason for that is that bwa has some odd bit-field behavior where some 
reads are "mapped" but have no position, and others are "unmapped" but have 
both (and are correct).

(In your case, the correct alignment is at position 0 with two bases of 
soft-clipping)

I will change it to require both.   It's probably better.

Original comment by earone...@gmail.com on 13 Sep 2013 at 7:08

GoogleCodeExporter commented 9 years ago
New version checks bits, and reports+skips zero/negative positions as "error 
reads".

Original comment by earone...@gmail.com on 18 Sep 2013 at 2:24

GoogleCodeExporter commented 9 years ago
Awesome! Thank you for addressing this issue. Will test drive it and let you 
know if I stumble across something else.

I have another suggestion:
How about spitting out stats separately for read1 and read2 for paired-end 
bams? That way the stats don't get averaged out and assuming we have an issue 
with just read2, we could still salvage the read1 data.

Just a thought.

-Narayanan

Original comment by narayana...@gmail.com on 18 Sep 2013 at 3:43

GoogleCodeExporter commented 9 years ago
I usually run fastq-stats on read1 and read2 separately, and run the graphing 
tool for them as well.   After alignment, it's often too late, since some 
aligners will count very improper pairs as "unaligned" anyway... sometimes it's 
hard to tell which side is right.

if you had an extract from a bam file where 1 read was good and the other not, 
i'd like to see if that's caught by fastq-stats or whether it's something 
sam-stats is better at reporting

Original comment by earone...@gmail.com on 18 Sep 2013 at 5:54

GoogleCodeExporter commented 9 years ago
check the trunk version for new behavior

Original comment by earone...@gmail.com on 2 Oct 2013 at 4:15

GoogleCodeExporter commented 9 years ago

Original comment by earone...@gmail.com on 13 Nov 2013 at 10:13