Closed adirot closed 4 years ago
reads or alignments? It's not the same thing
The PDF produced by htseq says "aligned reads 22.24 million" :
In the samtools stats I get: reads mapped: 19404185
Is there a difference between your "aligned reads" and samtool stat "reads mapped?"
(the "non-aligned reads" you present and samtools "unmapped reads" are the same....)
Hmm let me have a look
I'm not sure what samtools does, but htseq-qa
counts alignments, not reads.
Reads are what goes into a mapper (e.g. STAR, bowtie, bwa) while alignments are what comes out of it.
The difference is that some reads map equally well to several repetitive regions of the genome so the mapper is confused. Depending on what options you ran your mapper with, it might have written more than one alignment for this "multimapper" reads.
A classical example is ribosomal RNA, which is poorly annotated in the human genome and can be present around 8 times for some moieties.
I'll add an option to the script to deal with this to some extent, as done previously for htseq-count
.
I added an option --primary-only
in commit 13ba125, which will be in the next release.
Notice that python2 is at the end of life, so all development happens in Python 3 now. It is possible that the fix works in python2 as well, but please run it with python3 from now on as python2 support will be dropped soon.
Thanks for reporting.
I have a sam file with 19807240 total reads with 403055 of them not aligned. The samtools stats command gave me this number. I ran htseq-qa on my file, and in the PDF it says that there are 22.2411 million aligned reads.... Why is that?
htseq-qa output: A_1sb_S8_R1_001_aligned.sam.pdf
the sam file: https://drive.google.com/file/d/1lO2SjjasO6CK8i2OI2a4fx1IWeil1g1y/view?usp=sharing
samtool stat output: A_1sb_S8_R1_001_aligned_stat.txt