simon-anders / htseq

HTSeq is a Python library to facilitate processing and analysis of data from high-throughput sequencing (HTS) experiments.
https://htseq.readthedocs.io/en/release_0.11.1/
GNU General Public License v3.0
122 stars 77 forks source link

htseq-qa not giving the true aligned reads count #95

Closed adirot closed 4 years ago

adirot commented 4 years ago

I have a sam file with 19807240 total reads with 403055 of them not aligned. The samtools stats command gave me this number. I ran htseq-qa on my file, and in the PDF it says that there are 22.2411 million aligned reads.... Why is that?

htseq-qa output: A_1sb_S8_R1_001_aligned.sam.pdf

the sam file: https://drive.google.com/file/d/1lO2SjjasO6CK8i2OI2a4fx1IWeil1g1y/view?usp=sharing

samtool stat output: A_1sb_S8_R1_001_aligned_stat.txt

iosonofabio commented 4 years ago

reads or alignments? It's not the same thing

adirot commented 4 years ago

The PDF produced by htseq says "aligned reads 22.24 million" : image

In the samtools stats I get: reads mapped: 19404185

Is there a difference between your "aligned reads" and samtool stat "reads mapped?"

(the "non-aligned reads" you present and samtools "unmapped reads" are the same....)

iosonofabio commented 4 years ago

Hmm let me have a look

iosonofabio commented 4 years ago

I'm not sure what samtools does, but htseq-qa counts alignments, not reads.

Reads are what goes into a mapper (e.g. STAR, bowtie, bwa) while alignments are what comes out of it.

The difference is that some reads map equally well to several repetitive regions of the genome so the mapper is confused. Depending on what options you ran your mapper with, it might have written more than one alignment for this "multimapper" reads.

A classical example is ribosomal RNA, which is poorly annotated in the human genome and can be present around 8 times for some moieties.

I'll add an option to the script to deal with this to some extent, as done previously for htseq-count.

iosonofabio commented 4 years ago

I added an option --primary-only in commit 13ba125, which will be in the next release.

Notice that python2 is at the end of life, so all development happens in Python 3 now. It is possible that the fix works in python2 as well, but please run it with python3 from now on as python2 support will be dropped soon.

Thanks for reporting.