lh3 / bwa

Burrow-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)
GNU General Public License v3.0
1.51k stars 551 forks source link

NH tag #211

Open apredeus opened 6 years ago

apredeus commented 6 years ago

I've searched for a while, but all I find is very old threads with many confused people, so I figured I will ask here.

Apparently, bowtie2 and bwa do not report NH tag, which supposed to label how many times do you see this particular read ID in the resulting alignment (as far as I understand). Normally I would say this is not a big deal since there is MAPQ fields, and you can filter your results if you need to.

However, there is a specific case when it starts to matter quite a lot: bacterial RNA-seq. Since there's no splicing, people often tend to use a fast and robust genomic short-read aligner, like bowtie2. However, the downstream tools for RNA-seq analysis rely on NH tag - for example, featureCounts does.

So, is there a specific reason bwa does not report the NH tag? And is there an easy way to add it to the resulting SAM/BAM file? Thank you for any hints.

lh3 commented 6 years ago

is there a specific reason bwa does not report the NH tag?

No specific reason. It is just that for genomic alignment, we don't use this tag.

And is there an easy way to add it to the resulting SAM/BAM file?

How to reports hits can be tricky. For example, do you only count equally best hits or include suboptimal hits?

the downstream tools for RNA-seq analysis rely on NH tag

Due to the ambiguity, you should use this tag.

apredeus commented 6 years ago

How to reports hits can be tricky. For example, do you only count equally best hits or include suboptimal hits?

Ah. Thank you for bringing my attention to this. So far "-k" mode seems to have exactly the same description in RNA-seq aligners (hisat2/STAR) as in bwa/bowtie2, but the results are quite different, even on bacterial genome with explicitly forbidden splicing. Hisat2 and STAR report about 5% more unique alignments.

I've checked and that's because Hisat2/STAR only report reads as multimapping when mapping scores are all the same, and if they are not, they report the best one.

Due to the ambiguity, you should use this tag.

You mean NOT use it, right? Unfortunately, several read counting programs I've encountered rely on it to identify the multimapping reads.

Thank you very much for taking time to answer.

zqlns commented 4 months ago

hi Apredus,

I'm currently using RMATS to analyze RNA-seq data, but the BAM files don't have the NH tag. I was wondering if you have found a good method to add the NH tag. Your help would be greatly appreciated.

Best, Qiong

apredeus commented 4 months ago

Hi Qiong,

I went with a more specialised RNA-seq aligner in the end, because what I was trying to do is to map short reads (but my reference was really big, so I wanted minimap2 for its efficiency). I think there are several long read mappers being developed for long read RNA-seq; some would probably have the NH tag assigned correctly. Here's a paper that I think should be useful: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02972-3