samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

detail definition of SM and NH field in SAM format #187

Open xuan-wang opened 7 years ago

xuan-wang commented 7 years ago

Hi, in the SAM Optional Fields Specification (SAMtags.tex), there are fields named SM (Template-independent mapping quality) and NH (Number of reported alignments that contains the query in the current record).

The SAM output of BWA mem -a -M doesn't contain them.

I am confused by the descriptions and couldn't find a detailed definition of these two fields. I need these two fields to filter reads for SV calling. Could anybody tell me what they mean and how to generate them?

Thanks a lot!

colinhercus commented 7 years ago

Hi Dicor,

We generate these in Novoalign.

NH will be output if a read has multiple mappings with similar alignment scores. It is the number of mappings found, not the number of SAM lines for the read. The HI tag is the number of mappings actually appearing as SAM records for the read.

SM we interpreted as MAPQ for the read if it wasn't part of a pair. For SVs the read isn't mapped as a proper pair and the SAM MAPQ field applies to each read independently. It's only in proper pairs that we calculate and report SM.

Colin
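As a rough illustration of how the tags described above could be read for filtering, here is a minimal Python sketch. It assumes the aligner emitted NH and SM as integer tags; the helper names and the two example records are made up for illustration, and treating a missing NH as "one mapping found" is an assumption rather than something the spec states.

```python
def parse_tags(sam_line):
    """Return the optional fields of one SAM record as a dict, e.g. {'NH': 4}."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # optional fields start at column 12
        tag, typ, value = field.split(":", 2)
        tags[tag] = int(value) if typ == "i" else value
    return tags


def single_read_mapq(sam_line):
    """Use SM if the aligner emitted it, otherwise fall back to the MAPQ column."""
    fields = sam_line.rstrip("\n").split("\t")
    return parse_tags(sam_line).get("SM", int(fields[4]))


def is_unique(sam_line):
    """Assumption: a record without NH is treated as having one mapping."""
    return parse_tags(sam_line).get("NH", 1) == 1


# Made-up records for illustration only.
multi = "r1\t0\tchr1\t100\t3\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:4\tHI:i:1"
paired = "r2\t99\tchr1\t200\t60\t10M\t=\t400\t210\tACGTACGTAC\tIIIIIIIIII\tSM:i:37"
```

A filter for SV calling might then keep records where `is_unique(line)` holds and `single_read_mapq(line)` exceeds some threshold chosen by the user; the appropriate threshold depends on the aligner.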

jmarshall commented 4 years ago

The SM description was expanded to “Template-independent mapping quality, i.e., the mapping quality if the read were mapped as a single read rather than as part of a read pair or template” in dea1302f843dea1b754f7ba36db0b8858231c184.

The current NH description is “Number of reported alignments that contain the query in the current record”. Aided by the comment here, I'm planning to expand that along the lines of

Number of reported alignments that contain the query in the current record. This is a total count of ‘acceptable’ alignments as identified by the aligner, and may be greater than the number of alignment records it actually emitted.

SAMtags currently describes HI as “Query hit index, indicating the alignment record is the i-th one stored in SAM” and IH as “Number of alignments stored in the file that contain the query in the current record”. (See also PR #326, which changed IH's description in passing and perhaps over-summarised its entry in the introductory table.) @colinhercus Is HI in your comment perhaps a typo? — IH instead would be consistent with the spec's descriptions.
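To make the distinction between these tags concrete, the following Python sketch counts the records actually stored for each query and sets that count beside the IH and NH values carried on those records. It is only a sketch under simplifying assumptions: single-end reads, no supplementary records, and integer-typed tags. The function names and example records are hypothetical.

```python
from collections import defaultdict


def tag_value(sam_line, tag):
    """Return the integer value of an optional 'TAG:i:N' field, or None if absent."""
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        name, typ, value = field.split(":", 2)
        if name == tag and typ == "i":
            return int(value)
    return None


def check_counts(sam_lines):
    """Map each query name to (records stored, IH, NH), assuming single-end reads."""
    stored = defaultdict(int)
    ih, nh = {}, {}
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        qname = line.split("\t", 1)[0]
        stored[qname] += 1
        ih[qname] = tag_value(line, "IH")
        nh[qname] = tag_value(line, "NH")
    return {q: (stored[q], ih[q], nh[q]) for q in stored}


# Made-up records: two alignments stored for r1, five found by the aligner.
records = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t1\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:5\tIH:i:2\tHI:i:1",
    "r1\t256\tchr2\t300\t1\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:5\tIH:i:2\tHI:i:2",
]
summary = check_counts(records)
```

In this example the stored count matches IH, while NH is larger, which is the situation the expanded NH wording is meant to allow for.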

colinhercus commented 4 years ago

@jmarshall You're right, I should have said IH is the number of alignments we reported and IH the index of the alignment. I believe we follow the specs as they were in 2009.

With regard to your new definition for NH, “Number of reported alignments”, I find the use of 'reported' confusing. For me, 'reported alignments' would mean alignments appearing in the report. Perhaps "Number of acceptable alignments identified by the aligner .... Where 'acceptable' is defined by the aligner."

jkbonfield commented 4 years ago

Agreed on "reported" not being the ideal word, as the output is essentially the report from the aligner. I'll also offer "identified" or "detected" as suitable alternatives; probably the latter as identified is used in the next sentence.

jmarshall commented 4 years ago

“Reported” has been there since 2009, but indeed if the word should be used at all, it should be in describing IH. Okay, I'll revisit that sentence too; PR incoming in due course…

(I'm presuming you meant “and HI the index of the alignment” :smile:)

colinhercus commented 4 years ago

Totally confused myself; transposition errors are always a problem for one-finger typists :)