samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

sam: Consider recommending using 0 for mapping quality when record is unmapped #727

Open zaeleus opened 1 year ago

zaeleus commented 1 year ago

As suggested by @jkbonfield in https://github.com/samtools/hts-specs/issues/715#issuecomment-1508097708:

as an aside, I note you're also using MAPQ of 255 for "unavailable". Commendable, but my experience is that everyone just uses 0 with unmapped data. I think this is because when FLAG 4 is set the specification states no assumption can be made about MAPQ, so it just feels cleaner to zero it out as all other fields have been.

jmarshall commented 1 year ago

I have always thought that the text about “no assumptions can be made about [these fields] in [these circumstances]” is a mistake that will prevent us from defining meanings for those fields in those currently unspecified circumstances, as discussed in e.g. this samtools-devel thread.

However, over a decade later, I fear the ship may have sailed on changing this policy.

d-cameron commented 1 year ago

Although "no assumptions can be made", if one were to make assumptions, I would interpret a MAPQ defined on an unmapped read as the probability that the read is genuinely unmapped w.r.t. the reference used. That is, reads for which the aligner had candidates and gave up (e.g. bwa when every seed has >500 occurrences in the reference) would be MAPQ 0 (since it has no confidence that the read is actually unmapped), and reads which don't match any reference (e.g. contamination, primers) would have non-zero MAPQ.

jkbonfield commented 1 year ago

@d-cameron I agree that could have made sense, but it's too late now, and I don't think it's how aligners work, so I expect such values wouldn't be sensibly calibrated anyway. That said, some proxy based on complexity and length (i.e. an entropy estimation) could be used as an indicator that the read is genuine data not found in this reference (but perhaps part of this genome, e.g. from a long insertion).
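
For illustration only, the kind of complexity proxy mentioned above could be sketched as a Shannon entropy over base composition. This is an editorial sketch, not something any aligner is known to compute; the function name `shannon_entropy` and the choice of single-base composition (rather than k-mers, or any length weighting) are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy, in bits per base, of the base composition of seq.

    Hypothetical helper: a low value suggests a low-complexity read
    (e.g. a homopolymer run), a value near 2 suggests all four bases
    are used roughly equally.
    """
    if not seq:
        return 0.0
    counts = Counter(seq.upper())
    n = len(seq)
    # Sum of -p * log2(p) over the observed bases.
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())
```

A homopolymer such as `AAAAAAAA` scores 0 bits, while a read using all four bases equally scores 2 bits. Turning such a score into a calibrated MAPQ-like value is a separate problem that this sketch does not attempt.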

@jmarshall - also agreed the language about "no meaning" isn't always helpful, but as you say the ship has sailed. However, I think it is reasonable to make recommendations (not requirements) on the values when they are essentially NOPs. E.g. CIGAR has no meaning for unmapped data, but sanity checkers may well gripe about an unmapped record with CIGAR 151M, and indeed it has been reported as a bug in aligners before, despite being technically valid SAM. So a footnote recommending the values to use when they have no meaning would be useful and wouldn't be a breaking change. I'm unsure about MAPQ though: CIGAR * is "unavailable" and is used for unmapped data, so arguably MAPQ 255 is also correct here; however, I'm thinking more of conformity with existing practice.
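
As a rough illustration of the kind of sanity check described above, the following sketch flags unmapped records (FLAG 0x4 set) whose CIGAR or MAPQ differ from the values discussed in this thread. It is an assumption-laden example, not part of the specification or of any existing tool: the function name `unmapped_field_warnings` and the `preferred_mapq` parameter are hypothetical, and the default of 0 merely reflects the common practice mentioned here, not a requirement.

```python
from typing import List

UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped


def unmapped_field_warnings(sam_line: str, preferred_mapq: int = 0) -> List[str]:
    """Return advisory warnings for one SAM alignment line.

    Hypothetical lint: such records are still valid SAM; this only
    reports values that differ from the suggested placeholders.
    """
    # Mandatory columns: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if not flag & UNMAPPED:
        return []  # mapped records are out of scope for this check

    warnings = []
    mapq = int(fields[4])
    cigar = fields[5]
    if cigar != "*":
        warnings.append(f"unmapped record has CIGAR {cigar}; '*' is suggested")
    if mapq != preferred_mapq:
        warnings.append(
            f"unmapped record has MAPQ {mapq}; {preferred_mapq} is suggested"
        )
    return warnings
```

Passing `preferred_mapq=255` would instead treat the "unavailable" value as the expected placeholder, which is the alternative weighed above.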