Open zaeleus opened 1 year ago
I have always thought that the text about “no assumptions can be made about [these fields] in [these circumstances]” is a mistake that will prevent us from defining meanings for those fields in those currently unspecified circumstances, as discussed in e.g. this samtools-devel thread.
However, over a decade later, I fear the ship may have sailed on changing this policy.
Although "no assumptions can be made", if one were to make assumptions, I would interpret a MAPQ defined on an unmapped read as expressing the confidence that the read is genuinely unmapped w.r.t. the reference used. That is, reads for which the aligner had candidates and gave up (e.g. bwa when every seed has >500 occurrences in the reference) would get MAPQ 0 (since the aligner has no confidence that the read is actually unmapped), and reads which don't match any reference (e.g. contamination, primers) would have non-zero MAPQ.
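To make that interpretation concrete, here is a minimal sketch that Phred-scales the chance that an "unmapped" call is wrong, in the same way MAPQ is defined for mapped reads. Everything here is hypothetical: `unmapped_mapq`, its `p_actually_mappable` input (an estimate no aligner is known to produce), and the cap of 60 are assumptions for illustration, not anything the SAM specification defines.

```python
import math


def unmapped_mapq(p_actually_mappable: float, cap: int = 60) -> int:
    """Phred-scale the probability that an 'unmapped' call is wrong.

    p_actually_mappable is a hypothetical aligner estimate that the read
    really does belong somewhere in the reference. The cap (assumed 60)
    keeps the result clear of 255, which SAM reserves for "unavailable".
    """
    if not 0.0 <= p_actually_mappable <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    if p_actually_mappable == 0.0:
        return cap  # certain the read is unmapped
    q = -10.0 * math.log10(p_actually_mappable)
    return min(cap, int(round(q)))
```

Under this sketch, a read the aligner gave up on (it probably maps somewhere, so the probability is near 1) gets MAPQ 0, while a read with no plausible candidates gets a high value.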
@d-cameron I agree that could have made sense, but it's too late now, and I don't think it's how aligners work, so I expect such values wouldn't be sensibly calibrated anyway. That said, some proxy based on complexity and length (i.e. an entropy estimate) could be used as an indicator that it is genuine data not found in this reference (but perhaps part of this genome, e.g. from a long insertion).
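As a rough illustration of the entropy proxy mentioned above, the sketch below computes the Shannon entropy of a read's k-mer distribution and combines it with a length check. The function names and the thresholds (`min_len=50`, `min_entropy=1.5`) are arbitrary assumptions chosen for the example, not calibrated values.

```python
import math
from collections import Counter


def shannon_entropy(seq: str, k: int = 1) -> float:
    """Shannon entropy (in bits) of the k-mer distribution of seq."""
    seq = seq.upper()
    n = len(seq) - k + 1  # number of overlapping k-mers
    if n <= 0:
        return 0.0
    counts = Counter(seq[i:i + k] for i in range(n))
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def looks_like_genuine_sequence(seq: str, min_len: int = 50,
                                min_entropy: float = 1.5) -> bool:
    """Hypothetical proxy: long enough and not low-complexity.

    Both thresholds are illustrative assumptions.
    """
    return len(seq) >= min_len and shannon_entropy(seq) >= min_entropy
```

A homopolymer run scores 0 bits and a uniform mix of the four bases scores 2 bits with `k=1`, so a long, high-entropy unmapped read is more plausibly real sequence missing from the reference than a low-complexity artefact.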
@jmarshall - also agreed the language about "no meaning" isn't always helpful, but as you say the ship has sailed. However, I think it is reasonable to make recommendations (not requirements) on the values when they are essentially NOPs. E.g. CIGAR has no meaning for unmapped data, but sanity checkers may well gripe about something with CIGAR 151M, and indeed it's been reported as a bug in aligners before, despite being technically valid SAM. So a footnote recommending the values to use when they have no meaning would be useful and wouldn't be a breaking change. I'm unsure about MAPQ though: CIGAR * is "unavailable" and is used for unmapped data, so arguably MAPQ 255 is also correct here; however, I'm thinking more for the sake of conformity with existing practice.
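For illustration, a sanity checker of the kind described might look like the sketch below. It warns about (rather than rejects) an unmapped record whose CIGAR is not `*`, since such a record is still valid SAM. The function name `lint_unmapped` and the optional `expected_mapq` parameter are hypothetical; the MAPQ check is off by default because which value to recommend is exactly what is undecided here.

```python
from typing import List, Optional, Set

UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped


def lint_unmapped(sam_line: str,
                  expected_mapq: Optional[Set[int]] = None) -> List[str]:
    """Return advisory warnings for one SAM alignment line.

    expected_mapq is a hypothetical knob: pass e.g. {0} or {255} to
    flag other MAPQ values on unmapped reads; None skips the check.
    """
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("SAM alignment lines have 11 mandatory fields")
    flag, mapq, cigar = int(fields[1]), int(fields[4]), fields[5]

    warnings: List[str] = []
    if not flag & UNMAPPED:
        return warnings  # only unmapped records are checked
    if cigar != "*":
        warnings.append(f"unmapped read has CIGAR {cigar}; '*' recommended")
    if expected_mapq is not None and mapq not in expected_mapq:
        warnings.append(f"unmapped read has MAPQ {mapq}")
    return warnings
```

Keeping the result as a list of warnings rather than raising mirrors the "recommendations (not requirements)" framing: the record is accepted, and the caller decides how loudly to complain.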
As suggested by @jkbonfield in https://github.com/samtools/hts-specs/issues/715#issuecomment-1508097708: