samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

sam: Quality score ambiguity when sequence is a single base #715

Open zaeleus opened 1 year ago

zaeleus commented 1 year ago

This is in regard to the Sequence Alignment/Map Format Specification (2022-08-22) § 1.4 "The alignment section: mandatory fields".

In the following SAM record, the quality scores field (QUAL) is ambiguous.

*   4   *   0   255 *   *   0   0   A   *

Since there is a single base in the sequence, the quality scores field can either be unavailable (*) or represent a single score, [9].
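To make the second reading concrete, here is a minimal Python sketch, assuming the usual Phred+33 encoding of QUAL. The helper name `decode_phred33` is made up for illustration and is not part of any library.

```python
def decode_phred33(qual: str) -> list[int]:
    """Decode a Phred+33 encoded quality string into integer scores."""
    return [ord(c) - 33 for c in qual]


# Read literally, the one-character QUAL field "*" is ASCII 42,
# which decodes to a single score of 42 - 33 = 9.
print(decode_phred33("*"))  # [9]
```

For any sequence longer than one base, a lone `*` cannot be a valid score string, so the ambiguity only arises at length 1.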

jkbonfield commented 1 year ago

This has been a known issue for a long time, although probably not tracked here. I don't think there's anything we can do about it really. Fortunately, it only arises for a length 1 sequence, which doesn't generally happen in the wild, so it's a moot point. Most implementations just take the most probable view, which is to interpret it as unknown, and attempting to remove the ambiguity would turn a harmless issue into a potentially more serious one.

Edit: as an aside, I note you're also using MAPQ of 255 for "unavailable". Commendable, but my experience is that everyone just uses 0 with unmapped data. I think this is because when FLAG 4 is set the specification states no assumption can be made about MAPQ, so it just feels cleaner to zero it out as all other fields have been.

jkbonfield commented 1 year ago

TODO: Add footnote to say a single "*" for length 1 is still "unavailable"
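A rough Python sketch of the rule that footnote would describe, assuming Phred+33 encoding. The function `parse_qual` is hypothetical and is not taken from any existing implementation.

```python
from typing import Optional


def parse_qual(seq: str, qual: str) -> Optional[list[int]]:
    """Return Phred scores, or None when QUAL is unavailable."""
    # A lone "*" always means "unavailable", even when SEQ has length 1.
    if qual == "*":
        return None
    # Assumption: reject scores given without a sequence or of mismatched length.
    if seq == "*" or len(qual) != len(seq):
        raise ValueError("QUAL must be '*' or match the length of SEQ")
    return [ord(c) - 33 for c in qual]


print(parse_qual("A", "*"))  # None
```

The trade-off is that a genuine score of 9 on a length 1 read cannot be expressed in text SAM under this rule.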