samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/
658 stars 172 forks source link

bam: Explain the use-case of the reference information section #713

Open zaeleus opened 1 year ago

zaeleus commented 1 year ago

§ 4.2 "The BAM format" (2022-08-22) describes the layout of a reference information section (n_ref...l_ref). However, there is no explanation on what this is used for or how it should be used, considering the reference sequence dictionary is typically parsed from the SAM header.

jmarshall commented 1 year ago

SAM files and especially BAM files need not have @SQ headers in their textual headers.

For reading a SAM file, this is no real problem, as an implementation could just add a new reference sequence to its internal dictionary when a previously unseen one is encountered in a SAM record. (Whether any implementations really support this mode of operation is unclear, but I believe it is still allowed by the specification.) Or the reference sequence dictionary could be provided externally in conjunction with the SAM file; cf samtools view -T etc.

For a BAM file, this is more of a problem, because the BAM alignment records do not contain the actual names of the reference sequences. Hence the binary header (the n_ref/l_name/name/l_ref fields) must always be present and complete in a BAM file and provides the canonical way to interpret the refID and next_refID indexes in the file's alignment records. As a bonus it also provides the length of each reference sequence.

Hence:

  1. In a BAM file, the binary header provides the canonical ordering of the reference sequences. (If @SQ headers are present in the textual header, as they usually are, they usually appear in the same order as in the binary header. But it is not an error if they appear in a different order, and the order defined by the binary header wins.)

  2. l_text might be zero.

This lore is fairly common knowledge amongst those who were implementing SAM and BAM at the time they were being defined, but you are quite right that it needs to be discussed in the specification.

jkbonfield commented 1 year ago

(Whether any implementations really support this mode of operation is unclear, but I believe it is still allowed by the specification.)

This is how gap5 worked. It just created and extended contigs as it went. Obviously nearly always the header data was available, so I'm not sure why I chose that route. Maybe it was just easier to implement the that way, as it had to cope with user editing, inserting, joining, etc so already needed tools that could dynamically grow contigs and create new binning indices as it went.

Agreed with the issue though - this all needs explicitly stating.