samtools / htslib

C library for high-throughput sequencing data formats

Add MZ:i tag as a check for base modification validity. #1590

Closed jkbonfield closed 1 year ago

jkbonfield commented 1 year ago

If a sequence is hard-clipped after the base modifications have been called, the clipping tool may or may not update the MM and ML tags accordingly, and we have no way of distinguishing these two cases. The base modification parsing code already detects overflows, where the coordinates go beyond the sequence end, but this isn't foolproof, especially if the clipping is short.
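
To illustrate why the overflow check alone can miss short clips, here is a minimal sketch. It is not the htslib parser: `mm_fits` is a hypothetical helper, and it assumes a single forward-strand MM entry such as `C+m,1,1` whose skip counts have already been parsed into an array. The check can only ask whether the sequence still holds enough copies of the base to satisfy the skip counts.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of htslib: a simplified overflow test for
 * one MM entry.  deltas[i] is the number of unmodified `base` occurrences
 * skipped before the i-th modified one, as in "C+m,1,1".
 * Returns 1 if the sequence holds enough `base` occurrences, 0 on overflow.
 */
static int mm_fits(const char *seq, char base, const int *deltas, size_t n)
{
    assert(seq != NULL && (deltas != NULL || n == 0));

    /* Count how many times the base occurs in the sequence. */
    size_t have = 0;
    for (const char *p = seq; *p; p++)
        if (*p == base)
            have++;

    /* Each modification consumes itself plus the bases it skips. */
    size_t need = n;
    for (size_t i = 0; i < n; i++)
        need += (size_t)deltas[i];

    return have >= need;
}
```

With `ACGTCCGTACGC` and skip counts `{1, 1}`, four Cs are needed and five are present. Hard-clipping three bases off the start leaves `TCCGTACGC`, which still has four Cs, so the check passes even though the modifications now point at the wrong bases. Only a longer clip, such as six bases (`GTACGC`, two Cs), trips the overflow.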

So instead we have an (as yet unwritten) proposal for an MZ:i tag holding the sequence length, to be written at the same time as the MM and ML tags. This can then be used as a sanity check later on, to detect cases where the sequence has changed length via a tool that is unaware of base modifications.
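
The check itself amounts to a comparison of two integers. The sketch below is an assumption about its shape, not the code in this PR: `mz_check` and the `mz_status` enum are hypothetical names. In an htslib caller the inputs would presumably come from `bam_aux_get(b, "MZ")` with `bam_aux2i()` and from `b->core.l_qseq`, but they are passed in directly here so the example builds without the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for the sanity check. */
enum mz_status { MZ_ABSENT, MZ_OK, MZ_MISMATCH };

/*
 * Hypothetical helper: compare the MZ:i value recorded when MM/ML were
 * written against the current sequence length.
 *   has_mz   non-zero if the record carries an MZ tag
 *   mz       the tag value (ignored when has_mz is zero)
 *   seq_len  current length of SEQ
 */
static enum mz_status mz_check(int has_mz, int64_t mz, size_t seq_len)
{
    /* Guard the cast below; real read lengths are far smaller than this. */
    assert((uint64_t)seq_len <= (uint64_t)INT64_MAX);

    if (!has_mz)
        return MZ_ABSENT;      /* nothing to validate against */
    if (mz != (int64_t)seq_len)
        return MZ_MISMATCH;    /* SEQ changed length after MM/ML were written */
    return MZ_OK;
}
```

For a read written with `MZ:i:12` and later hard-clipped to 9 bases by a tool that leaves MM, ML and MZ untouched, `mz_check(1, 12, 9)` reports `MZ_MISMATCH`, which is the case the overflow test alone can miss.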

TODO: as a separate PR, we should add a new API that can trim bases off the start/end of MM/ML strings, to make this trivial for tools that do hard clipping via htslib. (Indeed, we don't have such an API for SEQ/QUAL either, so it could handle them all together.) This would make it far easier for people to keep everything in sync, and the code could then also update MZ while it's at it. That's new API though, so it can arrive as a separate commit.
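
As a rough idea of what the trimming half of such an API would need to do, here is a sketch for clipping bases off the start. Everything in it is hypothetical: `mm_trim_start` is not an htslib function, and it assumes a single forward-strand modification type whose MM skip counts and ML probabilities are already parsed into parallel arrays. A real API would also need to cover reverse-strand reads, multiple modification types, SEQ/QUAL, and the MZ update.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: rewrite MM skip counts (deltas) and ML values (probs)
 * in place after `clip` bases are removed from the start of `seq`.
 * `seq` is the sequence BEFORE clipping.  Returns the new entry count.
 */
static size_t mm_trim_start(const char *seq, size_t clip, char base,
                            int *deltas, unsigned char *probs, size_t n)
{
    assert(seq != NULL && deltas != NULL && probs != NULL);

    /* How many occurrences of the base fall inside the clipped prefix? */
    long removed = 0;
    for (size_t i = 0; i < clip && seq[i]; i++)
        if (seq[i] == base)
            removed++;

    long idx = -1;        /* ordinal of the current modified base, pre-clip */
    long prev_kept = -1;  /* ordinal of the last kept modified base, post-clip */
    size_t out = 0;

    for (size_t i = 0; i < n; i++) {
        idx += deltas[i] + 1;       /* undo the delta encoding */
        if (idx < removed)
            continue;               /* modification lies in the clipped region */

        long new_idx = idx - removed;
        deltas[out] = (int)(new_idx - prev_kept - 1);  /* re-encode as a delta */
        probs[out] = probs[i];
        prev_kept = new_idx;
        out++;
    }
    return out;
}
```

For `ACGTCCGTACGC` with skip counts `{1, 1}`, clipping three bases removes one C, so the counts become `{0, 1}` and both ML values survive. Clipping five bases removes two Cs along with the first modification, leaving `{1}` and only the second ML value. The caller would then set MZ to the original length minus `clip`.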

See https://github.com/samtools/hts-specs/issues/646

Edit: I forgot to add that I also tweaked the error messages to be slightly more useful by including the read name, but that's just trivial tweakage.