samtools / htslib

C library for high-throughput sequencing data formats
Other
789 stars 447 forks source link

basemod API and hardclipped alignments #1566

Closed jts closed 1 year ago

jts commented 1 year ago

The MM tag is intended to be with respect to the instrument's basecalled sequence and hence can be copied as-is so aligners (and other tools) do not need to be aware of base modifications. In the case where the aligner hard clips the sequence and copies the MM tag however, the coordinates may no longer be valid. The current implementation of the API will process the tag for hardclipped alignments returning (potentially) invalid coordinates. Here is a small example sam:

@SQ     SN:chr1 LN:248956422
@CO     Basemod reference to hardclipped base
test-read       0       chr1    19931836        60      20H36M  *       0       0       AGGATCTCTAGCGGATCGGCGGGGGATATGCCATAT    *       Mm:Z:C+m,1;     Ml:B:C,12

Using bam_parse_basemod followed by bam_next_mod will result in a single hts_base_mod at position 7 of SEQ. I think it is preferable in this case for bam_parse_basemod to return -1.

Thanks, Jared

jkbonfield commented 1 year ago

Agreed, I think this is a good point.

Note manufacturers can use N as a the fundamental base, to make counting achievable even when there are hard-clips. I suspect I also do the maths incorrectly here though too.

jkbonfield commented 1 year ago

Fixed by #1590, but see also https://github.com/samtools/hts-specs/pull/714.

It's possible the tag name may change, so maybe 1590 merge was premature, but it's not a problem until we do another main release anyway and obviously it's trivial to change.