sam-tags: `MM` field value pattern does not allow ambiguity codes

samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats

http://samtools.github.io/hts-specs/

sam-tags: `MM` field value pattern does not allow ambiguity codes #732

Open zaeleus opened 1 year ago

zaeleus commented 1 year ago

This is in regard to Sequence Alignment/Map Optional Fields Specification (2022-08-17).

The base modifications (MM) field allows each modification to be either a short code or a ChEBI ID. Short codes are constrained to [a-z]+ (i.e., lowercase letters), but the table of "standard common types" lists ambiguity codes that do not match this (i.e., uppercase letters).

Unmodified base	Code	Name
C	C	Ambiguity code; any C mod
T	T	Ambiguity code; any T mod
U	U	Ambiguity code; any U mod
A	A	Ambiguity code; any A mod
G	G	Ambiguity code; any G mod
N	N	Ambiguity code; any mod

jkbonfield commented 1 year ago

The short codes are modified bases, e.g. "m" and "h" for 5mC and 5hmC. It doesn't make any sense to have a base modification from nucleotide to ambiguity code, so I'm not sure I follow this.

We don't support ambiguity codes in the unmodified base component, so we couldn't do MM:Z:Y+h,4; for example, as it wouldn't make sense. "N" covers this case anyway with the different counting regime.

zaeleus commented 1 year ago

I'm referring to the Code column of the standard common types table under the MM description. It defines codes that are uppercased, but the MM field pattern does not allow it: MM:Z:([ACGTUN][-+]([a-z]+|[0-9]+)[.?]?(,[0-9]+)*;)*. I referred to the short code portion as [a-z]+ originally.

The description for ML gives an example of using an ambiguous modification:

"For example MM:Z:C+C,10; ML:B:C,229 indicates a C call with a probability of 90% of having some form of unspecified modification."

See that it uses C as the modification code, which does not match ([a-z]+|[0-9]+).
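
To make the mismatch concrete, here is a minimal sketch that checks a few values against the pattern quoted above, with the `MM:Z:` prefix dropped so only the field value is tested. The function name `matches_current` and the sample values other than `C+C,10;` are illustrative assumptions, not taken from the specification.

```python
import re

# Value portion of the MM pattern as quoted in this comment
# (the "MM:Z:" tag/type prefix is omitted).
CURRENT_MM_VALUE = re.compile(r"([ACGTUN][-+]([a-z]+|[0-9]+)[.?]?(,[0-9]+)*;)*")


def matches_current(value: str) -> bool:
    """Return True if the whole value matches the currently published pattern."""
    return CURRENT_MM_VALUE.fullmatch(value) is not None


# Lowercase short code and numeric ChEBI ID are accepted.
print(matches_current("C+m,5,12,0;"))
print(matches_current("C+76792,3;"))
# The ambiguity-code example from the ML description is rejected.
print(matches_current("C+C,10;"))
```

Under this pattern the first two values match and `C+C,10;` does not, because the uppercase `C` after `+` falls outside both `[a-z]+` and `[0-9]+`.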

jkbonfield commented 1 year ago

Oh wow I'd totally forgotten about that!

Yes, the regexp should be ([a-zACGTUN]+|[0-9]+) for the code portion. Good spot. Thanks :)
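
As a quick sanity check of the suggestion, the sketch below substitutes `([a-zACGTUN]+|[0-9]+)` for the code portion of the full value pattern and tries a few values. The function name `matches_proposed` and the sample values other than `C+C,10;` are illustrative assumptions; this is not the final wording of the specification.

```python
import re

# Value portion of the MM pattern with the code group widened as suggested above.
PROPOSED_MM_VALUE = re.compile(
    r"([ACGTUN][-+]([a-zACGTUN]+|[0-9]+)[.?]?(,[0-9]+)*;)*"
)


def matches_proposed(value: str) -> bool:
    """Return True if the whole value matches the proposed pattern."""
    return PROPOSED_MM_VALUE.fullmatch(value) is not None


# The ambiguity-code example from the ML description now matches.
print(matches_proposed("C+C,10;"))
# Existing lowercase codes keep matching.
print(matches_proposed("C+mh,5,12;"))
# Letters outside a-z and ACGTUN are still rejected.
print(matches_proposed("C+X,1;"))
```

One thing to note about the widened character class as written: it also admits mixed strings such as `mC`. Whether that is intended is not settled in this thread.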