Open zaeleus opened 1 year ago
The short codes are modified bases, so "m" and "h" are 5mC and 5hmC. It doesn't make sense to have a base modification from a nucleotide to an ambiguity code, so I'm not sure I follow this.
We don't support ambiguity codes in the unmodified base component, so we couldn't write MM:Z:Y+h,4; for example, as it wouldn't make sense. "N" covers this case anyway with its different counting regime.
I'm referring to the Code column of the standard common types table under the MM description. It defines codes that are uppercased, but the MM field pattern does not allow it: MM:Z:([ACGTUN][-+]([a-z]+|[0-9]+)[.?]?(,[0-9]+)*;)*. I referred to the short code portion as [a-z]+ originally.
The description for ML gives an example of using an ambiguous modification:
"For example MM:Z:C+C,10; ML:B:C,229 indicates a C call with a probability of 90% of having some form of unspecified modification."
Note that it uses C as the modification code, which does not match ([a-z]+|[0-9]+).
Oh wow I'd totally forgotten about that!
Yes, the regexp should be ([a-zACGTUN]+|[0-9]+) for the code portion. Good spot. Thanks :)
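The mismatch can be checked directly. Below is a minimal Python sketch, assuming the two patterns are used exactly as quoted in this thread; the names SPEC_MM, PROPOSED_MM and is_valid are made up for illustration and are not part of any library or of the specification.

```python
import re

# Pattern as quoted from the 2022-08-17 specification text in this thread.
SPEC_MM = re.compile(r"MM:Z:([ACGTUN][-+]([a-z]+|[0-9]+)[.?]?(,[0-9]+)*;)*")

# Same pattern with the code portion widened as proposed in this thread.
PROPOSED_MM = re.compile(
    r"MM:Z:([ACGTUN][-+]([a-zACGTUN]+|[0-9]+)[.?]?(,[0-9]+)*;)*"
)


def is_valid(pattern, field):
    """Return True if the whole field string matches the given pattern."""
    return pattern.fullmatch(field) is not None


examples = [
    "MM:Z:C+m,5,12;",   # lowercase short code
    "MM:Z:C+C,10;",     # ambiguity code, as in the ML description example
    "MM:Z:C+76792,3;",  # numeric (ChEBI-style) code
]

for field in examples:
    print(field, is_valid(SPEC_MM, field), is_valid(PROPOSED_MM, field))
```

Under these assumptions, only the proposed pattern accepts MM:Z:C+C,10;, while both accept the lowercase and numeric forms.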
This is in regard to Sequence Alignment/Map Optional Fields Specification (2022-08-17).
The base modifications (MM) field allows modifications to be either short codes or a ChEBI ID. Short codes are constrained to [a-z]+ (i.e., lowercase letters), but the table of "standard common types" lists ambiguity codes that do not match this (i.e., uppercase letters).