samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

In the SAM tags, are m5C and m6A RNA modifications represented as "m" and "a," respectively? Is this the same for BAM files? #796

Open Taylorain opened 1 week ago

Taylorain commented 1 week ago

Hi! In the SAM tags, are m5C and m6A RNA modifications represented as "m" and "a," respectively? Is this the same for BAM files?

While using Dorado for methylation detection on my DRS data, I noticed that U was automatically replaced with T in the generated BAM files. When analyzing methylation modifications, should I process this data as DNA or RNA?

How can I extract this methylation modification information? How can I analyze its distribution across the genome? Additionally, I would like to analyze differential methylation modifications. Could you recommend any software for these analyses?

Are the software tools for analyzing DNA and RNA methylation modifications the same?

jkbonfield commented 1 week ago

The contents of the SEQ field and the base modification aux tags are distinct. In BAM, SEQ is stored in nibbles (4 bits per base). It uses an alphabet of =ACMGRSVTWYHKDBN, with = meaning unspecified but matching the reference. You may notice that the indices into this string have A=1, C=2, G=4 and T=8, and hence all the other IUPAC ambiguity codes are simply bit-encoded combinations of those, with N = 15. A consequence is that there is no room for U, so any U has to be converted to T.
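To make the nibble encoding concrete, here is a minimal Python sketch of it. The function names `encode_seq` and `decode_seq` are made up for illustration and are not part of htslib or any other library; the sketch assumes two bases per byte with the first base in the high nibble, and maps any character outside the alphabet to N.

```python
# BAM 4-bit base codes: the index into this string is the nibble value.
BAM_ALPHABET = "=ACMGRSVTWYHKDBN"


def encode_seq(seq):
    """Pack a sequence into BAM-style nibbles (hypothetical helper)."""
    # There is no nibble code for U, so it is stored as T.
    seq = seq.upper().replace("U", "T")
    # Characters outside the alphabet fall back to N (code 15).
    codes = [BAM_ALPHABET.index(b) if b in BAM_ALPHABET else 15 for b in seq]
    if len(codes) % 2:
        codes.append(0)  # pad the unused low nibble of the last byte
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))


def decode_seq(data, length):
    """Unpack `length` bases from BAM-style nibbles (hypothetical helper)."""
    out = []
    for i in range(length):
        byte = data[i // 2]
        code = byte >> 4 if i % 2 == 0 else byte & 0x0F
        out.append(BAM_ALPHABET[code])
    return "".join(out)


# Ambiguity codes are bitwise ORs of A=1, C=2, G=4, T=8.
assert BAM_ALPHABET[1 | 4] == "R"          # A or G
assert BAM_ALPHABET[1 | 2 | 4 | 8] == "N"  # any base

packed = encode_seq("ACGU")
print(packed.hex())           # 1248
print(decode_seq(packed, 4))  # ACGT
```

The round trip shows why a U in the input comes back as T: the information is lost at encoding time, not added by the decoder.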

SAM is simply ASCII. It doesn't really define the character set used, so "U" could technically exist, and even the case is (explicitly) undefined, but BAM obviously limits this (as does htslib, since it also uses the BAM nibble encoding for its internal memory representation, though other implementations may work differently). This is why we had to move base modifications to a separate side channel, namely the aux tags.

If your data is RNA, even though the stored sequence is T you may wish to use RNA-based software, as the difference is more than just U vs T (e.g. single vs double stranded), but I cannot advise what works and what doesn't, and I don't know the tools at all. Hopefully someone else is following the discussions here and has a better grasp of the software available for analysis, but it's outside the remit (and personally speaking, knowledge) of the specification maintainers.

Taylorain commented 1 week ago

For RNA data in a SAM file, can I interpret "m" and "a" as representing m5C and m6A modifications, respectively? For example, in the tag MM:Z:A+a?,5,5,4, can I consider this as a description of m6A information?

jkbonfield commented 1 week ago

Yes, see the table in the SAMtags document. Specifically this line:

https://github.com/samtools/hts-specs/blob/master/SAMtags.tex#L562
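As a rough illustration of how a value such as `A+a?,5,5,4` maps onto read positions, here is a small Python sketch. The helpers `parse_mm` and `modified_positions` are hypothetical names, not an existing API. It assumes the SAMtags layout of fundamental base, strand, modification code(s), an optional `.` or `?` flag, then comma-separated skip counts, with each entry normally terminated by `;`. It also assumes SEQ is in its original (not reverse-complemented) orientation and ignores the special fundamental base `N`.

```python
def parse_mm(mm_value):
    """Split an MM value into (base, strand, code, flag, skips) tuples."""
    entries = []
    for part in mm_value.split(";"):
        if not part:
            continue  # skip the empty piece after a trailing ';'
        head, *counts = part.split(",")
        base, strand, rest = head[0], head[1], head[2:]
        flag = ""
        if rest and rest[-1] in ".?":
            flag, rest = rest[-1], rest[:-1]
        entries.append((base, strand, rest, flag, [int(c) for c in counts]))
    return entries


def modified_positions(seq, base, skips):
    """Return 0-based positions in seq of the bases flagged as modified.

    Each skip count says how many occurrences of `base` to pass over
    before the next modified one.
    """
    positions = []
    remaining = iter(skips)
    to_skip = next(remaining, None)
    for pos, b in enumerate(seq):
        if to_skip is None:
            break  # all skip counts consumed
        if b != base:
            continue
        if to_skip == 0:
            positions.append(pos)
            to_skip = next(remaining, None)
        else:
            to_skip -= 1
    return positions


base, strand, code, flag, skips = parse_mm("A+a?,5,5,4;")[0]
print(base, strand, code, flag, skips)  # A + a ? [5, 5, 4]

# In a read made only of A, the 6th, 12th and 17th A are flagged.
print(modified_positions("A" * 18, base, skips))  # [5, 11, 16]
```

For RNA reads the modification on uridine would be looked up against T, since that is what SEQ holds. The matching ML tag, which carries the per-call probabilities, is not handled here.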

Taylorain commented 6 days ago

I get it, thanks very much.