nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Interpretation of MM:Z:A+a? tag from m6A basecaller model #686

Closed sarah-ku closed 6 months ago

sarah-ku commented 6 months ago

Hope you don't mind us asking here about the output from the m6A methylation model.

We converted our DRS (rna004 kit) data from fast5 to pod5 and tested out the command:

dorado basecaller sup,m6A_DRACH data.pod5 > output.bam

Here is an example of one of the entries from the resulting BAM file:

414b0961-3412-4660-a4f1-118de6d22719 4 * 0 0 * * 0 0 GGCGAGCAGGGAGGCAAAGCTCGCGCCAAGGCCAAGACCCGCTCTTCTCGGGCCGGGCTCAGTTTCCCGTGGGGGCCGAGTGCATCGCCTGCTCCGCAAAGGCAACTGCACGGCGGAGCGGGTGCTGGAGCTCCGGTGTCCCTGGCGGCGGTGCTGGAGTACCTGACCATCGAGATCCTGGAGCTGGCTGGCAACGCGGCCGCGACAACAAGAAGAATTCGTCATCATCCCCGCGCACCTCGAGCTGGCCATCCGCAACGATGAGGAGCTCAACAAGCTTCTGGGCAAGTCATACATGGTGGCGTCCTGCCCAACATCCAGGCCGTGCTACTGCCCAAGAAGACCGAGAGCCAAGGCGGGCAAGTAGAAGCCTGGATTAGTTTGCAGCAACTCAATCCCAAGGAACCAAAGGCTCAGAGCCTTGGGGTGGCCCCAGCCCCCACCCCCGCCCTACAACTTATCAGCCCATATCAACCCTGCCCCCTCCCCCTCGCCCCCTCGCCCTCTCAAAACACCCC ((((???==>;<9101869844343196=A?AB67778@+*)(*)(),/111966.(($$$%),.0)))(((+121113447889=;5;85555:=@A3=?9481&&%%%%%(-''())6=>>=>@C**)+.000//+,2;9CC98;;<?6).>JA:?87753:*))))*5////389;<567:66678;:@=@;94&%(34>99889888.%$$$$%&&&+11676(.75.'''''''''(738=73353=?CAAC99858:6<@:9?BED=7>6876769<<42'$####$%((,11123,+..07<<39<==76444//5775440231165<<:=8>6FGB884-*+**(%&(+/035666400011>>?@=>=<<;;::8//.../))-/,1((3379433-45*'&'(+66)('()2*$$#""###$$%&&%&&'*($%&&%%&%')--+++)())('**),+%%('$##$$%')&&&)'(&$###$&$##$#$%)%$###"##$$%')'%$ qs:i:9 du:f:5.71875 ns:i:22875 ts:i:0 mx:i:3 ch:i:88 st:Z:2024-02-27T06:33:59.408+00:00 rn:i:31356 fn:Z:output6.pod5 sm:f:79.4189 sd:f:19.3945 sv:Z:quantile dx:i:0 RG:Z:c884b8754e91b8b445f4f47f572cedd4a0678cca_rna004_130bps_sup@v3.0.1 MN:i:518 MM:Z:A+a?,10,13,43,19,19; ML:B:C,0,1,1,1,199

My questions are:

We are using the required rna004 kit for this model and running on converted pod5 files on a GPU server.

HalfPhoton commented 6 months ago

Here's a link to the SAM tags documentation

The definition of the MM numeric values (emphasis on "to skip"):

... comma separated list of how many seq bases of the stated base type to skip, stored as a delta to the last and starting with 0 as the first (or next) base, ...

MM:Z:A+a?,10 means skip 10 "fundamental" A bases, so the first call is on the 11th A in the read. The values are deltas, not positions, so 19 appearing twice here is a coincidence.

The definition of '?':

When this flag is ‘?’ there is no information about the modification status of the skipped bases provided.
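To make the delta encoding concrete, here is a minimal Python sketch that turns MM skip counts into positions. The function names `mm_deltas_to_ordinals` and `mm_positions` are hypothetical helpers written for illustration and are not part of dorado, modkit or any SAM library. The sketch assumes a `+` strand entry and a SEQ stored as basecalled, which holds for the unmapped read (flag 4) shown above; the short sequence in the second call is a toy example, not real data.

```python
def mm_deltas_to_ordinals(deltas):
    """Convert MM skip counts into 0-based ordinals among bases of the stated type.

    Each delta says how many bases of that type to skip before the next call,
    counted from just after the previous call.
    """
    ordinals = []
    idx = -1  # nothing consumed yet
    for delta in deltas:
        idx += delta + 1  # skip `delta` bases, then land on the next one
        ordinals.append(idx)
    return ordinals


def mm_positions(seq, base, deltas):
    """Map MM skip counts to 0-based positions in `seq` (assumes '+' strand)."""
    base_positions = [i for i, b in enumerate(seq) if b == base]
    return [base_positions[k] for k in mm_deltas_to_ordinals(deltas)]


# Deltas from the read above: MM:Z:A+a?,10,13,43,19,19;
print(mm_deltas_to_ordinals([10, 13, 43, 19, 19]))  # [10, 24, 68, 88, 108]

# Toy sequence (hypothetical): A bases sit at positions 0, 3, 5, 6 and 8.
print(mm_positions("AGCATAAGA", "A", [1, 0, 1]))  # [3, 5, 8]
```

In 1-based terms, the five calls in the example read fall on the 11th, 25th, 69th, 89th and 109th A. The two 19s simply mean that 19 A bases were skipped before each of the last two calls.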

Kind regards, Rich

ArtRand commented 6 months ago

Hello @sarah-ku,

The MM and ML tags are difficult to interpret by eye. I recommend using modkit extract (docs) to transform the MM/ML tags into a table. @HalfPhoton is correct that the MM values are the number of "skips", not the actual positions. The ML scores are probability bins from 0 to 255, so 0 is the lowest probability of modification, and it's reasonable for multiple positions to have the same predicted probability.
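As a rough illustration of the ML binning, the sketch below converts an ML byte into the probability interval it represents, following the SAM tags specification, where an integer N covers the range N/256 to (N+1)/256. The helper name `ml_to_probability` is hypothetical and is not a modkit or dorado function.

```python
def ml_to_probability(ml_value):
    """Return the (low, high) probability bounds encoded by an ML byte (0-255)."""
    if not 0 <= ml_value <= 255:
        raise ValueError("ML values must be in the range 0-255")
    return ml_value / 256, (ml_value + 1) / 256


# ML values from the read above: ML:B:C,0,1,1,1,199
for value in [0, 1, 1, 1, 199]:
    low, high = ml_to_probability(value)
    print(f"ML={value}: probability in [{low:.4f}, {high:.4f})")
```

Under this reading, only the last call in the example read (ML 199, roughly 0.78) indicates a likely m6A; the other four calls sit in the lowest bins.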