nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Question regarding the new dna_r10.4.1_e8.2_400bps_sup@v5.0.0 Model #882

Open AzlanNI opened 2 weeks ago

AzlanNI commented 2 weeks ago

Hello Everyone!

I saw that dorado now has a model that detects 4mC and 5mC in addition to 6mA, so I wanted to try it on my data and then use modkit to summarize the results. My question: if I use dna_r10.4.1_e8.2_400bps_sup@v5.0.0_4mC_5mC@v1 as the basecalling model, will it just sum up the methylation calls for Cs? For instance, if 20% of reads have 5mC and 30% have 4mC at a particular position, would it show 50% for that position? Or would I get two columns for the same position, with 20% for 5mC and 30% for 4mC?

Thank you in advance for your help, and kind regards, Azlan

ArtRand commented 2 weeks ago

Hello @AzlanNI,

When you use the dna_r10.4.1_e8.2_400bps_sup@v5.0.0_4mC_5mC modified base model, every C in a sequencing read will have an associated probability of 5mC, 4mC, and canonical ($1 - p_{\text{5mC}} - p_{\text{4mC}}$). When you then run modkit pileup (or modkit extract with --read-calls), these probabilities are converted into base modification "calls" (i.e. classifications) by the filtering algorithm. That algorithm may seem complicated, but under most circumstances it just picks the modification state with the highest probability, or filters out that site if the probability isn't high enough, meaning the model isn't confident in the prediction. The table produced by modkit pileup counts the number of reads that had each modification state at each genomic position, so 5mC and 4mC are counted separately rather than summed. The schema for the table is in the modkit documentation and is nicely compatible with most genome viewers. I hope this answers your question; please let me know if it doesn't.
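To make the "pick the most probable state, then count per position" idea concrete, here is a minimal Python sketch. It is not modkit's implementation: the function names, the fixed cutoff of 0.8, and the example probabilities are all made up for illustration (modkit chooses its own threshold), and computing percentages over passing reads only is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical confidence cutoff, for illustration only. modkit determines
# its own filtering threshold; that logic is not reproduced here.
PASS_THRESHOLD = 0.8


def call_state(p_5mc, p_4mc, threshold=PASS_THRESHOLD):
    """Return the most probable state for one read's C, or None if filtered."""
    probs = {
        "5mC": p_5mc,
        "4mC": p_4mc,
        "canonical": 1.0 - p_5mc - p_4mc,
    }
    state = max(probs, key=probs.get)
    # Drop the call when even the best state is not confident enough.
    return state if probs[state] >= threshold else None


def tally_position(read_probs):
    """Count calls per state for all reads covering one genomic position."""
    counts = Counter()
    for p_5mc, p_4mc in read_probs:
        state = call_state(p_5mc, p_4mc)
        counts[state if state is not None else "filtered"] += 1
    return counts


# Ten made-up reads over one position, given as (p_5mC, p_4mC) pairs.
reads = (
    [(0.95, 0.02)] * 2    # confidently 5mC
    + [(0.03, 0.93)] * 3  # confidently 4mC
    + [(0.02, 0.03)] * 4  # confidently canonical
    + [(0.45, 0.40)]      # ambiguous, falls below the cutoff
)

counts = tally_position(reads)
passing = sum(n for state, n in counts.items() if state != "filtered")

for state in ("5mC", "4mC", "canonical"):
    print(f"{state}: {counts[state]} reads, {100 * counts[state] / passing:.1f}% of passing reads")
print(f"filtered: {counts['filtered']} reads")
```

The point of the sketch is that each read contributes to exactly one state, so 5mC and 4mC get their own counts and percentages at a position instead of being added together.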