nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/
Other
136 stars 7 forks source link

Where does P(canonical) come from? #219

Open lkwhite opened 3 months ago

lkwhite commented 3 months ago

I think I get where the modification probabilities are coming from (ML values / 256). But if I pass a filter that's being used for both modified and canonical calls, where does the canonical base probability values come from? Are these converted from Phred-based QUAL scores in the BAM? Or in a single modification scenario, are they 1 - P(modified)?

This may be intuitive to folks who've been doing DNA methylation work for a while, but as an RNA biologist just starting to assess Dorado modification calls, I'm starting closer to square one here.

ArtRand commented 3 months ago

Hello @lkwhite,

Unfortunately how $P_{\text{canonical}}$ is calculated is buried in the SAM specification see page 9 under the ML tag specification.

Are these converted from Phred-based QUAL scores in the BAM? Or in a single modification scenario, are they 1 - P(modified)?

Correct so long as you are referring to the ML tag QUAL scores (not base base call QUAL scores).

More concretely, the ML tag has the probability the model assigned to the base having a certain modification. If there are multiple potential modified states, there will be multiple probabilities. The canonical probability is then

$$ P{\text{canonical}} = 1 - \sum{m \in \textbf{M}} P_{m} $$

where $\textbf{M}$ is the set of all of the potential modifications for the base. For example, if the $P{\text{m6A}} = 0.9$ then the probability of canonical $\text{A}$ is $P{\text{canonical}} = 1 - P_{\text{m6A}} = 0.1$.

I've been meaning to write a "tl;dr" getting started with base modifications in the documentation so I'll be sure to add how these calculations are done.

lkwhite commented 3 months ago

Thanks @ArtRand for the quick feedback. I read through that section of the spec and then through various ModKit documentation and GH issues over the past week but it's a lot to take in, so some kind of 101 would be really useful.