nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/
Other
116 stars 6 forks source link

Does pileup always output a value for a covered base? #200

Closed Ge0rges closed 3 weeks ago

Ge0rges commented 3 weeks ago

Hi,

Just a simple question (for once), I wanted to ensure I understand the output of pileup correctly. For any given location/nucleotide, if there is no methylation but there is coverage of that base, will N canonical always be non-zero?

ArtRand commented 3 weeks ago

Hello @Ge0rges,

Any position with valid coverage >0 will have a bedMethyl record.

For any given location/nucleotide, if there is no methylation but there is coverage of that base, will N canonical always be non-zero?

I would say "For any given location/nucleotide, if there is no methylation but there is coverage of that base and the modification call passes the confidence threshold, will N canonical always be non-zero?"

Let me know if you find this to not be the case.

Ge0rges commented 3 weeks ago

So just to double check, even if a base is not methylated but covered it could not show up if the modification call isn't above the threshold?

Would it be incorrect to treat covered bases below the threshold as unmethylated? I guess unknown is different that unmethylated...

ArtRand commented 3 weeks ago

@Ge0rges

A potential confusion is that modkit does not assume a base is canonical if the probability of modification is close to $\frac{1}{N_{\text{classes}}}$. A modification call can be canonical or any of the potential modification codes/states. So if a base is covered, but the call is canonical, there is a confidence/probability associated with that canonical call. If that canonical probability is below the pass threshold, it will be filtered out.

For example (take a simple 5mC/canonical example):

$$ P{\text{5mC}} = 0.7, P{\text{5hmC}} = 0.2, P{\text{canonical}} = 1 - P{\text{5mC}} + P_{\text{5hmC}} = 0.1, \text{Call} \Rightarrow \text{5mC} $$

However if the threshold is 0.9, then this call will fail not be considered canonical. There are more examples in the documentation.

If a read is filtered out, it adds a count to column 16 not $N{\text{valid cov}}$. If a row has $N{\text{valid cov}}$ of zero, it's not included in the output. You can see this easily when you run modkit pileup with --no-filtering and --filter-threshold 0.95 (or higher) and count the rows.

You can, however, force a default to canonical call by setting the thresholds specifically for each modified base --mod-threshold h:0.95 --mod-threshold m:0.95 --filter-threshold C:0.05 in this case what happens is the canonical probability (0.1) is still above the requirement for the base (0.05) and a canonical call is emitted. Using the thresholds this way is only advised when you have a very strong prior that bases should be canonical unless there is strong evidence of modification.