nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/

Setting appropriate thresholds for calling m6A and pseU using modkit pileup - What to make of high Nfail numbers? #255

Open DepledgeLab opened 2 weeks ago

DepledgeLab commented 2 weeks ago

Hello,

After spending a significant amount of time going through the various modkit threads on how best to set the mod-thresholds and filter-thresholds values, I am still uncertain as to the underlying logic.

Using a standard RNA004 DRS dataset that was basecalled for m6A with Dorado v0.7.0 (sup mode, all context m6A), I ran modkit sample-probs and plotted the output of the threshold.tsv file to get the following.

[image: histogram of per-call m6A probabilities plotted from the threshold.tsv output of modkit sample-probs]

From this I would infer the following parameters when running modkit pileup: --filter-threshold A:0.85 --mod-thresholds a:0.985
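For concreteness, those inferred values would be passed to modkit pileup along these lines (the BAM and bedMethyl paths here are placeholders, not files from this thread):

```shell
# Hypothetical invocation; substitute your own aligned, sorted BAM and output path.
modkit pileup reads.sorted.bam pileup.bed \
  --filter-threshold A:0.85 \
  --mod-thresholds a:0.985
```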

This seems to work well enough, but when I apply this to other datasets using the same logic (i.e. plotting) to set these cutoffs, I find the underlying distribution of m6A can look very different.

[image: the same probability histogram for a second dataset, showing a noticeably different m6A distribution]

Keeping the same thresholds as above produces an output where the Nfail counts are very high (often as high as, or higher than, the valid_cov). I am loath to lower the thresholds because in IVT datasets I see the same distribution for m6A except for the small peak at > 0.98.

I interpret all this as the model simply being unable to classify a given basecall as an A or m6A and reporting it as a third state (unknown) that is not included in any downstream calculations (e.g. stoichiometry).

My first question then is quite simply, am I understanding this correctly?

My second question: stoichiometry calculations are quite important, and I am not sure I understand the logic of excluding the Nfail counts from them. Excluding these surely relies on the assumption that their 'unknown' state reflects an equal chance of them being m6A or A, but it is not clear to me that this is the case, especially as m6A is considerably rarer than A. In other words, is there not a risk that excluding Nfail from the stoichiometry calculation artificially inflates the stoichiometry estimates (i.e. if the majority of Nfail bases are more likely to be A than m6A)?
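The worry can be made concrete with toy numbers (entirely hypothetical, not taken from any dataset in this thread): if the failed calls are in truth mostly canonical A, the fraction computed over passing calls alone sits above the true fraction.

```python
# Hypothetical counts at a single reference position (illustrative only):
n_mod_pass = 5         # confident m6A calls (probability >= mod threshold)
n_canonical_pass = 60  # confident canonical A calls (probability >= filter threshold)
n_fail = 35            # low-confidence calls excluded from the pileup counts

# Stoichiometry with Nfail excluded, as in the pileup output:
reported = n_mod_pass / (n_mod_pass + n_canonical_pass)

# If, in truth, all of the failed calls were canonical A (plausible when m6A
# is rare), the real modified fraction would be lower:
if_fails_are_A = n_mod_pass / (n_mod_pass + n_canonical_pass + n_fail)

print(f"reported: {reported:.3f}")              # 0.077
print(f"if all fails are A: {if_fails_are_A:.3f}")  # 0.050
```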

ArtRand commented 1 week ago

Hello @DepledgeLab,

> I interpret all this as the model simply being unable to classify a given basecall as an A or m6A and reporting it as a third state (unknown) that is not included in any downstream calculations (e.g. stoichiometry).

This isn't quite correct: the model outputs a probability for each class (m6A and canonical). I would interpret low-probability calls as low confidence, not as a third state. I appreciate that this may be a trivial detail.
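As a sketch of this pass/fail logic (my reading of the thresholding described above, not modkit's actual implementation): every call carries a probability for each class, the call is assigned to its highest-probability class, and it must clear that class's threshold to be counted; otherwise it lands in Nfail.

```python
def classify_call(p_m6a: float, filter_threshold: float = 0.85,
                  mod_threshold: float = 0.985) -> str:
    """Toy classifier: the model emits P(m6A); P(A) = 1 - P(m6A).
    A call predicted canonical must clear filter_threshold, a call
    predicted m6A must clear mod_threshold; otherwise it is 'fail'."""
    p_a = 1.0 - p_m6a
    if p_m6a >= p_a:  # predicted m6A
        return "m6A" if p_m6a >= mod_threshold else "fail"
    else:             # predicted canonical A
        return "A" if p_a >= filter_threshold else "fail"

# "fail" means low confidence in either class, not a third model output.
print(classify_call(0.99))  # m6A  (0.99 clears 0.985)
print(classify_call(0.90))  # fail (predicted m6A but below 0.985)
print(classify_call(0.05))  # A    (P(A) = 0.95 clears 0.85)
print(classify_call(0.40))  # fail (P(A) = 0.60 below 0.85)
```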

To me the second distribution you've shown here looks like there are very few m6A residues in the sample and the distribution is dominated by the false positive probabilities.

> Excluding these surely relies on the assumption that their 'unknown' state reflects an equal chance of them being m6A or A, but it is not clear to me that this is the case, especially as m6A is considerably rarer than A.

Assigning a modification state to calls that fail to pass the thresholds is challenging and depends on the experimental setup. As you've mentioned, assigning a 50% probability to m6A and A will probably bias your estimate. One thing you could look at is the number of m6A calls versus canonical A calls (use modkit summary). Without knowing more of the experimental details, it's hard for me to give a strong recommendation.
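One way to make that uncertainty explicit, rather than picking a single reassignment rule, is to bracket the stoichiometry between the two extreme assignments of the failed calls. A small sketch using the same hypothetical counts as above (all numbers invented for illustration):

```python
def stoichiometry_bounds(n_mod_pass: int, n_canon_pass: int, n_fail: int):
    """Bracket the modified fraction between the two extreme
    assignments of the failed calls (all canonical vs. all modified)."""
    total = n_mod_pass + n_canon_pass + n_fail
    point = n_mod_pass / (n_mod_pass + n_canon_pass)  # fails excluded
    low = n_mod_pass / total                          # all fails canonical A
    high = (n_mod_pass + n_fail) / total              # all fails m6A
    return low, point, high

low, point, high = stoichiometry_bounds(5, 60, 35)
print(f"{low:.3f} <= true fraction <= {high:.3f} (fails excluded: {point:.3f})")
# 0.050 <= true fraction <= 0.400 (fails excluded: 0.077)
```

A wide interval like this is itself informative: it flags positions where the Nfail count is large enough that the reported stoichiometry should be treated with caution.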