Hello @Adinivich,
Since the thresholds change methylation status so much, what do you recommend as --filter-threshold and --mod-thresholds (if that even plays a role in combination with --filter-threshold)? Is there a standard threshold for base and/or modification? (NB I tried to run modkit sample-probs bam --no-sampling but it demanded too much memory)
In general we recommend using the estimated threshold values. An estimated threshold of 0.4 says to me that the model (I'm assuming 5hmC/5mC, three-mod model) isn't very confident in its predictions. What you could do is increase --filter-percentile. The default is 0.1, meaning discard the 10% lowest-confidence calls; you could try increasing the value to 0.15. That leads to the next point. If you run modkit sample-probs with --no-sampling, it's going to churn through every read in the BAM. It shouldn't demand too much memory for this, but depending on the depth you have and your compute infrastructure I could see it being a problem. What I would do is look at a genome browser and run modkit sample-probs --region ${region} --hist --out-dir ${path_to_histograms}, where the region is a section where you see likely methylation. Then I would inspect these probability distributions. I'd be happy to look at these with you. This will help guide which --filter-percentile to use. What percent modification do you get with the estimated value (0.4)?
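To make the percentile idea concrete, here is a minimal Python sketch of how a threshold could be read off a sample of call confidences. It is not modkit's implementation: the function name `estimate_threshold`, the nearest-rank rule, and the confidence values are all assumptions made up for illustration.

```python
def estimate_threshold(confidences, filter_percentile=0.1):
    """Return the confidence at the given percentile of the sample.

    Calls with a confidence below the returned value make up roughly the
    lowest `filter_percentile` fraction of the sample (nearest-rank rule,
    chosen here for simplicity; modkit may interpolate differently).
    """
    if not confidences:
        raise ValueError("need at least one call confidence")
    if not 0.0 <= filter_percentile < 1.0:
        raise ValueError("filter_percentile must be in [0, 1)")
    ordered = sorted(confidences)
    # Small epsilon guards against floating-point products such as 2.9999999.
    index = int(filter_percentile * len(ordered) + 1e-9)
    return ordered[min(index, len(ordered) - 1)]


# Hypothetical per-call confidences (probability of the most likely class).
confidences = [0.95, 0.40, 0.88, 0.99, 0.55, 0.97, 0.35, 0.91, 0.60, 0.93]

low = estimate_threshold(confidences, 0.1)
high = estimate_threshold(confidences, 0.2)
print(low, high)
```

Raising the percentile can only keep the threshold the same or push it up, which matches the direction you would expect when moving from 0.1 to 0.15.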
Do I find higher average methylation with --filter-threshold 0.9 than 0.7 because modified bases have a higher probability to pass?
Yes. We've seen with the latest v5 models that the modified base probability is generally higher for 5mC than for canonical bases. So when you increase --filter-threshold to 0.9, you're biasing towards keeping modified calls. (The next set of models will hopefully solve this problem.)
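A toy Python example shows the effect. It assumes a single threshold applied to every call and a pass rule of `confidence >= threshold`. Both are simplifications rather than a description of modkit internals, and the call data and the helper `percent_modified` are invented for illustration.

```python
def percent_modified(calls, filter_threshold):
    """Percent of passing calls that are modified.

    `calls` is a list of (is_modified, confidence) pairs. A call passes
    when its confidence is at least `filter_threshold` (assumed rule).
    """
    passing = [(is_mod, conf) for is_mod, conf in calls if conf >= filter_threshold]
    if not passing:
        raise ValueError("no calls pass the threshold")
    n_modified = sum(1 for is_mod, _ in passing if is_mod)
    return 100.0 * n_modified / len(passing)


# Hypothetical calls: canonical calls are spread over lower confidences,
# modified calls cluster at high confidence.
canonical = [(False, c) for c in (0.55, 0.60, 0.65, 0.72, 0.75, 0.80, 0.85, 0.92)]
modified = [(True, c) for c in (0.75, 0.93, 0.95, 0.97)]
calls = canonical + modified

for threshold in (0.0, 0.7, 0.9):
    print(threshold, round(percent_modified(calls, threshold), 1))
```

In this made-up sample the true modified fraction is 4 of 12 calls, but the reported percentage climbs as the threshold rises because canonical calls are filtered out faster than modified ones.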
Does an estimated threshold of 0.4 mean that modkit is only 40% sure of the base or is it more complicated than that?
Basically yes.
Circling back, if you're seeing a 10th percentile probability of ~0.4, it sounds like there are a bunch of sites where the model isn't very confident in the modification status. This could happen for a variety of reasons. I think digging into the modification probabilities with sample-probs or extract while looking at the reads themselves is probably a good first step in exploratory data analysis.
Hello ArtRand,
Thanks for your thorough answer!
Out of curiosity I ran modkit pileup directly with --filter-percentile 0.15 on the CG model (the lightest BAM), and the threshold is now 0.53125, already higher than before (0.46679688).
The average methylation values for modkit pileup (1) without thresholds, (2) with --filter-threshold (& --mod-thresholds) 0.7, and (3) with --filter-threshold (& --mod-thresholds) 0.9 are the following:
| Modification | no-threshold | 0.7 / 0.7 | 0.9 / 0.9 |
| -- | -- | -- | -- |
| 5mC | 17.9541 | 17.8887 | 57.0815 |
| 5hmC | 2.06652 | 1.61257 | 2.94464 |
| 5mCG | 24.6057 | 28.2148 | 77.698 |
| 5hmCG | 3.4495 | 2.77225 | 4.55419 |
| 6mA | 4.64895 | 4.44633 | 2.98148 |
Hi,
I am back for another question, since I got such excellent help last time :)
I am analysing the methylation of my Cytosines and trying to decide on a threshold. I am working with a species of green algae with no known methylation profile yet.
Running the program without the --filter-threshold flag, the default threshold is 0.4-0.5 depending on the model, and modkit pileup reports that this is very low. However, I am a bit confused about which threshold to choose now.

I set thresholds both for base (--filter-threshold) and for modification (--mod-thresholds); I tried 0.7 and 0.9 by way of exploration. It seems that it is the --filter-threshold that is limiting: by increasing the --filter-threshold I also increase average methylation, while by increasing the --mod-thresholds I get lower average methylation, and a combination of the 2 flags gives me the same high values as the --filter-threshold only. The results I get with --filter-threshold 0.7 seem reasonable compared to other green algae (e.g. 28% methylation on 5mCG), while with --filter-threshold 0.9 I get extreme values (e.g. 78% methylation on 5mCG). However, methylation in this species being unknown, it could be 78% for all I know.

Thus I have some questions, the first of which is the most important:
1. Since the thresholds change methylation status so much, what do you recommend as --filter-threshold and --mod-thresholds (if that even plays a role in combination with --filter-threshold)? Is there a standard threshold for base and/or modification? (NB I tried to run modkit sample-probs bam --no-sampling but it demanded too much memory)
2. Do I find higher average methylation with --filter-threshold 0.9 than 0.7 because modified bases have a higher probability to pass?
3. Does an estimated threshold of 0.4 mean that modkit is only 40% sure of the base or is it more complicated than that?

As a side note, basecalling was done with Dorado 0.7.2, which already has its own basecalling threshold if I am not mistaken, although I am not sure how high this threshold is.
Thanks in advance!