Which thresholds to choose?

Adinivich commented 3 months ago

Hi,

I am back for another question, since I got such excellent help last time :)

I am analysing the methylation of my Cytosines and trying to decide on a threshold. I am working with a species of green algae with no known methylation profile yet.

Running the program without the --filter-threshold flag, the default threshold is 0.4-0.5 depending on the model and modkit pileup reports that this is very low. However, I am a bit confused about which threshold to choose now.

I set thresholds both for base (--filter-threshold) and for modification (--mod-thresholds), I tried 0.7 and 0.9 by way of exploration. It seems that it is the --filter-threshold that is limiting: by increasing the --filter-threshold I also increase average methylation while by increasing the --mod-thresholds I get lower average methylation, while a combination of the 2 flags gives me the same high values as the --filter-threshold only. The results I get with --filter-threshold 0.7 seem reasonable compared to other green algae (ex: 28% methylation on 5mCG) while with --filter-threshold 0.9 I get extreme values (ex: 78% methylation on 5mCG). However, methylation in this species being unknown, it could be 78% for all I know.

Thus I have some questions, the first of which is the more important:

Since the thresholds change methylation status so much, what do you recommend as --filter-threshold and --mod-thresholds (if that even plays a role in combination with --filter-threshold)? Is there a standard threshold for base and/or modification? (NB I tried to run modkit sample-probs bam --no-sampling but it demanded too much memory)
Do I find higher average methylation with --filter-threshold 0.9 than 0.7 because modified bases have a higher probability to pass?
Does an estimated threshold of 0.4 mean that modkit is only 40% sure of the base or is it more complicated than that?

As a side note, basecalling was done with Dorado 0.7.2 which already has its own basecalling threshold if I am not mistaken, although I am not sure how high this threshold is.

Thanks in advance!

ArtRand commented 3 months ago

Hello @Adinivich,

Since the thresholds change methylation status so much, what do you recommend as --filter-threshold and --mod-thresholds (if that even plays a role in combination with --filter-threshold)? Is there a standard threshold for base and/or modification? (NB I tried to run modkit sample-probs bam --no-sampling but it demanded too much memory)

In general we recommend using the estimated threshold values. An estimated threshold of 0.4 says to me that the model (I'm assuming 5hmC/5mC, three-mod model) isn't very confident in it's predictions. What you could do is increase the --filter-percentile, the default is 0.1 meaning discard the 10% lowest confidence calls, you could try increasing the value to 0.15 - leading to the next point. If you run modkit sample-probs with --no-sampling it's going to churn through every read in the BAM, it shouldn't demand too much memory for this, but depending on the depth you have and the compute infrastructure I could see it being a problem. What I would do is look at a genome browser and run modkit sample-probs --region ${region} --hist --out-dir ${path_to_histograms} where the region is a section where you see likely methylation. Then I would inspect these probability distributions. I'd be happy to look at these with you. This will help guide which --filter-percentile to use. What percent modification do you get with the estimated value (0.4)?

Do I find higher average methylation with --filter-threshold 0.9 than 0.7 because modified bases have a higher probability to pass?

Yes. We've seen with the lastest v5 models that the modified base probability is generally higher for 5mC than canonical bases. So what's happening when you increase --filter-threshold 0.9, you're biasing towards keeping modified calls. (The next set of models will hopefully solve this problem).

Does an estimated threshold of 0.4 mean that modkit is only 40% sure of the base or is it more complicated than that?

Basically yes.

Circling back, if you're seeing the 10th percentile probability ~0.4 it sounds like there are a bunch of sites where the model isn't very confident in the modification status. This could happen for a variety of reasons. I think digging into the modification probabilities with sample-probs or extract while looking at the reads themselves is probably a good first step in exploratory data analysis.

Adinivich commented 3 months ago

Hello ArtRand,

Thanks for your thorough answer!

Out of curiosity I directly ran modkit pileup now with --filter-percentile 0.15 on the CG model (the lightest BAM) and now the threshold is 0.53125. Already higher than before (0.46679688)

The average methylation values for modkit pileup 1. without thresholds 2. --filter-threshold (& --mod-thresholds) 0.7 and 3. --filter-threshold (& --mod-thresholds) 0.9 are the following:

| no-threshold | 0.7 / 0.7 | 0.9 / 0.9 -- | -- | -- | -- 5mC | 17.9541 | 17.8887 | 57.0815 5hmC | 2.06652 | 1.61257 | 2.94464 5mCG | 24.6057 | 28.2148 | 77.698 5hmCG | 3.4495 | 2.77225 | 4.55419 6mA | 4.64895 | 4.44633 | 2.98148

As you can see, a threshold of 0.7 doesn't change the values drastically.

I also ran the modkit sample-probs command as you suggested: modkit sample-probs $path_to_bam/Pc-5mC_5hmC.merged.sorted.bam --region Chr18 --hist --out-dir $path_to_out I chose Chr18 because it is not too big and contains chunks of clear methylation, as you can see on this example screenshot: Screenshot 2024-08-13 at 15 13 11 (2)

Here are the sample-probs output files: probabilities_Chr18.tsv.txt probabilities_Chr18.txt thresholds_Chr18.tsv.txt

It would be great if you could take a look, I am not sure how to interpret these results, especially the probabilities files. To me, looking at probabilies_Chr18.txt, it would seem logical to very grossly take 0.55-0.6 for code h, 0.5 for code m except and 0.85 for code C as thresholds, would you agree? Or am I looking at it the wrong way?

Thank you again!

ArtRand commented 2 months ago

Hello @Adinivich,

Thanks for sending those data over. The distributions of output probabilities in your data is indeed a little difficult to interpret. The latest version of modkit (v0.3.2) has an updated sample-probs command that can generate more informative tables. Mainly, it can give you the percentile rank of each probability bin to make it easier to determine per-modification thresholds. The documentation is in the usual place. I don't see why you can't pick the 10th percentile threshold for each modification and use that.

ArtRand commented 2 months ago

@Adinivich feel free to re-open this issue if you have additional questions or have difficulty running the new sample-probs command.

Adinivich commented 2 months ago

Hello ArtRand,

Thank you for your message. I needed to send the results before leaving on holidays, so finally I decided on reporting a range of methylation for each context, between filter thresholds 0.51 and 0.7 (not a huge difference in methylation %). The most important was to see if my species methylates and I feel comfortable to say that yes, it does.

Thanks again for all your help and great toolkit! Lisa

On Wed, 28 Aug 2024 at 01:49, Arthur Rand @.***> wrote:

@Adinivich https://github.com/Adinivich feel free to re-open this issue if you have additional questions or have difficulty running the new sample-probs command.

— Reply to this email directly, view it on GitHub https://github.com/nanoporetech/modkit/issues/245#issuecomment-2312780768, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHO3W7PPO6QTLBD3IO4SQKTZTSGX3AVCNFSM6AAAAABMG2SWQWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMJSG44DANZWHA . You are receiving this because you were mentioned.Message ID: @.***>

nanoporetech / modkit

Which thresholds to choose? #245