nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/
Other
151 stars 8 forks source link

Detect methylation data background noise #147

Open Macdot3 opened 8 months ago

Macdot3 commented 8 months ago

Hi everyone, I have a question regarding the analysis of data derived from modkits. I am working with two sets of data (for the moment, a few samples): one concerning samples sequenced without PCR amplification using MinION r10, and another set of data derived from PCR and then sequenced using the same technology. Some samples are the same. I was wondering if there is a method (statistical or a pipeline, etc.) that would allow me to determine whether the methylation values obtained constitute background noise on the PCR samples, as I do not expect significant methylation on these. Additionally, if there is a way to establish a threshold for background noise. I have read a lot about it, but perhaps you can help me. Thank you all.

ArtRand commented 8 months ago

Hello @Macdot3,

This is a difficult question to answer in a very general case. A couple things you could try:

  1. Run modkit summary and see what the methylation levels are in the reads in the PCR and native samples. I would expect that the PCR samples have very low methylation levels <3% depending on the modification model. If you expect the native samples to contain a modification - the summary should report how often that modification is found in the reads. If the native DNA should not have the modification it will give you an estimate of the false positive rate.
  2. Run modkit dmr pair or modkit dmr multi comparing the PCR and native samples, see the documentation for details on these commands.

If you tell me the result of (1) above maybe I can give you additional advice.

Macdot3 commented 8 months ago

Hello Arthur, sorry for the delay in responding, thank you so much for your advice. I'm sending you my summary results separated by native samples and PCR in two files. You'll find the sample correspondence in the names. Only two of them don't have duplicates.

native.txt PCR.txt

From what you're telling me, should I consider the all_fraccolumn for the threshold, right? As for point 2, I'm working on it. My goal is primarily to determine whether, depending on the nanopore chemistry, I have methylation on PCR samples and thus determine biases related to the technique, and then subsequently perform analyses of differentially methylated regions. Could this be a sensible approach? I'm relying on your advice.

ArtRand commented 8 months ago

Hello @Macdot3,

First off, I would use the pass_frac column to check on modification levels. This is the fraction of each modification call after applying the thresholding.

A couple things stand out to me from your modkit summary reports:

  1. Some reports only have a few reads e.g. PCR_MT6869L_summary only uses 29 reads, re-run these with the --no-sampling flag or a higher --sampling-frac to collect statistics on more reads.
  2. All of the PCR runs have very low modification rates (generally less than 1%), this is good and what I would expect.
  3. Of the 3 native runs with >100 reads, the percentages look pretty similar - at a high level. If you're looking to find changes in methylation patterns you could try using modkit dmr on these samples.

Happy to answer any additional questions.

Macdot3 commented 8 months ago

Thank you, @ArtRand, for your invaluable support. Upon conducting a more detailed analysis, it appears that my native Linfo_ samples exhibit remarkably low coverage. This suggests they may not be optimal for methylation analysis on MtDNA, prompting us to internally reassess our approach. Interestingly, despite employing a- - no-sampling technique, I observe consistent read counts, significantly impacting the report's integrity and the accuracy of methylation percentage calculations.

Regarding thresholds, I have some inquiries. How does modkit discern, in its automated threshold determination, whether these samples are PCR? While it may seem trivial, I seek clarity to deepen my understanding. Additionally, in setting a summary threshold manually (excluding dorado), could you provide guidance pertinent to the ongoing experiment? Your insights would be greatly appreciated. Thank you once again for your assistance.

Macdot3 commented 8 months ago

I also have another question: how many CpG sites does Modkit calculate the threshold for and provide extract outputs? To clarify, I know that there are a total of 435 predicted CpG sites on the mtDNA, but the output from Modkit extract still provides many more sites than 435 on the ref_position. I'm not sure if I was clear. I appreciate your advice, thank you very much

ArtRand commented 8 months ago

Hello @Macdot3,

As far as the number of reads in the modkit summary report, if the --no-sampling flag is passed all of the reads in the BAM will be used. So if you have very few mapped reads in the BAM (mapped or unmapped) modkit will only be able to use the few that are there and it will not try and perform any imputation.

As for the threshold calculation, the exact calculation can be found in the documentation. The number of CpG sites used in the calculation isn't logged. I can add this in the next release. The number of reads is logged, however, use --log-filepath with any of the commands and it will report how many records were used. If you're worried that you're not getting enough modification sites to properly calculate the threshold increase the --num-reads option or set --sampling-frac to 1.0.

If you are using modkit extract without the --mapped-only flag you may get modification calls where a read has a CG but the reference is CH. One thing you could consider doing is using a BED file with the CpGs on just the mitochondrial chromosome, both extract and pileup will accept this option. Please refer to the program --help for the details for these options. Good luck, sorry about the slow response.

Macdot3 commented 7 months ago

Hello Arthur, I want to express my gratitude for your advice and for considering my suggestion. I wanted to update you on my project. I've calculated the methylation percentage of CpG islands separately on the positive and negative strands. I noticed that even on the PCR samples, I'm obtaining a very high distribution of percentages. I'm currently attempting to create a directory and use the modkit sample-probs command to generate histograms and output txt and tsv files, from which I'll manually select the threshold. To avoid having a lot of methylated bases for percentage calculation, which threshold do you suggest I work with: --filter-thresholdor --mod-threshold? Or should I increase the threshold from dorado basecall? These are my extract commands also: modkit extract --mapped-only --ref /home/rCRS_1624.fasta --include-bed /home/CG_motifs_PCR.bed /home/PCR_aligned_sorted_FORWARD.bam /home/PCR_FORWARD_extract.tsv

Macdot3 commented 7 months ago

@ArtRand , I've generated my probabilities.tsv and probabilities.txt files. I wanted to ask you for information on how sample-probs calculates the threshold. Does it possibly take into account the mod_qual values from the extract? Additionally, I'm getting a minimum value of 0.349. Can I consider this as the filter threshold, even though I have methylation values below this threshold? As always, I rely on your advice and look forward to your response. Thank you very much.

ArtRand commented 7 months ago

Hello @Macdot3,

The two options, --filter-threshold and --mod-threshold are described in the documentation. The details of how the pass threshold is estimated is also in the documentation.

In general, I would trust the filter threshold that is estimated by modkit. Could you send the output of modkit sample-probs? I would not set the threshold as the minimum value (0.349). If you're trying to get confident calls (modified or not), threshold values around 0.9 are the most common. If you post your probabilities.tsv I can take a look and confirm.

Macdot3 commented 7 months ago

Certainly, I'll attach my output files separately for the + and - strands of two native and PCR samples that I selected to estimate the threshold, considering the highest coverage. Sample_probs.zip My goal remains to determine the background noise of methylation calling in the PCR samples compared to the native ones, as I mentioned earlier if you remember. I want to see if by setting a threshold, I obtain something different from the previous data with modkit summary, as you advised me to do to estimate pass_frac fro noise. If you look at my files, I get a minimum value in the tail of 0.349 in both, and perhaps I got misled by this as the minimum threshold? Instead, in threshold.tsv, I get a value at the 10th percentile (which I understand is my pass_threshold from modkit summary, can you confirm?) of 0.91 for both. In addition, I wanted to ask you for some more information about the mod_qual column of modkit extract. I've read the documentation and was wondering how I could leverage it in my objective, in your opinion.
I always thank you so much for your advice and the time you dedicate to me. I'll be waiting.

ArtRand commented 7 months ago

Hello @Macdot3,

Thanks for sending these results over. The 10% threshold in all of the samples is ~0.9, which is what I would expect. The minimum value you're seeing is the minimum value bin. The file you're looking at is a histogram so what this means is you have at least 1 call with that probability. I'll clarify the documentation on the contents of these files. This is also what I would expect since there are 3 classes, so if the model cannot make a call it will have a probability of ~33% (random guess). The mod_qual is the probability the model has assigned to the call or modification code. If you're tying to determine false positive rates per strand by sequencing PCR samples, I would still recommend using modkit pileup then separating the (+) and (-) strand rows. The empirical estimate of the FPR is sum(N_mod) / sum(N_valid_coverage).

Macdot3 commented 7 months ago

Thank you very much, @ArtRand , for your advice. If you could clarify the documentation on this part, it would be fantastic, and I appreciate it. I'm running the analysis with pileupusing the threshold I estimated as you suggested, but with a different reference (only mtDNA) from the first time (hg38) where I found % values from pileup of only 0 and 100, as we discussed in a comment #133. I'll keep you updated on this to see what I get. Also, I wanted to ask you, in your algorithm for estimating the threshold, as q=argmax(P)does mod_qual perhaps correspond to P or Pm or is it somehow implicated in this q estimation? It seems to me that from the documentation and from how you explain it, they could be similar concepts, or are they different measures? Thank you very much for your clarification and for your help.

ArtRand commented 7 months ago

Hello @Macdot3,

The mod_qual is the probability of a specific modification (for example 5mC or 5hmC). Then the canonical probability is 1 - sum(P_mod). Please refer to section 1.7 of the SAMTags specification for details.