Open Macdot3 opened 8 months ago
Hello @Macdot3,
This is a difficult question to answer in a very general case. A couple things you could try:
modkit summary
and see what the methylation levels are in the reads in the PCR and native samples. I would expect that the PCR samples have very low methylation levels <3% depending on the modification model. If you expect the native samples to contain a modification - the summary should report how often that modification is found in the reads. If the native DNA should not have the modification it will give you an estimate of the false positive rate.modkit dmr pair
or modkit dmr multi
comparing the PCR and native samples, see the documentation for details on these commands. If you tell me the result of (1) above maybe I can give you additional advice.
Hello Arthur, sorry for the delay in responding, thank you so much for your advice. I'm sending you my summary results separated by native samples and PCR in two files. You'll find the sample correspondence in the names. Only two of them don't have duplicates.
From what you're telling me, should I consider the all_frac
column for the threshold, right? As for point 2, I'm working on it. My goal is primarily to determine whether, depending on the nanopore chemistry, I have methylation on PCR samples and thus determine biases related to the technique, and then subsequently perform analyses of differentially methylated regions.
Could this be a sensible approach? I'm relying on your advice.
Hello @Macdot3,
First off, I would use the pass_frac
column to check on modification levels. This is the fraction of each modification call after applying the thresholding.
A couple things stand out to me from your modkit summary
reports:
--no-sampling
flag or a higher --sampling-frac
to collect statistics on more reads.modkit dmr
on these samples.Happy to answer any additional questions.
Thank you, @ArtRand, for your invaluable support. Upon conducting a more detailed analysis, it appears that my native Linfo_ samples exhibit remarkably low coverage.
This suggests they may not be optimal for methylation analysis on MtDNA, prompting us to internally reassess our approach. Interestingly, despite employing a- - no-sampling
technique, I observe consistent read counts, significantly impacting the report's integrity and the accuracy of methylation percentage calculations.
Regarding thresholds, I have some inquiries. How does modkit discern, in its automated threshold determination, whether these samples are PCR? While it may seem trivial, I seek clarity to deepen my understanding. Additionally, in setting a summary threshold manually (excluding dorado), could you provide guidance pertinent to the ongoing experiment? Your insights would be greatly appreciated. Thank you once again for your assistance.
I also have another question: how many CpG sites does Modkit calculate the threshold for and provide extract outputs? To clarify, I know that there are a total of 435 predicted CpG sites on the mtDNA, but the output from Modkit extract still provides many more sites than 435 on the ref_position. I'm not sure if I was clear. I appreciate your advice, thank you very much
Hello @Macdot3,
As far as the number of reads in the modkit summary
report, if the --no-sampling
flag is passed all of the reads in the BAM will be used. So if you have very few mapped reads in the BAM (mapped or unmapped) modkit
will only be able to use the few that are there and it will not try and perform any imputation.
As for the threshold calculation, the exact calculation can be found in the documentation. The number of CpG sites used in the calculation isn't logged. I can add this in the next release. The number of reads is logged, however, use --log-filepath
with any of the commands and it will report how many records were used. If you're worried that you're not getting enough modification sites to properly calculate the threshold increase the --num-reads
option or set --sampling-frac
to 1.0.
If you are using modkit extract
without the --mapped-only
flag you may get modification calls where a read has a CG but the reference is CH. One thing you could consider doing is using a BED file with the CpGs on just the mitochondrial chromosome, both extract
and pileup
will accept this option. Please refer to the program --help
for the details for these options. Good luck, sorry about the slow response.
Hello Arthur,
I want to express my gratitude for your advice and for considering my suggestion. I wanted to update you on my project. I've calculated the methylation percentage of CpG islands separately on the positive and negative strands. I noticed that even on the PCR samples, I'm obtaining a very high distribution of percentages. I'm currently attempting to create a directory and use the modkit sample-probs
command to generate histograms and output txt and tsv files, from which I'll manually select the threshold. To avoid having a lot of methylated bases for percentage calculation, which threshold do you suggest I work with: --filter-threshold
or --mod-threshold
? Or should I increase the threshold from dorado basecall? These are my extract commands also:
modkit extract --mapped-only --ref /home/rCRS_1624.fasta --include-bed /home/CG_motifs_PCR.bed /home/PCR_aligned_sorted_FORWARD.bam /home/PCR_FORWARD_extract.tsv
@ArtRand , I've generated my probabilities.tsv and probabilities.txt files. I wanted to ask you for information on how sample-probs
calculates the threshold. Does it possibly take into account the mod_qual
values from the extract? Additionally, I'm getting a minimum value of 0.349. Can I consider this as the filter threshold, even though I have methylation values below this threshold? As always, I rely on your advice and look forward to your response. Thank you very much.
Hello @Macdot3,
The two options, --filter-threshold
and --mod-threshold
are described in the documentation. The details of how the pass threshold is estimated is also in the documentation.
In general, I would trust the filter threshold that is estimated by modkit
. Could you send the output of modkit sample-probs
? I would not set the threshold as the minimum value (0.349). If you're trying to get confident calls (modified or not), threshold values around 0.9 are the most common. If you post your probabilities.tsv
I can take a look and confirm.
Certainly, I'll attach my output files separately for the + and - strands of two native and PCR samples that I selected to estimate the threshold, considering the highest coverage.
Sample_probs.zip
My goal remains to determine the background noise of methylation calling in the PCR samples compared to the native ones, as I mentioned earlier if you remember. I want to see if by setting a threshold, I obtain something different from the previous data with modkit summary
, as you advised me to do to estimate pass_frac
fro noise. If you look at my files, I get a minimum value in the tail of 0.349 in both, and perhaps I got misled by this as the minimum threshold? Instead, in threshold.tsv
, I get a value at the 10th percentile (which I understand is my pass_threshold
from modkit summary
, can you confirm?) of 0.91 for both. In addition, I wanted to ask you for some more information about the mod_qual
column of modkit extract. I've read the documentation and was wondering how I could leverage it in my objective, in your opinion.
I always thank you so much for your advice and the time you dedicate to me. I'll be waiting.
Hello @Macdot3,
Thanks for sending these results over. The 10% threshold in all of the samples is ~0.9, which is what I would expect. The minimum value you're seeing is the minimum value bin. The file you're looking at is a histogram so what this means is you have at least 1 call with that probability. I'll clarify the documentation on the contents of these files. This is also what I would expect since there are 3 classes, so if the model cannot make a call it will have a probability of ~33% (random guess). The mod_qual
is the probability the model has assigned to the call or modification code. If you're tying to determine false positive rates per strand by sequencing PCR samples, I would still recommend using modkit pileup
then separating the (+) and (-) strand rows. The empirical estimate of the FPR is sum(N_mod) / sum(N_valid_coverage).
Thank you very much, @ArtRand , for your advice.
If you could clarify the documentation on this part, it would be fantastic, and I appreciate it. I'm running the analysis with pileup
using the threshold I estimated as you suggested, but with a different reference (only mtDNA) from the first time (hg38) where I found % values from pileup of only 0 and 100, as we discussed in a comment #133. I'll keep you updated on this to see what I get. Also, I wanted to ask you, in your algorithm for estimating the threshold, as q=argmax(P)
does mod_qual
perhaps correspond to P or Pm or is it somehow implicated in this q estimation? It seems to me that from the documentation and from how you explain it, they could be similar concepts, or are they different measures?
Thank you very much for your clarification and for your help.
Hi everyone, I have a question regarding the analysis of data derived from modkits. I am working with two sets of data (for the moment, a few samples): one concerning samples sequenced without PCR amplification using MinION r10, and another set of data derived from PCR and then sequenced using the same technology. Some samples are the same. I was wondering if there is a method (statistical or a pipeline, etc.) that would allow me to determine whether the methylation values obtained constitute background noise on the PCR samples, as I do not expect significant methylation on these. Additionally, if there is a way to establish a threshold for background noise. I have read a lot about it, but perhaps you can help me. Thank you all.