Open aleighbrown opened 3 days ago
Also exploring the probability distributions with modkit sample-probs: here I'm just showing the cumulative probability for our 2 conditions, 3 samples each, and there's a systematic difference between our conditions for both the canonical A and the modified A, which seems problematic to me. Is there an obvious technical reason we might observe this?
Hello @aleighbrown,
Could you let me know which m6A model you're using (which version)? If you used the automatic model selection, I need the dorado version.
Thanks for this analysis, looking for the proportion of 6mA calls within DRACH motifs is a nice sanity check (I'm assuming you ran the "all-context" m6A model).
As far as the motif searching goes, could you run modkit motif evaluate (documentation is here)? You can use --known-motif DRACH 2 instead of providing a whole table. I suspect that you have low levels of m6A in your sample and the default settings aren't picking it up. The default settings are mostly set up for bacterial motif searching.
Let me take a look at the output probabilities on some data I have and circle back. My suspicion is that the sample has low m6A and the ECDF is essentially dominated by low confidence false positive calls.
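To make the ECDF point concrete, here is a minimal sketch of how an empirical CDF over per-call probabilities is computed. The `ecdf` helper and the `toy_probs` values are hypothetical and made up for illustration; they are not modkit output. With mostly low-confidence calls, the curve rises almost to 1 before the high-probability region is reached.

```python
from bisect import bisect_right


def ecdf(values):
    """Return a function F where F(x) is the fraction of values <= x."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one value")
    return lambda x: bisect_right(ordered, x) / n


# Synthetic probabilities (invented for this sketch): eight low-confidence
# calls and two high-confidence calls.
toy_probs = [0.52, 0.55, 0.58, 0.60, 0.61, 0.63, 0.66, 0.70, 0.97, 0.99]
F = ecdf(toy_probs)

# Fraction of calls at or below a probability of 0.7.
print(F(0.7))
```

If most of the mass sits at low probabilities like this, small shifts in the number of low-confidence calls between samples can move the whole curve.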
Yes we ran the "all-context" m6A model on these samples.
We actually ran this in 2 sets: one sample (Knockdown 1 in this graph) was run with dorado version 0.7.3+6e6c45c (CL: dorado basecaller sup,m6A) and the other 5 were run with dorado version 0.8.1+c3a2952 (CL: dorado basecaller sup@v5.0.0,m6A).
You can see from the above graph that even at the higher m6A probabilities there's a slight reduction in high-probability m6A calls for the sample run on the earlier dorado version, so we're planning to re-call this data anyhow with dorado v0.8.3 and the dorado_model sup@v5.1.0,m6A_DRACH basecalling models to maintain consistency across the samples.
We have some orthogonal reasons to suspect that our knockdown should reduce overall levels of RNA m6A. Would this affect the whole probability estimation in some way that I'm not grokking?
I'll report back with the results on the newer model run.
So motif evaluate gives us results like this, with the default settings for --low-thresh 0.2 and --high-thresh 0.6:
evaluated motifs:
+---------+------------+------------+-----------+-----------+----------+
| motif | frac_mod | high_count | low_count | mid_count | log_odds |
+=========+============+============+===========+===========+==========+
| DR[a]CH | 0.04059856 | 58296 | 1377617 | 88636 | 8.187913 |
+---------+------------+------------+-----------+-----------+----------+
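As a quick arithmetic check on that row: the reported frac_mod appears to match high_count / (high_count + low_count), with mid_count left out of the denominator. That reading is inferred from the numbers above, not taken from the modkit documentation, so treat it as an assumption.

```python
# Counts copied from the DR[a]CH row of the table above.
high_count = 58296
low_count = 1377617
mid_count = 88636
reported_frac_mod = 0.04059856

# Assumption: mid-confidence sites are excluded from the denominator.
frac_excluding_mid = high_count / (high_count + low_count)
# Alternative reading, for comparison: mid-confidence sites are included.
frac_including_mid = high_count / (high_count + low_count + mid_count)

print(f"excluding mid: {frac_excluding_mid:.8f}")
print(f"including mid: {frac_including_mid:.8f}")
print(f"reported:      {reported_frac_mod:.8f}")
```

Only the first variant agrees with the reported value to the printed precision, which suggests sites between the low and high thresholds are not counted either way.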
What I was hoping to accomplish with motif search was to see if the methylated motifs are changing between our control and KD (more specifically, I was curious to know if higher-strength DRACH motifs were less affected by KD than lower-strength ones). I wonder if, rather than using motif search to find de novo motifs, I could instead input all the DRACH motifs into motif evaluate and compare the rates there...
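For the "input all the DRACH motifs" idea, a minimal sketch of expanding the degenerate motif into its concrete 5-mers. The `expand_motif` helper and `IUPAC` table are hypothetical names for this example; the offset of 2 follows the `--known-motif DRACH 2` suggestion earlier in the thread. How each k-mer is then passed to motif evaluate is left open here.

```python
from itertools import product

# IUPAC codes needed for DRACH, in the DNA alphabet (swap T for U if your
# sequences are written as RNA).
IUPAC = {"D": "AGT", "R": "AG", "A": "A", "C": "C", "H": "ACT"}


def expand_motif(motif):
    """Expand a degenerate motif into every concrete sequence it matches."""
    return ["".join(p) for p in product(*(IUPAC[base] for base in motif))]


drach_kmers = expand_motif("DRACH")
OFFSET = 2  # 0-based position of the modified A within each 5-mer

for kmer in drach_kmers:
    print(kmer, OFFSET)
```

This yields 3 x 2 x 1 x 1 x 3 = 18 k-mers, so comparing control and KD per k-mer is a small table rather than a de novo search.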
Also do these rates seem...reasonable for a sample? It's hard for me to find information published on direct-RNA seq which is comparable
Just a question about motif searching and data quality.
Our data is called using dorado and we called the bedmethyls with modkit pileup.
We tried using the defaults but consistently got output like the following in the log file
So we've set the A and a thresholds as follows:
Which seemed to give decent results, e.g. when I manually checked how many modified sites were called inside DRACH motifs using the higher thresholds this pattern/fraction seemed to make sense:
Versus the same result running the pileup with the default filtering thresholds: many more sites are reported, but a much lower proportion of them are inside canonical DRACH motifs.
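The manual DRACH check described above can be sketched roughly as follows. This is an illustration under assumptions, not the exact script we used: `in_drach`, `drach_fraction`, and the toy sequence are hypothetical, positions are 0-based, and the sequence is assumed to be on the same strand as the calls (e.g. a transcriptome reference).

```python
import re

# DRACH: D = A/G/T, R = A/G, then A, C, and H = A/C/T.
DRACH = re.compile(r"[AGT][AG]AC[ACT]")


def in_drach(seq, pos):
    """True if the base at 0-based index pos is the central A of a DRACH 5-mer."""
    if pos < 2 or pos + 3 > len(seq):
        return False  # not enough flanking sequence to form a 5-mer
    return DRACH.fullmatch(seq[pos - 2:pos + 3].upper()) is not None


def drach_fraction(seq, positions):
    """Fraction of the given called positions that sit inside a DRACH motif."""
    if not positions:
        return 0.0
    return sum(in_drach(seq, p) for p in positions) / len(positions)


# Toy reference and toy called positions, invented for this sketch.
toy_seq = "TTGGACTAAAACCGAACATT"
toy_calls = [4, 8, 10, 17]
print(drach_fraction(toy_seq, toy_calls))
```

Running the same function on the sites from the default-threshold pileup and from the higher-threshold pileup gives the two fractions being compared.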
However, the issues appear when we start trying to use the motif search function.
Running motif search using the following parameters:
The motif search runs for hours and produces results like the following:
My questions are thus:
1: Are the stringency settings on modkit pileup perhaps too high, resulting in fewer sites being called?
2: Does the fact that we had to set the settings so high say something about the raw data quality that I might be missing?
I also realize that if I want the motif search to finish I can play around more with the --low-thresh and --high-thresh parameters, but I'm not sure if that's something which makes sense here, or if instead I should take the lowish m6A rate as a sign of something amiss with an earlier step in data processing.
We're the first lab in the department to do this kind of direct RNA sequencing and the first one to try it with the sequencing facility, so I'm just a bit confused as to what these results might mean for the quality of our data.
Thank you for the help!