nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/
Other
151 stars 8 forks source link

What does it say about our data that the m6a RNA modification rates seem somewhat low? #307

Open aleighbrown opened 3 days ago

aleighbrown commented 3 days ago

Just a question about motif searching and data quality.

Our data is called using dorodo and we called the bedmethyls with modkit pileup

We tried using the defaults but consistently got output like the following in the log file

[src/reads_sampler/mod.rs::124][2024-11-19 13:53:03][DEBUG] sampled 7874 records
[src/command_utils.rs::121][2024-11-19 13:53:04][DEBUG] estimated pass threshold 0.6894531 for primary sequence base A
[src/pileup/subcommand.rs::628][2024-11-19 13:53:04][WARN] Threshold of 0.6894531 for base A is low. Consider increasing the filter-percentile or specifying a higher threshold.

So we've set the A and a thresholds as follows:


        modkit pileup {input.sorted_bam} {output} \
        --threads {threads} \
        --log-filepath {log}
        {params.tool_path} pileup {input.sorted_bam} {output} \
        --threads {threads} \
        --log-filepath {log} \
        --filter-threshold 0.7 \
        --mod-thresholds a:0.98

Which seemed to give decent results, e.g. when I manually checked how many modified sites were called inside DRACH motifs using the higher thresholds this pattern/fraction seemed to make sense: Screenshot 2024-11-27 at 10 00 25

Versus the same result running the pileup using the default filtering thresholds, while many more sites reported - much lower proportion of those are inside canonical DRACH motifs.

Screenshot 2024-11-27 at 10 00 40

However the issues appear when we start trying to use the motif search function

Running motif search using the following parameters

        modkit motif search \
        --in-bedmethyl {input} \
        --ref {params.ref} \
        -o {output} \
        --threads {threads} \
        --log {log} \
        --min-frac-mod 0.5 \
        --known-motif DRACH 2 a 

The motif search runs for hours and produces results like the following:

[src/find_motifs/mod.rs::1327][2024-11-26 16:40:51][INFO] loaded 20610765 low-frequency, 107149 middle-frequency and 60992 high-frequency contexts, 1 modification codes, discarded 20 contexts.
[src/find_motifs/mod.rs::736][2024-11-26 16:41:07][DEBUG] inferred A associated with modification code a
[src/find_motifs/subcommand.rs::206][2024-11-26 16:41:07][INFO] parsed 1 known motifs DR[a]CH
[src/find_motifs/mod.rs::2329][2024-11-26 16:41:11][DEBUG] at start for mod a there are 62687 high contexts and 21667461 low contexts
[src/find_motifs/mod.rs::2097][2024-11-26 16:41:27][DEBUG] there are 10 enriched kmers for mod a
[src/find_motifs/mod.rs::1918][2024-11-26 16:41:50][DEBUG] no changes, finished refining RR[a]CT into RR[a]CT, took 0 iterations
[src/find_motifs/mod.rs::1918][2024-11-26 16:41:53][DEBUG] no changes, finished refining DR[a]CW into CNNDR[a]CW, took 0 iterations
[src/find_motifs/mod.rs::1918][2024-11-26 16:41:54][DEBUG] no changes, finished refining DR[a]C into DR[a]CGNT, took 0 iterations
[src/find_motifs/mod.rs::2243][2024-11-26 16:41:55][DEBUG] discarding RR[a]CT, fraction modified too low, 0.109630175
[src/find_motifs/mod.rs::2243][2024-11-26 16:41:55][DEBUG] discarding DR[a]CGNT, fraction modified too low, 0.034914833
[src/find_motifs/mod.rs::2243][2024-11-26 16:41:55][DEBUG] discarding CNNDR[a]CW, fraction modified too low, 0.0670616
[src/find_motifs/mod.rs::2270][2024-11-26 16:41:55][DEBUG] zero refined motifs, finished in 1 iterations
[src/find_motifs/mod.rs::2348][2024-11-26 16:41:56][DEBUG] (a) seeded motifs didn't find any motifs, took 44164ms
[src/find_motifs/mod.rs::1918][2024-11-26 16:42:27][DEBUG] no changes, finished refining [a] into CNNNNNNGNNGG[a]CW, took 0 iterations
[src/find_motifs/mod.rs::2401][2024-11-26 16:42:28][DEBUG] (a) done searching for seedless motifs, found 1
[src/find_motifs/mod.rs::2411][2024-11-26 16:42:28][DEBUG] performing search
[src/find_motifs/mod.rs::603][2024-11-26 16:42:38][DEBUG] creating bool tables took 3051859 ms
[src/find_motifs/mod.rs::1918][2024-11-26 16:45:59][DEBUG] no changes, finished refining GNNNANNNNNNN[a]NNA into GNNNANNNNNGG[a]NNA, took 0 iterations
[src/find_motifs/mod.rs::1918][2024-11-26 16:46:03][DEBUG] no changes, finished refining ANNNNNNNANN[a]NC into ANNNNNNNANG[a]CC, took 0 iterations
[src/find_motifs/mod.rs::1918][2024-11-26 16:46:07][DEBUG] no changes, finished refining ANNNNNNN[a]NNANNNNNNC into ANNNNNGG[a]NNANNNNNNC, took 0 iterations
[src/find_motifs/mod.rs::2540][2024-11-26 16:46:08][DEBUG] discarding GNNNANNNNNGG[a]NNA (search), high-modified sites too low, 250
[src/find_motifs/mod.rs::2540][2024-11-26 16:46:11][DEBUG] discarding ANNNNNNNANG[a]CC (search), high-modified sites too low, 297
[src/find_motifs/mod.rs::2540][2024-11-26 16:46:16][DEBUG] discarding ANNNNNGG[a]NNANNNNNNC (search), high-modified sites too low, 246
[src/find_motifs/mod.rs::1918][2024-11-26 16:46:56][DEBUG] no changes, finished refining GGNNNNNNNNNG[a] into GGNNNNNNNNGG[a]CT, took 0 iterations
[src/find_motifs/mod.rs::1918][2024-11-26 16:46:56][DEBUG] no changes, finished refining GNGNNNNNNNNA[a] into GNGNNNNNNNGA[a]CT, took 0 iterations
[src/find_motifs/mod.rs::2546][2024-11-26 16:47:05][DEBUG] discarding GNGNNNNNNNGA[a]CT (search), fraction modified too low, 0.13379501
[src/find_motifs/mod.rs::2546][2024-11-26 16:47:07][DEBUG] discarding GGNNNNNNNNGG[a]CT (search), fraction modified too low, 0.24454316
[src/find_motifs/mod.rs::1918][2024-11-26 16:47:25][DEBUG] no changes, finished refining AGANNNNNNNNN[a] into AGANNNNNNNGG[a]CT, took 0 iterations
[src/find_motifs/mod.rs::1918][2024-11-26 16:47:27][DEBUG] no changes, finished refining ANCNNNNGNNNN[a] into ANCNNNNGNNGG[a]CT, took 0 iterations
[src/find_motifs/mod.rs::1918][2024-11-26 16:47:29][DEBUG] no changes, finished refining AGGNNNNNNNNN[a] into AGGNNNNNNNGG[a]CT, took 0 iterations
[src/find_motifs/mod.rs::1918][2024-11-26 16:47:32][DEBUG] no changes, finished refining GGNNNNNNCNNN[a] into GGNNNNNNCNGG[a]CT, took 0 iterations
[src/find_motifs/mod.rs::1918][2024-11-26 16:47:32][DEBUG] no changes, finished refining GANNGNNNNNNN[a] into GANNGNNNNNGG[a]CT, took 0 iterations
[src/find_motifs/mod.rs::1918][2024-11-26 16:47:33][DEBUG] no changes, finished refining GCNNNNNNNNNN[a]NNNNNA into GCNNNNNNNNGG[a]CTNNNA, took 0 iterations
[src/find_motifs/mod.rs::2546][2024-11-26 16:47:34][DEBUG] discarding AGANNNNNNNGG[a]CT (search), fraction modified too low, 0.2525773
[src/find_motifs/mod.rs::1918][2024-11-26 16:47:35][DEBUG] no changes, finished refining GNTNNNGNNNNN[a] into GNTNNNGNNNGG[a]CT, took 0 iterations
[src/find_motifs/mod.rs::2540][2024-11-26 16:47:36][DEBUG] discarding ANCNNNNGNNGG[a]CT (search), high-modified sites too low, 247
[src/find_motifs/mod.rs::1918][2024-11-26 16:47:38][DEBUG] no changes, finished refining AGNNNNNNCNNN[a] into AGNNNNNNCNGG[a]CT, took 0 iterations
[src/find_motifs/mod.rs::1918][2024-11-26 16:47:38][DEBUG] no changes, finished refining GNNNNNNN[a]NNNGNNNNNNC into GNNNNNGG[a]CTNGNNNNNNC, took 0 iterations
[src/find_motifs/mod.rs::2546][2024-11-26 16:47:39][DEBUG] discarding GGNNNNNNCNGG[a]CT (search), fraction modified too low, 0.27569443
[src/find_motifs/mod.rs::2546][2024-11-26 16:47:40][DEBUG] discarding AGGNNNNNNNGG[a]CT (search), fraction modified too low, 0.26079243
[src/find_motifs/mod.rs::2540][2024-11-26 16:47:40][DEBUG] discarding GANNGNNNNNGG[a]CT (search), high-modified sites too low, 235
[src/find_motifs/mod.rs::1918][2024-11-26 16:47:40][DEBUG] no changes, finished refining AANNNNNNNNNN[a]NNNNNNNG into AANNNNNNNNGG[a]CTNNNNNG, took 0 iterations
[src/find_motifs/mod.rs::2546][2024-11-26 16:47:41][DEBUG] discarding GCNNNNNNNNGG[a]CTNNNA (search), fraction modified too low, 0.31734693
[src/find_motifs/mod.rs::1918][2024-11-26 16:47:41][DEBUG] no changes, finished refining TNNANANNNNNN[a] into TNNANANNNNGG[a]CT, took 0 iterations
[src/find_motifs/mod.rs::1918][2024-11-26 16:47:41][DEBUG] no changes, finished refining ACNANNNNNNNN[a] into ACNANNNNNNRG[a]CT, took 0 iterations
[src/find_motifs/mod.rs::1918][2024-11-26 16:47:41][DEBUG] no changes, finished refining AANNNNNNNNN[a]NNNNNNC into AANNNNNNNGG[a]CTNNNNC, took 0 iterations
[src/find_motifs/mod.rs::2540][2024-11-26 16:47:43][DEBUG] discarding GNTNNNGNNNGG[a]CT (search), high-modified sites too low, 213
[src/find_motifs/mod.rs::1918][2024-11-26 16:47:43][DEBUG] no changes, finished refining AANNNNN[a]NNNNNNNG into AANNNGG[a]CTNNNNNG, took 0 iterations
[src/find_motifs/mod.rs::1918][2024-11-26 16:47:44][DEBUG] no changes, finished refining ANNN[a]NNANNNC into ANGG[a]CTANNNC, took 0 iterations
[src/find_motifs/mod.rs::2546][2024-11-26 16:47:45][DEBUG] discarding AGNNNNNNCNGG[a]CT (search), fraction modified too low, 0.2720588
[src/find_motifs/mod.rs::1918][2024-11-26 16:47:46][DEBUG] no changes, finished refining GTNNANNNNNNN[a] into GTNNANNNNNGR[a]CT, took 0 iterations
[src/find_motifs/mod.rs::1918][2024-11-26 16:47:46][DEBUG] no changes, finished refining CTNNNNNNNNNN[a]NNNA into CTNNNNNNNNGG[a]CWNA, took 0 iterations
[src/find_motifs/mod.rs::2546][2024-11-26 16:47:46][DEBUG] discarding GNNNNNGG[a]CTNGNNNNNNC (search), fraction modified too low, 0.2554622
[src/find_motifs/mod.rs::1918][2024-11-26 16:47:47][DEBUG] no changes, finished refining GANNTNNNNNNN[a] into GANNTNNNNNGG[a]CW, took 0 iterations
[src/find_motifs/mod.rs::1918][2024-11-26 16:47:47][DEBUG] no changes, finished refining AANNNNNNNNNN[a]NNNA into AANNNNNNNNGG[a]CWNA, took 0 iterations
[src/find_motifs/mod.rs::2546][2024-11-26 16:47:48][DEBUG] discarding TNNANANNNNGG[a]CT (search), fraction modified too low, 0.23417722
[src/find_motifs/mod.rs::1918][2024-11-26 16:47:49][DEBUG] no changes, finished refining ANNNNNNN[a]NNGNNNNNNNNA into ANNNNNGR[a]CTGNNNNNNNNA, took 0 iterations
[src/find_motifs/mod.rs::1918][2024-11-26 16:47:49][DEBUG] no changes, finished refining ANANNNCNNN[a] into ANANNNCNGR[a]CT, took 0 iterations
[src/find_motifs/mod.rs::1918][2024-11-26 16:47:49][DEBUG] no changes, finished refining ATNNNNNANNNN[a] into ATNNNNNANNGR[a]CT, took 0 iterations
[src/find_motifs/mod.rs::2540][2024-11-26 16:47:50][DEBUG] discarding AANNNNNNNGG[a]CTNNNNC (search), high-modified sites too low, 255
[src/find_motifs/mod.rs::2546][2024-11-26 16:47:51][DEBUG] discarding ACNANNNNNNRG[a]CT (search), fraction modified too low, 0.24649532
[src/find_motifs/mod.rs::1918][2024-11-26 16:47:51][DEBUG] no changes, finished refining CNNGNNNNNNNA[a] into CNNGNNNNNNGA[a]CTS, took 0 iterations
[src/find_motifs/mod.rs::2546][2024-11-26 16:47:52][DEBUG] discarding AANNNGG[a]CTNNNNNG (search), fraction modified too low, 0.2937853
[src/find_motifs/mod.rs::2540][2024-11-26 16:47:53][DEBUG] discarding ANGG[a]CTANNNC (search), high-modified sites too low, 159

My questions are thus:

1: Are the stringency settings on modkit pileup perhaps too high, resulting in fewer sites being called? 2: Does the fact that we had to set the settings so high say something about the raw data quality that I might be missing?

I also realize that if I want the motif search to finish I can play around more with the --low-thresh --high-thresh parameters, but I'm not sure if that's something which makes sense here, or if instead I should take the lowish m6a rate as a sign of something a miss with an earlier step in data processing.

We're the first lab in the department to do this kind of direct RNA sequencing + the first one to try it with the sequencing facility so I'm just a bit confused as to what these results might mean re: quality of our data.

Thank you for the help!

aleighbrown commented 3 days ago

Also exploring the probability distributions with modkit sample-probs - here I'm just showing the cumulative probability for our 2 conditions, 3 samples each, and there's a systemic difference between our conditions for the canonical A and the modified A - this seems to be problematic to me. Is there an obvious reason we might observe this either technically?

image image

ArtRand commented 2 days ago

Hello @aleighbrown,

Could you let me know which m6A model you're using (which version)? If you used the automatic model selection, I need the dorado version.

Thanks for this analysis, looking for the proportion of 6mA calls within DRACH motifs is a nice sanity check (I'm assuming you ran the "all-context" m6A model).

As far as the motif searching, could you run modkit motif evaluate (documentation is here)? You can use --known-motif DRACH 2 instead of providing a whole table. I suspect that you have low levels of m6A in your sample and the default settings aren't picking it up. The default settings are mostly set up for bacterial motif searching.

Let me take a look at the output probabilities on some data I have and circle back. My suspicion is that the sample has low m6A and the ECDF is essentially dominated by low confidence false positive calls.

aleighbrown commented 2 days ago

Yes we ran the "all-context" m6A model on these samples.

We ran this actually in 2 sets, one sample (Knockdown 1 in this graph) image was run with dorodo version number 0.7.3+6e6c45c CL:dorado basecaller sup,m6A and the other 5 were run with dorodo version 0.8.1+c3a2952 CL:dorado basecaller sup@v5.0.0,m6A

You can see from the above graph that even at the higher m6a probabilities there's a slight reduction in high probability m6a calls for the sample run on the earlier dorodo version so we're actually planning to recall this data anyhow with dorodo v 0.8.3 with the dorado_model sup@v5.1.0,m6A_DRACH basecalling models to maintain consistency across the samples.

We have some orthogonal reasons to suspect that our knockdown should reduce overall levels of RNA m6a - would this affect the whole probability estimation in someway that I'm not grokking?

I'll report back with the results on the newer model run.

aleighbrown commented 2 days ago

So motif evaluate gives us results like this - with the default settings for --low-thresh 0.2 and --high-thresh 0.6

evaluated motifs:
+---------+------------+------------+-----------+-----------+----------+
| motif   | frac_mod   | high_count | low_count | mid_count | log_odds |
+=========+============+============+===========+===========+==========+
| DR[a]CH | 0.04059856 | 58296      | 1377617   | 88636     | 8.187913 |
+---------+------------+------------+-----------+-----------+----------+

What I was hoping to accomplish with motif search was to see if the methylated motifs are changing between our control and KD (more specifically I was curious to know if higher strength DRACH motifs were less affected by KD than lower strength ones), I wonder if rather than using motif search to find de-novo motifs I could instead input all the DRACH motifs into motif evaluate and compare the rates there...

Also do these rates seem...reasonable for a sample? It's hard for me to find information published on direct-RNA seq which is comparable