m6A methylation analysis and explanation of output files

JhinJhinJhin commented 1 month ago

methylation distribution on read

I want to analyze the RNA m6A modification. I selected m6A_DRACH model when dorado basecaller. Then I use modkit for downstream analysis. modkit pileup $bam \ ${id}_pileup.bed \ --ref $ref \ -t 12 --log-filepath ${id}_pileup.log

bgzip -@ 4 -k $bed tabix -@ 4 -p bed ${bed}.gz modkit motif bed DM8.1.fa DRACH 2 1> DM8_DRACH.bed modkit extract calls -t 8 $bam extract_calls/SDL-1_ref_calls.tsv --ref $genome --include-bed DM8_DRACH.bed I use modkit extract calls command to generate a table of read-level base modification calls.

read_id forward_read_position ref_position chrom mod_strand ref_strand ref_mod_strand fw_soft_clipped_start fw_soft_clipped_end read_length call_prob call_code base_qual ref_kmer query_kmer canonical_base modified_primary_base fail inferred within_alignment flag ad2e3781-d1bf-4457-a6e5-e48638151a51 1177 901581 chr01 + - - 0 87 1299 0.9941406 - 24 TGTTT AAACA A A false false true 16 ad2e3781-d1bf-4457-a6e5-e48638151a51 1082 901676 chr01 + - - 0 87 1299 0.9980469 - 28 TGTCA TGACA A A false false true 16 ad2e3781-d1bf-4457-a6e5-e48638151a51 1008 901750 chr01 + - - 0 87 1299 0.9980469 - 23 GGTTT AAACC A A false false true 16 ad2e3781-d1bf-4457-a6e5-e48638151a51 955 901803 chr01 + - - 0 87 1299 0.9433594 - 12 TGTTC GAACA A A true false true 16 ad2e3781-d1bf-4457-a6e5-e48638151a51 887 901881 chr01 + - - 0 87 1299 0.9980469 - 17 TGTCA TGACA A A false false true 16 ad2e3781-d1bf-4457-a6e5-e48638151a51 878 901890 chr01 + - - 0 87 1299 0.9980469 - 23 AGTTT AAACT A A false false true 16 ad2e3781-d1bf-4457-a6e5-e48638151a51 809 901960 chr01 + - - 0 87 1299 0.9980469 - 20 AGTCC GGACT A A false false true 16 ad2e3781-d1bf-4457-a6e5-e48638151a51 786 901983 chr01 + - - 0 87 1299 0.8808594 - 19 AGTCA TGACT A A true false true 16 c90cf7b9-674f-46d6-bfa4-936058b6ea0e 126 971421 chr01 + - - 0 68 1258 0.8222656 a 23 TGTTA TAACA A A true false true 16 c90cf7b9-674f-46d6-bfa4-936058b6ea0e 54 972330 chr01 + - - 0 68 1258 0.9980469 - 15 AGTTT AAACT A A false false true 16 c90cf7b9-674f-46d6-bfa4-936058b6ea0e 2 972382 chr01 + - - 0 68 1258 0.9980469 - 19 TGTCA TGACA A A false false true 16 6ade67ef-11c0-4ea6-8755-cbc0da1a4f1f 253 901581 chr01 + - - 2 80 340 0.9980469 - 19 TGTTT AAACA A A false false true 16 6ade67ef-11c0-4ea6-8755-cbc0da1a4f1f 176 970230 chr01 + - - 2 80 340 0.9980469 - 30 GGTCT AGACC A A false false true 16 6ade67ef-11c0-4ea6-8755-cbc0da1a4f1f 163 970243 chr01 + - - 2 80 340 0.9902344 - 31 TGTTA TAACA A A false false true 16 6ade67ef-11c0-4ea6-8755-cbc0da1a4f1f 89 970317 chr01 + - - 2 80 340 0.9980469 - 17 GGTTT AAACC A A false false true 16 6ade67ef-11c0-4ea6-8755-cbc0da1a4f1f 36 970370 chr01 + - - 2 80 340 0.9980469 - 27 TGTTC GAACA A A false false true 16

I would like to know what conditions satisfy the site that can be considered as a modified site on the read. Is it when 'fail' and 'inferred' are both false? And I’m not quite sure what ‘implicit canonical’ means either.

Detecting differential modification at single base positions

modkit dmr pair \ -a SDL-1_pileup.bed.gz \ -a SDT-1_pileup.bed.gz \ -b SDL-2_pileup.bed.gz \ -b SDT-1_pileup.bed.gz \ -o SLSD_dmr/SLSD \ --ref $genome \ --base A \ -t 8 \ -f \ --log-filepath SLSD_dmr/dmr_multi.log

SLSD:

chr01 428377 428378 . chr01 428819 428820 . chr01 428876 428877 . chr01 428877 428878 . chr01 428912 428913 . chr01 428921 428922 . chr01 428926 428927 . chr01 428934 428935 . chr01 428944 428945 . chr01 428951 428952 . chr01 429027 429028 . chr01 429028 429029 . chr01 429077 429078 . chr01 429078 429079 . chr01 429192 429193 . chr01 429196 429197 . chr01 429207 429208 . chr01 429218 429219 . chr01 429223 429224 . chr01 429224 429225 . chr01 429229 429230 . chr01 429234 429235 . chr01 429239 429240 . chr01 429250 429251 . chr01 429253 429254 . chr01 429264 429265 . chr01 429330 429331 . -0.2595111954850844 a:0 1 a:0 1 a:0.00 a:0.00 0 0 1 0 1 0 50 50 1 0 0.25131442828090655 a:0 1 a:1 2 a:0.00 a:50.00 0 0.5 0.9321833846700683 -0.5 0.9321833846700683 -0.5 50 100 - - -0.29257296267155475 a:3 4 a:3 4 a:75.00 a:75.00 0.75 0.75 1 0 1 0 100 100 1,1 0,0 -0.2595111954850844 a:1 1 a:1 1 a:100.00 a:100.00 1 1 1 0 1 0 50 50 1 0 -0.3006875128982047 a:0 2 a:0 2 a:0.00 a:0.00 0 0 1 0 1 0 100 100 1,1 0,0 -0.2595111954850844 a:0 1 a:0 1 a:0.00 a:0.00 0 0 1 0 1 0 50 50 1 0 -0.27243263335211054 a:18 26 a:25 34 a:69.23 a:73.53 0.6923077 0.7352941 1 -0.042986425339366585 1 0.015837104072398134 100 100 1,1 -0.04029304029304026,0 -0.33202943912385763 a:0 10 a:0 5 a:0.00 a:0.00 0 0 1 0 1 0 100 100 1,1 0,0 0.3431471740471608 a:45 45 a:45 46 a:100.00 a:97.83 1 0.9782609 1 0.021739130434782594 1 0.021739130434782594 100 100 1,1 0.0357142857142857,0 1.1360555327687223 a:11 36 a:5 36 a:30.56 a:13.89 0.30555555 0.1388889 0.1987868090862803 0.16666666666666669 0.17709312475070266 0.16666666666666669 100 100 0.10617958219713411,1 0.3000000chr01 429017 429018 . -0.2769449125416088 a:49 63 a:45 60 a:77.78 a:75.00 0.7777778 0.75 1 0.02777777777777779 1 0.04086021505376347 100 100 0.8874779988874301,1 0.07027649769585254,0 -0.20933341083340906 a:37 54 a:38 52 a:68.52 a:73.08 0.6851852 0.7307692 1 -0.045584045584045496 1 -0.026353276353276334 100 100 0.781323132265617,1 -0.08333333333333337,0 0.24018299513881303 a:11 14 a:11 18 a:78.57 a:61.11 0.78571427 0.6111111 0.5489948392453672 0.17460317460317454 0.8185923129030523 0.10317460317460314 100 100 0.3876206189317763,1 0.3555555chr01 429029 429030 . 0.25131442828090655 a:0 1 a:1 2 a:0.00 a:50.00 0 0.5 0.9321833846700683 -0.5 0.9321833846700683 -0.5 50 100 - - -0.3295915950843096 a:44 61 a:47 64 a:72.13 a:73.44 0.72131145 0.734375 1 -0.01306352459016391 1 -0.017708333333333326 100 100 1,1 -0.025252525252525193,0 -0.2776317365982791 a:0 1 a:0 2 a:0.00 a:0.00 0 0 1 0 1 0 50 100 - - -0.2584535893156925 a:14 52 a:15 49 a:26.92 a:30.61 0.26923078 0.30612245 1 -0.036891679748822626 1 -0.04326923076923078 100 100 0.8664138256652923,1 -0.10287081339712917,0 0.2908615565132848 a:5 16 a:2 14 a:31.25 a:14.29 0.3125 0.14285715 0.5291319892925755 0.16964285714285715 0.403690372946813 0.17857142857142858 100 100 0.21078940602940324,1 0.428571428571428chr01 429201 429202 . -0.3076466561820299 a:0 3 a:0 2 a:0.00 a:0.00 0 0 1 0 1 0 100 50 - - -0.3101721989665336 a:1 31 a:1 27 a:3.23 a:3.70 0.032258064 0.037037037 1 -0.004778972520908004 1 -0.005128205128205131 100 100 1,1 -0.024242424242424246,0 -0.2994907503308539 a:24 27 a:30 33 a:88.89 a:90.91 0.8888889 0.90909094 1 -0.02020202020202022 1 0.016826923076923128 100 100 1,1 0,0 -0.26578015463795346 a:1 15 a:1 10 a:6.67 a:10.00 0.06666667 0.1 1 -0.03333333333333334 1 -0.02857142857142858 100 100 1,1 0,0 -0.3006875128982047 a:0 2 a:0 2 a:0.00 a:0.00 0 0 1 0 1 0 100 100 1,1 0,0 0.6238435756252585 a:0 17 a:1 11 a:0.00 a:9.09 0 0.09090909 1 -0.09090909090909091 1 -0.1 100 100 0.5762151737229898,1 -0.3333333333333333,0 -0.1418632197203351 a:3 6 a:4 6 a:50.00 a:66.67 0.5 0.6666667 0.8376587524553095 -0.16666666666666663 0.837695883342009 -0.16666666666666669 100 100 0.7710310269619939,1 -0.25,0 0.0001373942800455552 a:678 701 a:700 730 a:96.72 a:95.89 0.9671897 0.9589041 1 0.0102679011129716 1 0.0102679011129716 100 100 1,1 0.020855904658721558,-0.00029547916871863755 -0.27925811985412086 a:4 5 a:3 4 a:80.00 a:75.00 0.8 0.75 1 0.050000000000000044 0.7710310269619939 0.25 100 100 0.9321833846700683,1 0.5,0 -0.2595111954850844 a:0 1 a:0 1 a:0.00 a:0.00 0 0 1 0 1 0 50 50 1 0 0.25131442828090655 a:1 2 a:0 1 a:50.00 a:0.00 0.5 0 0.9321833846700683 0.5 0.9321833846700683 0.5 50 50 0.9321833846700683 0.5 -0.18012700383224 a:479 656 a:497 694 a:73.02 a:71.61 0.73018295 0.7161383 1 0.012090022653402976 1 0.022678026199152934 100 100 1,1 0.043030674504778044,0.001280409731113985

I don’t know which sites can be considered as sites with significant differential methylation . Can you give me an example?

ArtRand commented 1 month ago

Hello @JhinJhinJhin,

I apologize for the slow response, I've been OOO.

Your commands for pileup, extract calls, and motif bed look good to me.

I would like to know what conditions satisfy the site that can be considered as a modified site on the read. Is it when 'fail' and 'inferred' are both false? And I’m not quite sure what ‘implicit canonical’ means either.

When you use modkit extract calls the call_code column indicates the predicted base modification status for each position with base modification probabilities in each read. The code is specified in the modBAM (by the base modification model). In the case of m6A, a will be m6A and - will be canonical. As per the documentation the fail column indicates if the base modification call probability is below the pass threshold. The inferred column only means whether or not a canonical call was assumed to be canonical due to the absence of a base modification probability, see the SAM specification, page 7 for details. For your application, using the DRACH model, you don't need to worry about "implicit canonical" since any of the motif models (DRACH, CpG) should not emit these kinds of tags.

I don’t know which sites can be considered as sites with significant differential methylation . Can you give me an example?

As far as the scoring, please refer to the documentation on the models, for your application I suggest looking at the MAP-based p-value. You can sort the results either by the score column or the MAP-based p-value column (find the schema in the documentation). The next step depends on your biological question, you could, for example, look for transcripts with the most differentially methylated sites. You may also be interested in the entropy command. You could calculate the entropy on each transcript and then look at the differences between the two samples.

JhinJhinJhin commented 1 month ago

Thank you for your patient reply! But I still have some questions.

You may also be interested in the entropy command. You could calculate the entropy on each transcript and then look at the differences between the two samples.

After I get the entropy, then how do I calculate if the hughs from two different samples are significantly different on a transcript?

And I want to detect m5C and pseU modification on RNA, what --motif parameter shoud I write to modkit motif bed , modkit pileup etc.?

Taylorain commented 1 month ago

Hi, I'm also using modkit for DRS analysis, but I'm unsure about how to determine the filter threshold and --mod-thresholds for different base modifications (m6A, m5C, pseU). Is this the correct code to use?

modkit sample-probs -t 64 --log-filepath sample-probs.log /2.merge_bam/sorted_bam \
--out-dir /2.merge_bam/sample_probs.output --hist --no-sampling

If I don't specify the 'filter threshold' and '--mod-thresholds' parameters, what are the default threshold values?

ArtRand commented 1 month ago

Hello @JhinJhinJhin,

After I get the entropy, then how do I calculate if the hughs from two different samples are significantly different on a transcript?

I don't have a worked out solution for this, adding differential entropy detection is something I'm working on. I would start with plotting the distribution of difference in entropy between your samples.

And I want to detect m5C and pseU modification on RNA, what --motif parameter shoud I write to modkit motif bed , modkit pileup etc.?

You can use --motif C 0 for m5C and --motif T 0 for pseU, unless of course you're looking for a more specific motif for either of these bases.

ArtRand commented 1 month ago

Hello @Taylorain,

Your command looks correct. It will estimate the probability distributions for each of the modified primary bases and their respective chemical modifications. The table (schema here) can be used to determine a threshold value for each modified base and canonical call. There is documentation describing how to use the --filter-threshold and --mod-threshold arguments. The default behavior is to estimate a pass threshold for each primary sequence base with observed modifications. The details are here, but briefly the algorithm will attempt to filter out (i.e. remove) the 10% lowest confidence calls. If you have concerns with the per-modification probabilities using sample-probs is a great first step.

Taylorain commented 1 month ago

@ArtRand I get, thanks pretty much!

tamalarafat commented 1 week ago

Hello, @JhinJhinJhin, @ArtRand Regarding your 1st question: "I would like to know what conditions satisfy the site that can be considered as a modified site on the read. Is it when 'fail' and 'inferred' are both false? And I’m not quite sure what ‘implicit canonical’ means either."

A site can be considered modified based on the Dorado model used for calling the modification—in your case, the m6A_DRACH model. When you apply modkit, it assesses the modification status at each site (specifically the A base for m6A). For each genomic position where an A is found, the extent of modification is detected and quantified, ranging from 0 to any number.

As for your second question: "Detecting differential modification at single base positions"

Using the m6A Dorado model, a site is considered significantly differentially modified if there is a statistically significant difference in modification rates between two conditions (e.g., condition A vs. condition B). The xPore documentation outlines an approach using Z-statistics to determine significance, although direct xPore analysis isn’t possible with Dorado basecalling outputs (e.g., .bam files from nanopore RNA/DNA-seq). However, you can apply similar statistical approaches in R or any language of your choice. I would say processing your modkit pileup results to generate .bed files for each sample, then applying statistical tests (e.g., p-values for each site) to filter for significant sites.

If it helps, I have developed a pipeline that does these tasks, including identifying significant modification sites, motif detection (like DRACH or others), and standard RNA-seq analysis. Let me know if you'd like more details.

JhinJhinJhin commented 1 week ago

Hello, @tamalarafat . Thans for your helpful reply ! I am glad to konw more details about your pipeline!

tamalarafat commented 1 week ago

Hello, @JhinJhinJhin Please check out my "nanopore_direct_rna_seq_pipeline" github repository. The pipeline can be used for gene expression analysis and RNA/DNA modification analysis.

JhinJhinJhin commented 1 week ago

Hello, @tamalarafat Thanks for sharing!

nanoporetech / modkit

m6A methylation analysis and explanation of output files #276

methylation distribution on read

Detecting differential modification at single base positions