nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/
Other
139 stars 7 forks source link

m6A methylation analysis and explanation of output files #276

Open JhinJhinJhin opened 1 week ago

JhinJhinJhin commented 1 week ago

methylation distribution on read

I want to analyze the RNA m6A modification. I selected m6A_DRACH model when dorado basecaller. Then I use modkit for downstream analysis. modkit pileup $bam \ ${id}_pileup.bed \ --ref $ref \ -t 12 --log-filepath ${id}_pileup.log

bgzip -@ 4 -k $bed tabix -@ 4 -p bed ${bed}.gz modkit motif bed DM8.1.fa DRACH 2 1> DM8_DRACH.bed modkit extract calls -t 8 $bam extract_calls/SDL-1_ref_calls.tsv --ref $genome --include-bed DM8_DRACH.bed I use modkit extract calls command to generate a table of read-level base modification calls.

read_id forward_read_position ref_position chrom mod_strand ref_strand ref_mod_strand fw_soft_clipped_start fw_soft_clipped_end read_length call_prob call_code base_qual ref_kmer query_kmer canonical_base modified_primary_base fail inferred within_alignment flag ad2e3781-d1bf-4457-a6e5-e48638151a51 1177 901581 chr01 + - - 0 87 1299 0.9941406 - 24 TGTTT AAACA A A false false true 16 ad2e3781-d1bf-4457-a6e5-e48638151a51 1082 901676 chr01 + - - 0 87 1299 0.9980469 - 28 TGTCA TGACA A A false false true 16 ad2e3781-d1bf-4457-a6e5-e48638151a51 1008 901750 chr01 + - - 0 87 1299 0.9980469 - 23 GGTTT AAACC A A false false true 16 ad2e3781-d1bf-4457-a6e5-e48638151a51 955 901803 chr01 + - - 0 87 1299 0.9433594 - 12 TGTTC GAACA A A true false true 16 ad2e3781-d1bf-4457-a6e5-e48638151a51 887 901881 chr01 + - - 0 87 1299 0.9980469 - 17 TGTCA TGACA A A false false true 16 ad2e3781-d1bf-4457-a6e5-e48638151a51 878 901890 chr01 + - - 0 87 1299 0.9980469 - 23 AGTTT AAACT A A false false true 16 ad2e3781-d1bf-4457-a6e5-e48638151a51 809 901960 chr01 + - - 0 87 1299 0.9980469 - 20 AGTCC GGACT A A false false true 16 ad2e3781-d1bf-4457-a6e5-e48638151a51 786 901983 chr01 + - - 0 87 1299 0.8808594 - 19 AGTCA TGACT A A true false true 16 c90cf7b9-674f-46d6-bfa4-936058b6ea0e 126 971421 chr01 + - - 0 68 1258 0.8222656 a 23 TGTTA TAACA A A true false true 16 c90cf7b9-674f-46d6-bfa4-936058b6ea0e 54 972330 chr01 + - - 0 68 1258 0.9980469 - 15 AGTTT AAACT A A false false true 16 c90cf7b9-674f-46d6-bfa4-936058b6ea0e 2 972382 chr01 + - - 0 68 1258 0.9980469 - 19 TGTCA TGACA A A false false true 16 6ade67ef-11c0-4ea6-8755-cbc0da1a4f1f 253 901581 chr01 + - - 2 80 340 0.9980469 - 19 TGTTT AAACA A A false false true 16 6ade67ef-11c0-4ea6-8755-cbc0da1a4f1f 176 970230 chr01 + - - 2 80 340 0.9980469 - 30 GGTCT AGACC A A false false true 16 6ade67ef-11c0-4ea6-8755-cbc0da1a4f1f 163 970243 chr01 + - - 2 80 340 0.9902344 - 31 TGTTA TAACA A A false false true 16 6ade67ef-11c0-4ea6-8755-cbc0da1a4f1f 89 970317 chr01 + - - 2 80 340 0.9980469 - 17 GGTTT AAACC A A false false true 16 6ade67ef-11c0-4ea6-8755-cbc0da1a4f1f 36 970370 chr01 + - - 2 80 340 0.9980469 - 27 TGTTC GAACA A A false false true 16

I would like to know what conditions satisfy the site that can be considered as a modified site on the read. Is it when 'fail' and 'inferred' are both false? And I’m not quite sure what ‘implicit canonical’ means either.

Detecting differential modification at single base positions

modkit dmr pair \ -a SDL-1_pileup.bed.gz \ -a SDT-1_pileup.bed.gz \ -b SDL-2_pileup.bed.gz \ -b SDT-1_pileup.bed.gz \ -o SLSD_dmr/SLSD \ --ref $genome \ --base A \ -t 8 \ -f \ --log-filepath SLSD_dmr/dmr_multi.log

SLSD:

chr01 428377 428378 . -0.2595111954850844 a:0 1 a:0 1 a:0.00 a:0.00 0 0 1 0 1 0 50 50 1 0 chr01 428819 428820 . 0.25131442828090655 a:0 1 a:1 2 a:0.00 a:50.00 0 0.5 0.9321833846700683 -0.5 0.9321833846700683 -0.5 50 100 - - chr01 428876 428877 . -0.29257296267155475 a:3 4 a:3 4 a:75.00 a:75.00 0.75 0.75 1 0 1 0 100 100 1,1 0,0 chr01 428877 428878 . -0.2595111954850844 a:1 1 a:1 1 a:100.00 a:100.00 1 1 1 0 1 0 50 50 1 0 chr01 428912 428913 . -0.3006875128982047 a:0 2 a:0 2 a:0.00 a:0.00 0 0 1 0 1 0 100 100 1,1 0,0 chr01 428921 428922 . -0.2595111954850844 a:0 1 a:0 1 a:0.00 a:0.00 0 0 1 0 1 0 50 50 1 0 chr01 428926 428927 . -0.27243263335211054 a:18 26 a:25 34 a:69.23 a:73.53 0.6923077 0.7352941 1 -0.042986425339366585 1 0.015837104072398134 100 100 1,1 -0.04029304029304026,0 chr01 428934 428935 . -0.33202943912385763 a:0 10 a:0 5 a:0.00 a:0.00 0 0 1 0 1 0 100 100 1,1 0,0 chr01 428944 428945 . 0.3431471740471608 a:45 45 a:45 46 a:100.00 a:97.83 1 0.9782609 1 0.021739130434782594 1 0.021739130434782594 100 100 1,1 0.0357142857142857,0 chr01 428951 428952 . 1.1360555327687223 a:11 36 a:5 36 a:30.56 a:13.89 0.30555555 0.1388889 0.1987868090862803 0.16666666666666669 0.17709312475070266 0.16666666666666669 100 100 0.10617958219713411,1 0.3000000chr01 429017 429018 . -0.2769449125416088 a:49 63 a:45 60 a:77.78 a:75.00 0.7777778 0.75 1 0.02777777777777779 1 0.04086021505376347 100 100 0.8874779988874301,1 0.07027649769585254,0 chr01 429027 429028 . -0.20933341083340906 a:37 54 a:38 52 a:68.52 a:73.08 0.6851852 0.7307692 1 -0.045584045584045496 1 -0.026353276353276334 100 100 0.781323132265617,1 -0.08333333333333337,0 chr01 429028 429029 . 0.24018299513881303 a:11 14 a:11 18 a:78.57 a:61.11 0.78571427 0.6111111 0.5489948392453672 0.17460317460317454 0.8185923129030523 0.10317460317460314 100 100 0.3876206189317763,1 0.3555555chr01 429029 429030 . 0.25131442828090655 a:0 1 a:1 2 a:0.00 a:50.00 0 0.5 0.9321833846700683 -0.5 0.9321833846700683 -0.5 50 100 - - chr01 429077 429078 . -0.3295915950843096 a:44 61 a:47 64 a:72.13 a:73.44 0.72131145 0.734375 1 -0.01306352459016391 1 -0.017708333333333326 100 100 1,1 -0.025252525252525193,0 chr01 429078 429079 . -0.2776317365982791 a:0 1 a:0 2 a:0.00 a:0.00 0 0 1 0 1 0 50 100 - - chr01 429192 429193 . -0.2584535893156925 a:14 52 a:15 49 a:26.92 a:30.61 0.26923078 0.30612245 1 -0.036891679748822626 1 -0.04326923076923078 100 100 0.8664138256652923,1 -0.10287081339712917,0 chr01 429196 429197 . 0.2908615565132848 a:5 16 a:2 14 a:31.25 a:14.29 0.3125 0.14285715 0.5291319892925755 0.16964285714285715 0.403690372946813 0.17857142857142858 100 100 0.21078940602940324,1 0.428571428571428chr01 429201 429202 . -0.3076466561820299 a:0 3 a:0 2 a:0.00 a:0.00 0 0 1 0 1 0 100 50 - - chr01 429207 429208 . -0.3101721989665336 a:1 31 a:1 27 a:3.23 a:3.70 0.032258064 0.037037037 1 -0.004778972520908004 1 -0.005128205128205131 100 100 1,1 -0.024242424242424246,0 chr01 429218 429219 . -0.2994907503308539 a:24 27 a:30 33 a:88.89 a:90.91 0.8888889 0.90909094 1 -0.02020202020202022 1 0.016826923076923128 100 100 1,1 0,0 chr01 429223 429224 . -0.26578015463795346 a:1 15 a:1 10 a:6.67 a:10.00 0.06666667 0.1 1 -0.03333333333333334 1 -0.02857142857142858 100 100 1,1 0,0 chr01 429224 429225 . -0.3006875128982047 a:0 2 a:0 2 a:0.00 a:0.00 0 0 1 0 1 0 100 100 1,1 0,0 chr01 429229 429230 . 0.6238435756252585 a:0 17 a:1 11 a:0.00 a:9.09 0 0.09090909 1 -0.09090909090909091 1 -0.1 100 100 0.5762151737229898,1 -0.3333333333333333,0 chr01 429234 429235 . -0.1418632197203351 a:3 6 a:4 6 a:50.00 a:66.67 0.5 0.6666667 0.8376587524553095 -0.16666666666666663 0.837695883342009 -0.16666666666666669 100 100 0.7710310269619939,1 -0.25,0 chr01 429239 429240 . 0.0001373942800455552 a:678 701 a:700 730 a:96.72 a:95.89 0.9671897 0.9589041 1 0.0102679011129716 1 0.0102679011129716 100 100 1,1 0.020855904658721558,-0.00029547916871863755 chr01 429250 429251 . -0.27925811985412086 a:4 5 a:3 4 a:80.00 a:75.00 0.8 0.75 1 0.050000000000000044 0.7710310269619939 0.25 100 100 0.9321833846700683,1 0.5,0 chr01 429253 429254 . -0.2595111954850844 a:0 1 a:0 1 a:0.00 a:0.00 0 0 1 0 1 0 50 50 1 0 chr01 429264 429265 . 0.25131442828090655 a:1 2 a:0 1 a:50.00 a:0.00 0.5 0 0.9321833846700683 0.5 0.9321833846700683 0.5 50 50 0.9321833846700683 0.5 chr01 429330 429331 . -0.18012700383224 a:479 656 a:497 694 a:73.02 a:71.61 0.73018295 0.7161383 1 0.012090022653402976 1 0.022678026199152934 100 100 1,1 0.043030674504778044,0.001280409731113985

I don’t know which sites can be considered as sites with significant differential methylation . Can you give me an example?

ArtRand commented 1 week ago

Hello @JhinJhinJhin,

I apologize for the slow response, I've been OOO.

Your commands for pileup, extract calls, and motif bed look good to me.

I would like to know what conditions satisfy the site that can be considered as a modified site on the read. Is it when 'fail' and 'inferred' are both false? And I’m not quite sure what ‘implicit canonical’ means either.

When you use modkit extract calls the call_code column indicates the predicted base modification status for each position with base modification probabilities in each read. The code is specified in the modBAM (by the base modification model). In the case of m6A, a will be m6A and - will be canonical. As per the documentation the fail column indicates if the base modification call probability is below the pass threshold. The inferred column only means whether or not a canonical call was assumed to be canonical due to the absence of a base modification probability, see the SAM specification, page 7 for details. For your application, using the DRACH model, you don't need to worry about "implicit canonical" since any of the motif models (DRACH, CpG) should not emit these kinds of tags.

I don’t know which sites can be considered as sites with significant differential methylation . Can you give me an example?

As far as the scoring, please refer to the documentation on the models, for your application I suggest looking at the MAP-based p-value. You can sort the results either by the score column or the MAP-based p-value column (find the schema in the documentation). The next step depends on your biological question, you could, for example, look for transcripts with the most differentially methylated sites. You may also be interested in the entropy command. You could calculate the entropy on each transcript and then look at the differences between the two samples.

JhinJhinJhin commented 1 week ago

Thank you for your patient reply! But I still have some questions.

You may also be interested in the entropy command. You could calculate the entropy on each transcript and then look at the differences between the two samples.

After I get the entropy, then how do I calculate if the hughs from two different samples are significantly different on a transcript?

And I want to detect m5C and pseU modification on RNA, what --motif parameter shoud I write to modkit motif bed , modkit pileup etc.?

Taylorain commented 4 days ago

Hi, I'm also using modkit for DRS analysis, but I'm unsure about how to determine the filter threshold and --mod-thresholds for different base modifications (m6A, m5C, pseU). Is this the correct code to use?

modkit sample-probs -t 64 --log-filepath sample-probs.log /2.merge_bam/sorted_bam \
--out-dir /2.merge_bam/sample_probs.output --hist --no-sampling

If I don't specify the 'filter threshold' and '--mod-thresholds' parameters, what are the default threshold values?

ArtRand commented 3 days ago

Hello @JhinJhinJhin,

After I get the entropy, then how do I calculate if the hughs from two different samples are significantly different on a transcript?

I don't have a worked out solution for this, adding differential entropy detection is something I'm working on. I would start with plotting the distribution of difference in entropy between your samples.

And I want to detect m5C and pseU modification on RNA, what --motif parameter shoud I write to modkit motif bed , modkit pileup etc.?

You can use --motif C 0 for m5C and --motif T 0 for pseU, unless of course you're looking for a more specific motif for either of these bases.

ArtRand commented 3 days ago

Hello @Taylorain,

Your command looks correct. It will estimate the probability distributions for each of the modified primary bases and their respective chemical modifications. The table (schema here) can be used to determine a threshold value for each modified base and canonical call. There is documentation describing how to use the --filter-threshold and --mod-threshold arguments. The default behavior is to estimate a pass threshold for each primary sequence base with observed modifications. The details are here, but briefly the algorithm will attempt to filter out (i.e. remove) the 10% lowest confidence calls. If you have concerns with the per-modification probabilities using sample-probs is a great first step.

Taylorain commented 2 days ago

@ArtRand I get, thanks pretty much!