nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/
Other
139 stars 8 forks source link

m6A analysis #290

Closed Taylorain closed 19 hours ago

Taylorain commented 1 week ago

Hi, thank you for creating and maintaining this excellent tool. I have encountered a few questions while using modkit for m6A detection.

  1. Extension of Regions with modkit pileup and modkit dmr pair:
    When using modkit commands for m6A detection, should I extend each DRACH motif by 10 bp on both sides, similar to how the MINES tool (cDNA_MINES.py) extends regions to increase sensitivity?

    Reference for MINES tool extension: "For m6A detection, MINES tool (cDNA_MINES.py) with default parameters was implemented to detect m6A modification based on the de novo model, in which all regions containing a DRACH motif were identified and a new set of regions was generated by extending 10 bp on both sides of the ‘A’ within the DRACH motifs. These regions with coverage > 5 were filtered and subjected to further analysis."

  2. Identifying Significant Differential Methylation:
    In my analysis with modkit dmr pair, what is the recommended approach to identify significantly altered m6A modifications between groups? Should I use the following criteria: a log2(fold change) cut-off greater than 0.5 and a p-value lower than 0.05?

    • Calculating log2(fold change):
      Does the following code correctly calculate the log2(fold change) for each modification site?
      awk '{
      a_pct = $12;   
      b_pct = $13;   
      if (a_pct == 0) a_pct = 0.0001;   
      log2fc = log(b_pct / a_pct) / log(2);  
      print $0, log2fc;
      }' dmr_output_file > output_with_log2fc.txt
  3. Using a_mod_percentage as the fraction:
    Can I use the a_mod_percentage field as the fraction to apply additional criteria such as “fraction > 0.7 and coverage > 10” for filtering?

  4. Visualizing Methylation Profiles:
    Is there a way to visualize methylation profiles using modkit, or do you recommend an external tool for visualization?

5.How can I calculate the number of methylated sites on a read or within a transcript using modkit?

Thank you for your guidance!

ArtRand commented 6 days ago

Hello @Taylorain,

Sorry for the delay getting back.

  1. It's unnecessary to expand the region like this.

what is the recommended approach to identify significantly altered m6A modifications between groups? Should I use the following criteria: a log2(fold change) cut-off greater than 0.5 and a p-value lower than 0.05?

  1. I don't have a firm recommendation for the decision function to use, since it somewhat depends on your sample and biological question. What you've proposed seems reasonable, however, and I may consider adding log2 FC to the output. Your calculation is correct (although you may add the constant to both a and b).

  2. Apologies, the header [a|b]_mod_percentage is incorrect it should be [a|b]_mod_fraction. The schema in the documentation is correct. Your filtering seems reasonable.

  3. I don't have a tool to strongly recommend. Adding some more visualization in modkit itself is on the roadmap, however. So I'd be curious about what you're looking for. Some people have reported that they like methylartist.

  4. To calculate the total number of methylated bases on a read, use modkit extract calls it produces a table of per-read metrics. To calculate the number of modified bases on a per-transcript level use modkit stats where your --regions argument has the entire transcript.

Thanks for the excellent questions and noticing that the header is incorrect.

Taylorain commented 5 days ago

Thank you for your patience and assistance.

Your calculation is correct (although you may add the constant to both a and b).

I have a control group and a treatment group. I would like to clarify my understanding of the calculation for log2(fc + 1). Is it correct to express it as:

log2(fc + 1) = log((b_pct / a_pct) + 1) / log(2)
ArtRand commented 3 days ago

@Taylorain

In the calculations I've seen that use a pseudocount they add the pseudocount to both samples (e.g. here).

So: log2fc = (a_frac + 0.01).log2() - (b_frac + 0.01).log2.

Or if you only have the natural log in awk:

log2fc = (log(a_frac + 0.01) - log(b_frac + 0.01)) / log(2)
Taylorain commented 19 hours ago

I get, thanks pretty much!