nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/
Other
136 stars 7 forks source link

Obtain Global methylation level of the genome #173

Open OceaneMion opened 5 months ago

OceaneMion commented 5 months ago

Hi all, I would like to know how can we obtain the global methylation % of a genome using modkit pileup. I know that with the fraction modified I Can derive for example the % of methylated cpg, but I would like a more global view taking also in consideration the other nucleotides for the calculation

I know the calculation will look something like : (Number of methylated CpG / Genome size )*100

Also I am applying no filtering to the modkit pileup as I want all the possible probability per site, so I don't know if a global methylation level % is possible?

Thanks in avance for your help.

ArtRand commented 5 months ago

Hello @OceaneMion,

It depends on your biological question. Of course you can calculate the ratio however you want, a couple ways I can think of:

  1. Number of modified CpG calls divided by total number of CpGs (optionally ignoring 5hmC):

This command: modkit pileup $modbam $pileup_bed --combine-strands --cpg --only-tabs [--ignore h]

will produce a pileup BED file, then aggregate with awk:

awk '$4=="m" {can+=$13; mod+=$12; oth+=$14; valid+=$10} \
  END{print (can/valid) " CpG canonical\n" (mod/valid) " 5mCpG modified\n" (oth/valid) " 5hmCpG modified"}' ${pileup_bed}

N.b. I added --only-tabs so the polars parsing code below will work.

  1. Alternatively, if you want to know the number of methylated CpG residues as a ratio of the total genome size you can no longer use the total number of calls (i.e. reads) that report a site as methylated. To do this, you'll need a slightly more complicated script such as this python code:
import polars as pl
X_pileup = pl.read_csv(
        pileup_fp,
        separator="\t",
        has_header=False,
        dtypes={
            "chrom": str,
            "start": int,
            "end": int,
            "name": str,
            "score": int,
            "strand": str,
            "tstart": int,
            "tend": int,
            "color": str,
            "coverage": int,
            "freq": float,
            "mod": int,
            "canon": int,
            "other": int,
            "del": int,
            "fail": int,
            "diff": int,
            "nocall": int,
        },
    )

# load a dictionary of chrom names to the chrom size
genome_sizes = ...

# You'll have to decide on threshold level over which you consider a site as "methylated".
threshold = 0.8 

for chrom, X_group in X_pileup.group_by("chrom"):
    num_modified = X_group.filter(pl.col("freq") > threshold).shape[0]
    genome_size = genome_sizes[chrom]
    results.append({
        "chrom": chrom, "frac_modified": num_modified / genome_size
    })

X_results = pl.DataFrame(results)

If I were looking for changes in methylation levels, I'd use method (1) since that calculation really gets at what you want to know: how many reads are calling 5mC vs canonical throughout the genome. Method (2) basically does the same thing, except now you need to decide when a site is "methylated" (I've written it as when a site is >80% modified). There are more clever ways to do this as well such as something like the model used in DMR.

Hope this helps.

mvolar commented 4 months ago

There are more clever ways to do this as well such as something like the model used in DMR.

So, would it be reasonable to say that we can also fit a beta distribution for each position in our genome and then find the "most modified bases" using a threshold i.e. 80% of the bases on that location are modified? Using the approach specified above on moderate to low coverages tends to yield to many results.

I have included an example python code below. The though process is to repeat this across my genome (the only problem is my prior will always be non-informative alpha=1,beta=1) and then set a fixed threshold (50%) to find the most likely methylated positions in my sample and than start from there.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

# Observed data
f = 0.2  # fraction of modified bases
n = 50   # total number of bases
k = f * n  # number of modified bases

# Prior parameters (non-informative prior)
alpha_prior = 1
beta_prior = 1

# Posterior parameters
alpha_post = alpha_prior + k
beta_post = beta_prior + (n - k)

# Define the Beta distribution
x = np.linspace(0, 1, 100)
y_prior = beta.pdf(x, alpha_prior, beta_prior)
y_post = beta.pdf(x, alpha_post, beta_post)

# Plot the prior and posterior distributions
plt.plot(x, y_prior, color='blue', linestyle='dashed', label='Prior')
plt.plot(x, y_post, color='red', label='Posterior')
plt.title('Prior and Posterior Beta Distributions')
plt.xlabel('Proportion of Modified Bases')
plt.ylabel('Density')
plt.legend()
plt.show()

# Calculate the probability that the true proportion is greater than a threshold
threshold = 0.5
probability = 1 - beta.cdf(threshold, alpha_post, beta_post)
print(f"Probability that the true proportion is greater than {threshold}: {probability}")
ArtRand commented 4 months ago

Hello @mvolar,

Using a Beta distribution to model the probability of modification at each position is a reasonable way to do it, in fact, this is how the MAP-based P-value works, I've also suggested something similar in the past. The only problem with the Beta is that it will not allow you to use multiple modifications (5hmC and 5mC), so you only get a posterior probability of modification as you've shown.

Maybe you could tell me what you're trying to do and I could help you more.

mvolar commented 4 months ago

Hi,

basically I have cell line mouse genome with both knockout and pseudowildtype/control sequenced at a relatively low coverage (15-20ish). I have already performed DMR detection, however this is only for detection of differentially methylated regions.

I would now like to find all "significantly methylated" CpG-s in both my datasets to see how genome wide methylation patterns behave (i.e. genes, introns, intragenic regions, repetitive regions etc. etc.), purely for explorative purposes. To do this I need a way to see which of my 21~ mil rows in the table are indeed methylated since the cells are highly synchronous (i.e. the probability for uniform distribution (either 0 or 1) on a single CpG is the highest)

ArtRand commented 4 months ago

Hello @mvolar,

Since you have two conditions, one thing you might want to try is using the segmentation function in modkit dmr pair. From there you can get regions of similar methylation levels as well as different. The segments that are "different" should have methylation levels that are different between the two conditions, presumably one of the two conditions will have high methylation levels while the other has low levels. The segments labeled as "same" may also be interesting, you can filter down to segments where both conditions have high methylation levels by parsing column 11 or 12. Let me know what you think.

mvolar commented 4 months ago

Just tried that, the problem is because of the low coverage the segmentation data is very bad at recognizing the differentially methylated segments i.e. I get whole chromosomes labeled as "same" , and the output is rather small/noninformative (see attachment, this is event with fine-grained flag) segments.zip

OceaneMion commented 4 months ago

modkit pileup $modbam $pileup_bed --combine-strands --cpg --only-tabs [--ignore h]

Hi I do not want to apply any threshold the goal is to have all the potential methylated sites, so as long as the threshold is > 0, I want to consider the site as methylated I also want to have the number of CpG instead of the genome size how to obtain that ? Not at the level of reads but on the assembly directly ?

So to be more precise I would like to have the number of CpG, number of 5hmC and number of 5mC at the level of my assembly and not at the level of reads

ArtRand commented 4 months ago

Hello @mvolar,

One thing you could try is increasing the --significance-factor in modkit dmr pair --segment. Increasing this number will have the effect that smaller differences (or less evidence) will be required for the model to transition to the "different" state. Perhaps obviously, the trade-off is an increase in false positives.

ArtRand commented 4 months ago

@OceaneMion I believe I've answered this question in https://github.com/nanoporetech/modkit/issues/192?