nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/
Other
128 stars 7 forks source link

Is there an easy way to generate % methylation by read position? #87

Open billytcl opened 9 months ago

billytcl commented 9 months ago

I'm looking at the new cfDNA methylation document put out by ONT https://community.nanoporetech.com/knowledge/know-how/cfDNA-methyl-profile and it has some figures on % methylation by read position.

Is there an easy way to generate this with modkit? There is the parse command but it outputs only the read position and mod qual, but not the actual call itself based on modkit's thresholding (which I'm guessing is what the plot in the post is describing -- post threshold % methylation calls). I would have to generate the table, then manually fish out what the threshold would be from the log file of modkit pileup, then do some more math, etc. Is there a way to do this quickly in modkit?

ArtRand commented 9 months ago

Hello @billytcl,

There isn't a simple modkit command to do this, you would need to write a data frame transformation to get an aggregation of methylation percentage over read positions. I can share some polars code with you if you need something ASAP.

Given the importance of cfDNA analysis, I think it makes sense to make an optional output table with the base modification call for each position on the read. I can try to have this feature in v0.2.3 coming out this week.

billytcl commented 9 months ago

Thanks! If possible, it would be great to generate it for the aligned read offset position as well, in addition to the read's position?

ArtRand commented 9 months ago

Hello @billytcl,

With modkit 0.2.3 you can produce a table of base modification calls at the read level with modkit extract using the --read-calls <output-tsv> option. Thee documentation of the schema is here. Passing the --read-calls option will trigger modkit to calculate the pass threshold the same way as it does in pileup. All of the same options to override the pass threshold and change the estimation settings are available as well.

If you want to aggregate the calls by read position (or aligned position) you can use the following polars code:

import polars as pl

X_calls = pl.read_csv("modkit_calls.tsv", separator="\t")

X_aligned = X_calls.filter(pl.col("within_alignment") & ~pl.col("fail"))\
    .with_columns([
        (pl.col("forward_read_position") - pl.col("fw_soft_clipped_start")).alias("forward_aligned_position"),
        (pl.col("call_code") != "-").alias("modified_call")
    ])

read_position_max = 200
X_aligned_position_mean_methylation = X_aligned\
    .filter(pl.col("forward_read_position") < read_position_max)\
    .group_by("forward_aligned_position").agg([
        pl.col("modified_call").mean().alias("mean_methylation")
    ])\
    .sort("forward_aligned_position")
billytcl commented 9 months ago

Thanks!! This is very useful. Am I correct in assuming that it will not print out things that fail the threshold?

ArtRand commented 9 months ago

Hello @billytcl,

The command will output a row for all base modification calls in the read. There is a column called fail that will be true iff the call_prob is below the pass threshold. The documentation was missing this column, I apologize for that. I've updated the description of the table schema and edited the polars command above.

Macdot3 commented 7 months ago

Hi @ArtRand, I am attempting to integrate the methylation percentage data obtained from pileup discussed in #129 with the data on read_position from modkit extract. Applying your code does not yield results because I have forward_read_position values exceeding 200. This happens with many samples. How can I adapt your code to address this? Here my R code:

library(readr)
library(dplyr)
X_calls <- read_delim("Linfo_mt6394_calls_extract_m.tsv", delim = "\t")

X_aligned <- filter(X_calls, X_calls$within_alignment %in% "TRUE" & X_calls$fail %in% "FALSE") %>% 
             mutate(forward_aligned_position = forward_read_position - fw_soft_clipped_start, modified_call = call_code != "-")

read_position_max = 200

X_aligned_position_mean_methylation <- X_aligned %>%
  filter(forward_read_position < ) %>%
  group_by(forward_aligned_position) %>%
  summarize(mean_methylation = mean(modified_call)) %>%
  arrange(forward_aligned_position)

Thanks