Open billytcl opened 9 months ago
Hello @billytcl,
There isn't a simple modkit
command to do this, you would need to write a data frame transformation to get an aggregation of methylation percentage over read positions. I can share some polars code with you if you need something ASAP.
Given the importance of cfDNA analysis, I think it makes sense to make an optional output table with the base modification call for each position on the read. I can try to have this feature in v0.2.3 coming out this week.
Thanks! If possible, it would be great to generate it for the aligned read offset position as well, in addition to the read's position?
Hello @billytcl,
With modkit 0.2.3 you can produce a table of base modification calls at the read level with modkit extract
using the --read-calls <output-tsv>
option. Thee documentation of the schema is here. Passing the --read-calls
option will trigger modkit
to calculate the pass threshold the same way as it does in pileup
. All of the same options to override the pass threshold and change the estimation settings are available as well.
If you want to aggregate the calls by read position (or aligned position) you can use the following polars code:
import polars as pl
X_calls = pl.read_csv("modkit_calls.tsv", separator="\t")
X_aligned = X_calls.filter(pl.col("within_alignment") & ~pl.col("fail"))\
.with_columns([
(pl.col("forward_read_position") - pl.col("fw_soft_clipped_start")).alias("forward_aligned_position"),
(pl.col("call_code") != "-").alias("modified_call")
])
read_position_max = 200
X_aligned_position_mean_methylation = X_aligned\
.filter(pl.col("forward_read_position") < read_position_max)\
.group_by("forward_aligned_position").agg([
pl.col("modified_call").mean().alias("mean_methylation")
])\
.sort("forward_aligned_position")
Thanks!! This is very useful. Am I correct in assuming that it will not print out things that fail the threshold?
Hello @billytcl,
The command will output a row for all base modification calls in the read. There is a column called fail
that will be true iff the call_prob
is below the pass threshold. The documentation was missing this column, I apologize for that. I've updated the description of the table schema and edited the polars command above.
Hi @ArtRand, I am attempting to integrate the methylation percentage data obtained from pileup discussed in #129 with the data on read_position from modkit extract. Applying your code does not yield results because I have forward_read_position values exceeding 200. This happens with many samples. How can I adapt your code to address this? Here my R code:
library(readr)
library(dplyr)
X_calls <- read_delim("Linfo_mt6394_calls_extract_m.tsv", delim = "\t")
X_aligned <- filter(X_calls, X_calls$within_alignment %in% "TRUE" & X_calls$fail %in% "FALSE") %>%
mutate(forward_aligned_position = forward_read_position - fw_soft_clipped_start, modified_call = call_code != "-")
read_position_max = 200
X_aligned_position_mean_methylation <- X_aligned %>%
filter(forward_read_position < ) %>%
group_by(forward_aligned_position) %>%
summarize(mean_methylation = mean(modified_call)) %>%
arrange(forward_aligned_position)
Thanks
I'm looking at the new cfDNA methylation document put out by ONT https://community.nanoporetech.com/knowledge/know-how/cfDNA-methyl-profile and it has some figures on % methylation by read position.
Is there an easy way to generate this with modkit? There is the parse command but it outputs only the read position and mod qual, but not the actual call itself based on modkit's thresholding (which I'm guessing is what the plot in the post is describing -- post threshold % methylation calls). I would have to generate the table, then manually fish out what the threshold would be from the log file of modkit pileup, then do some more math, etc. Is there a way to do this quickly in modkit?