Closed Ge0rges closed 9 months ago
Hello @Ge0rges,
I think there should be some filters for bedMethyl records going into DMR. Filtering on valid coverage certainly makes sense. You could filter the bedMethyl on column 5 before running DMR, and I can add this option to the next version.
Regarding,
results with a minimum coverage for any given modification. i.e. it has to be seen in X% of reads etc.?
I'm less certain about how to approach this. What are you observing?
From a quick read of the MethylKit docs, I think you could use a valid coverage filter for min coverage. From there, you can calculate the percent methylation difference by parsing columns 10 and 11 in the modkit dmr
output and subtracting them. I've attached some polars code for doing this parsing. To decide on a score threshold, I recommend visualizing the alignments in IGV at different scores. You can also plot the scores and fit a distribution.
You could transform the output of modkit pileup
to be compatible with MehylKit (in the C/5mC case) and compare the results. Not to detract anything from MethylKit, but often packages designed for BS-Seq will make different decisions around modeling and data requirements that may not hold for Nanopore data. But if you have an example where the modkit dmr
method is not performing well I'd be more than happy to help work through it.
dmr_dtypes = {
"chrom": str,
"start": int,
"stop": int,
"name": str,
"score": float,
"a_counts": str,
"a_total": int,
"b_counts": str,
"b_total": int,
"a_frac": str,
"b_frac": str
}
X_dmr = pl.read_csv(dmr_fp, separator="\t", has_header=False, dtypes=dmr_dtypes).lazy()
def etl_dmr_bed(X, min_score=4):
pattern = r"([0-9,.]{3,})"
X_dmr_cleaned = (
X.filter(
(pl.col("a_frac") != ".") & (pl.col("b_frac") != ".") & (pl.col("score") >= min_score)
)
.with_columns(
[
pl.col("a_frac").str.split(",").alias("tmp.a_split"),
pl.col("b_frac").str.split(",").alias("tmp.b_split"),
]
)
.with_columns(
[
pl.col("tmp.a_split")
.list.first()
.str.extract(pattern=pattern)
.alias("tmp.raw_a_h_pct"),
pl.col("tmp.a_split")
.list.last()
.str.extract(pattern=pattern)
.alias("tmp.raw_a_m_pct"),
pl.col("tmp.b_split")
.list.first()
.str.extract(pattern=pattern)
.alias("tmp.raw_b_h_pct"),
pl.col("tmp.b_split")
.list.last()
.str.extract(pattern=pattern)
.alias("tmp.raw_b_m_pct"),
]
)
.with_columns(
[
pl.col("tmp.raw_a_h_pct").cast(pl.Float64).alias("h_a"),
pl.col("tmp.raw_a_m_pct").cast(pl.Float64).alias("m_a"),
pl.col("tmp.raw_b_h_pct").cast(pl.Float64).alias("h_b"),
pl.col("tmp.raw_b_m_pct").cast(pl.Float64).alias("m_b"),
]
)
.select(~cs.matches("tmp*"))
.collect()
)
return X_dmr_cleaned
X_dmr_filt = etl_dmr_bed(X_dmr) # optionally with min_score == 0 to keep everything
Great! Thank you for this thorough answer. I'm definitely interested in moving away from methylkit
which has been behind the many questions.
I imagine there will be a fair number of people who are familiar with BS-Seq packages, so any feedback that helps that transition is appreciated.
I would say the main thing that methylkit
does which modkit
lacks and is in scope for it, is dmr
across multiple samples rather than just pairwise comparison. So the calculation of the mean methylation patterns across samples, and the dmr
between each sample and that baseline.
@ArtRand I think I didn't get what you meant by
I think you could use a valid coverage filter for min coverage
as I don't see a flag for that. Did you mean simply filtering the input bam file for reads that have a certain coverage only? I was wondering if I would end up losing some data, versus looking at coverage only at methylated positions, but maybe I haven't wrapped my head around this well.
Out of curiosity (and much anticipation) I was wondering when the next release is planned for?
@Ge0rges sorry for the slow response,
If I recall correctly, I was suggesting that you can use the N_valid_coverage
column in the bedMethyl for any filtering on coverage (i.e. number of reads reporting methylation status at a site). There is no flag for this, so you'll have to do it on the files yourself ahead of modkit dmr
or MethylKit analysis. I'm going to make this an command line option in the next version for dmr
along with support for testing individual bases instead of regions.
Release would normally be this week, but it's a holiday week in the US - so probably early next week.
Not slow! Thank you for there reply. Happy thanksgiving!
I was wondering if it's possible to filter
dmr
results with a minimum coverage for any given modification. i.e. it has to be seen in X% of reads etc.? I'm trying how to understand how to apply the min coverage and q value filters offered bymethylkit
using the score produced bydmr
.