nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/
Other
128 stars 7 forks source link

`dmr` with coverage threshold #83

Closed Ge0rges closed 9 months ago

Ge0rges commented 10 months ago

I was wondering if it's possible to filter dmr results with a minimum coverage for any given modification. i.e. it has to be seen in X% of reads etc.? I'm trying how to understand how to apply the min coverage and q value filters offered by methylkit using the score produced by dmr.

ArtRand commented 10 months ago

Hello @Ge0rges,

I think there should be some filters for bedMethyl records going into DMR. Filtering on valid coverage certainly makes sense. You could filter the bedMethyl on column 5 before running DMR, and I can add this option to the next version.

Regarding,

results with a minimum coverage for any given modification. i.e. it has to be seen in X% of reads etc.?

I'm less certain about how to approach this. What are you observing?

From a quick read of the MethylKit docs, I think you could use a valid coverage filter for min coverage. From there, you can calculate the percent methylation difference by parsing columns 10 and 11 in the modkit dmr output and subtracting them. I've attached some polars code for doing this parsing. To decide on a score threshold, I recommend visualizing the alignments in IGV at different scores. You can also plot the scores and fit a distribution.

You could transform the output of modkit pileup to be compatible with MehylKit (in the C/5mC case) and compare the results. Not to detract anything from MethylKit, but often packages designed for BS-Seq will make different decisions around modeling and data requirements that may not hold for Nanopore data. But if you have an example where the modkit dmr method is not performing well I'd be more than happy to help work through it.

dmr_dtypes = {
    "chrom": str,
    "start": int,
    "stop": int,
    "name": str,
    "score": float,
    "a_counts": str,
    "a_total": int,
    "b_counts": str,
    "b_total": int,
    "a_frac": str,
    "b_frac": str
}
X_dmr = pl.read_csv(dmr_fp, separator="\t", has_header=False, dtypes=dmr_dtypes).lazy()

def etl_dmr_bed(X, min_score=4):
    pattern = r"([0-9,.]{3,})"
    X_dmr_cleaned = (
        X.filter(
            (pl.col("a_frac") != ".") & (pl.col("b_frac") != ".") & (pl.col("score") >= min_score)
        )
        .with_columns(
            [
                pl.col("a_frac").str.split(",").alias("tmp.a_split"),
                pl.col("b_frac").str.split(",").alias("tmp.b_split"),
            ]
        )
        .with_columns(
            [
                pl.col("tmp.a_split")
                .list.first()
                .str.extract(pattern=pattern)
                .alias("tmp.raw_a_h_pct"),
                pl.col("tmp.a_split")
                .list.last()
                .str.extract(pattern=pattern)
                .alias("tmp.raw_a_m_pct"),
                pl.col("tmp.b_split")
                .list.first()
                .str.extract(pattern=pattern)
                .alias("tmp.raw_b_h_pct"),
                pl.col("tmp.b_split")
                .list.last()
                .str.extract(pattern=pattern)
                .alias("tmp.raw_b_m_pct"),
            ]
        )
        .with_columns(
            [
                pl.col("tmp.raw_a_h_pct").cast(pl.Float64).alias("h_a"),
                pl.col("tmp.raw_a_m_pct").cast(pl.Float64).alias("m_a"),
                pl.col("tmp.raw_b_h_pct").cast(pl.Float64).alias("h_b"),
                pl.col("tmp.raw_b_m_pct").cast(pl.Float64).alias("m_b"),
            ]
        )
        .select(~cs.matches("tmp*"))
        .collect()
    )
    return X_dmr_cleaned

X_dmr_filt = etl_dmr_bed(X_dmr) # optionally with min_score == 0 to keep everything
Ge0rges commented 10 months ago

Great! Thank you for this thorough answer. I'm definitely interested in moving away from methylkit which has been behind the many questions.

ArtRand commented 10 months ago

I imagine there will be a fair number of people who are familiar with BS-Seq packages, so any feedback that helps that transition is appreciated.

Ge0rges commented 10 months ago

I would say the main thing that methylkit does which modkit lacks and is in scope for it, is dmr across multiple samples rather than just pairwise comparison. So the calculation of the mean methylation patterns across samples, and the dmr between each sample and that baseline.

Ge0rges commented 10 months ago

@ArtRand I think I didn't get what you meant by

I think you could use a valid coverage filter for min coverage

as I don't see a flag for that. Did you mean simply filtering the input bam file for reads that have a certain coverage only? I was wondering if I would end up losing some data, versus looking at coverage only at methylated positions, but maybe I haven't wrapped my head around this well.

Out of curiosity (and much anticipation) I was wondering when the next release is planned for?

ArtRand commented 10 months ago

@Ge0rges sorry for the slow response,

If I recall correctly, I was suggesting that you can use the N_valid_coverage column in the bedMethyl for any filtering on coverage (i.e. number of reads reporting methylation status at a site). There is no flag for this, so you'll have to do it on the files yourself ahead of modkit dmr or MethylKit analysis. I'm going to make this an command line option in the next version for dmr along with support for testing individual bases instead of regions.

Release would normally be this week, but it's a holiday week in the US - so probably early next week.

Ge0rges commented 10 months ago

Not slow! Thank you for there reply. Happy thanksgiving!