nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/
Other
127 stars 7 forks source link

polars script modkit extract #133

Open Macdot3 opened 6 months ago

Macdot3 commented 6 months ago

Hi, I am attempting to integrate the methylation percentage data obtained from pileup discussed in #129 with the data on read_position from modkit extract. Applying polars code does not yield results because I have forward_read_position values exceeding 200. This happens with many samples. How can I adapt your code to address this? Here my R code:

library(readr)
library(dplyr)
X_calls <- read_delim("Linfo_mt6394_calls_extract_m.tsv", delim = "\t")

X_aligned <- filter(X_calls, X_calls$within_alignment %in% "TRUE" & X_calls$fail %in% "FALSE") %>% 
             mutate(forward_aligned_position = forward_read_position - fw_soft_clipped_start, modified_call = call_code != "-")

read_position_max = 200

X_aligned_position_mean_methylation <- X_aligned %>%
  filter(forward_read_position < ) %>%
  group_by(forward_aligned_position) %>%
  summarize(mean_methylation = mean(modified_call)) %>%
  arrange(forward_aligned_position)

Thanks for support #87

ArtRand commented 6 months ago

Hello @Macdot3,

You can skip filtering on read_position_max so the code from #87 becomes


import polars as pl

X_calls = pl.read_csv("modkit_calls.tsv", separator="\t")

X_aligned = X_calls.filter(pl.col("within_alignment") & ~pl.col("fail"))\
    .with_columns([
        (pl.col("forward_read_position") - pl.col("fw_soft_clipped_start")).alias("forward_aligned_position"),
        (pl.col("call_code") != "-").alias("modified_call")
    ])

X_aligned_position_mean_methylation = X_aligned\
    .group_by("forward_aligned_position").agg([
        pl.col("modified_call").mean().alias("mean_methylation")
    ])\
    .sort("forward_aligned_position")

it is worth keeping in mind that this code will calculate the %-modification at each read position combining all modifications together (i.e. it won't separate 5hmC and 5mC the way the pileup does).

ArtRand commented 5 months ago

@Macdot3 any luck here?

Macdot3 commented 5 months ago

Hi Arthur, after your advice I managed to determine the % methylation for read_position but I get few reads precisely because this code will calculate the % change in each read position by combining all the changes together. So I'm currently trying the pileup approach and seeing if I can adapt your code. Do you have any other advice on the matter? I am also trying to calculate the percentage in this way: first determine which and how many bases of the mitochondrial genome I find in forward_read_position and then on the basis of the coverage find the percentage of methylation. It can be useful? I accept all useful advice. Thank you very much for your availability

ArtRand commented 5 months ago

Hello @Macdot3,

I'm a little unclear about what you're trying to do, exactly. Are you trying to determine the genome positions with base modifications in the mitochondrial genome? Are you trying to find reads that have high levels of modification (as compared to low levels)? I'm not sure from your description.

Could you tell me the number you're trying to calculate?/

In general, pileup is for calculating the frequency of modification at each genomic position. I.e. it will tell you at position X there Y% of the reads report a modification. On the other hand extract is for analysis of read-level information. I.e. which reads have >90% modification or is there a correspondence between read position and modification calling, etc.

Macdot3 commented 5 months ago

Hello Arthur, I apologize if I haven't been very clear. I'll try to explain myself better, and if it's still unclear, please let me know. I want to calculate the percentage of methylation at each modified base called on each sequenced read position of the mitochondrial genome. I don't need to know whether the y% of reads is methylated or not at position x, or rather it could be something additional but it's not my priority.

Let me also explain what I'm trying to calculate. I obtained the total coverage using samtools depth for my samples for each position of the mitochondrial genome. Then I extracted the values of forward_read_position from the tsv output of modkit extract. With a small function, I determined how many times the positions of the mitochondrial genome are present in forward_read_position and reported the values in a column that I called reads_meth. Finally, I calculated the percentage of methylation for each base of the genome by dividing the values of reads_meth by COV_TOT (reads_meth/COV_TOT). I realize that perhaps the explanation is a bit convoluted, and I hope you understood. Does any of this align with what you mentioned? Thank you very much for your support.

ArtRand commented 5 months ago

@Macdot3 I really think what you want is the output from modkit pileup.

I want to calculate the percentage of methylation at each modified base called on each sequenced read position of the mitochondrial genome.

For a given sequencing read, any position on that read will be called as having a base modification or canonical. The output of modkit pileup aggregates the reads at every genomic position and counts how many reads were called as having a modified base or canonical base aligned to that genomic position. The same way as samtools depth (or more accurately samtools mpileup) will tell you how many reads have matching/mismatching bases or other edits at every position across the genome.

Could you define the quantity you're trying to calculate in terms of the columns output by modkit pileup?

Macdot3 commented 5 months ago

Hi Arthur, I greatly appreciate your insights. I've been working on calculating a percentage value, experimenting with various approaches. It seems that the percentage I'm aiming for can be somewhat analogous to the fraction modified column (Nmod / Nvalid_cov) in the output of modkit pileup (without applying any filters to generate the bed files). However, I've encountered some confusion regarding the values I've been observing. Specifically, Nvalid_cov, which I believe represents the coverage value for the modified base (please correct me if I'm wrong), didn't align with the COV_TOT value obtained using samtools. Furthermore, the values produced by the "fraction modified" column left me puzzled. For instance, I noticed readings of 100.0 in certain positions, whereas I anticipated much lower values. Is it possible that I misinterpreted something along the way? Due to these discrepancies, I've decided to explore the modkit extract route. Once again, I'm turning to you for guidance. Thank you sincerely for your assistance.

ArtRand commented 5 months ago

Hello @Macdot3,

I'm pretty sure you want the fraction modified value, column 11. The value for N_valid can differ from samtools depth for a couple reasons:

  1. A read's modification call is filtered out due to low confidence.
  2. A read base does not match the expected base for the modification (in this case it will be tabulated in the N_diff column).

It's hard for me to comment on why a position might have a fraction modified value different from what you would expect. One idea is that if a position has just one or two reads (N_valid equals 1 or 2), it's possible that just by chance that one read calls a modified base. In that case you'll get 100% modified, but with just 1 read it's hard to make any conclusions.