nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/

Strange bias between percent_modified and valid_coverage from modkit pileup #262

Open · flofloflo-cell opened this issue 1 month ago

flofloflo-cell commented 1 month ago

Hi,

I am interested in context-specific methylation events, and I am seeing a strange association between the predicted percentage of modification and the coverage at a given site. I followed the default dorado > modkit pipeline from the Nanopore documentation (auto model selection from dorado and auto threshold setting from modkit).

In the plot below, intermediate levels of modification become less frequent as the coverage goes up.

[image: heatmap of percent_modified vs. valid_coverage]

Thanks!!

Theo-Nelson commented 1 month ago

Hi flofloflo-cell,

Not the developer, but I think for this issue it would be helpful if you a) posted the modkit filtering thresholds that were calculated, and b) ran modkit sample-probs as described here (https://github.com/nanoporetech/dorado/issues/951#issuecomment-2262628434) to see whether you have lower-quality modification calls in your analysis.

Feel free to post the sampled histograms for feedback. Best of luck!

Sincerely, Theo

flofloflo-cell commented 1 month ago

Hi @Theo-Nelson,

Thanks for the feedback. For this sample, modkit calculated the following filtering thresholds:

I ran the sample-probs command and plotted the probabilities.tsv output:

[image: histograms of sampled modification call probabilities]

As a test, I tried increasing --filter-threshold to 0.9 for A and replicated the plot from my previous post using only As, but I am still getting the same pattern:

[image: heatmap of percent_modified vs. valid_coverage, A calls only, --filter-threshold 0.9]

Best, Flo

ArtRand commented 1 month ago

Hello @flofloflo-cell,

I ran a quick simulation, and it seems that the pattern you're seeing is expected. Here's how it works:

\displaylines{
  \textbf{C} \sim \text{Pois}(100) \\
  \textbf{M} \sim \text{Beta}(0.5,\ 0.5) \\
  n_x \leftarrow \textbf{C} \\
  p_x \leftarrow \textbf{M} \\
  \text{me}_x \sim \text{Binom}(n_x,\ p_x) \\
  \%\text{mod}_x = \frac{\text{me}_x}{n_x} \times 100
}

where $n_x$ is the valid coverage at site $x$, $p_x$ is the site's underlying methylation fraction, $\text{me}_x$ is the number of reads called as modified, and $\%\text{mod}_x$ is the reported percent modified.

I simulated 30K sites.
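For reference, here's a minimal sketch of that simulation in Python (numpy assumed; the variable names mirror the formulas above and are mine, not modkit's):

```python
import numpy as np

rng = np.random.default_rng(42)
n_sites = 30_000

n_x = np.clip(rng.poisson(lam=100, size=n_sites), 1, None)  # C ~ Pois(100), avoid zero coverage
p_x = rng.beta(0.5, 0.5, size=n_sites)                      # M ~ Beta(0.5, 0.5)
me_x = rng.binomial(n_x, p_x)                               # me_x ~ Binom(n_x, p_x)
pct_mod = 100 * me_x / n_x                                  # %mod_x
```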

If we look at the marginal distributions of the coverage and the percent modification, we can see the expected pattern given the simulation setup, including a lower, but present, density of sites with intermediate levels of methylation.

[image: marginal distributions of coverage and percent modification in the simulation]

When I attempt to make your heatmap, I can see a qualitatively similar pattern (although I think the strange semi-circles are just an artifact of the binning):

[image: simulated heatmap of percent modification vs. coverage]

Overall, my explanation for the pattern is that intermediate $p$ values around 50% methylation are rare, so the joint probability of an intermediate $p_x$ and a high (or low) $n_x$ is also quite low. That's why you see the half-moon shape in the middle. To see intermediate methylation levels at higher coverage you'll need very high coverage overall, but of course these sites will still only constitute a tiny fraction of the population. I also think the shading in your plots isn't quite showing the full range of density (not claiming that mine is perfect either).
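To make "intermediate $p$ values are rare" concrete (my addition, but it follows directly from the Beta(0.5, 0.5) choice above): the density

$$f(p) = \frac{1}{\pi\sqrt{p(1-p)}}$$

takes its minimum at $p = 0.5$, where $f(0.5) = 2/\pi \approx 0.64$, and diverges as $p \to 0$ or $p \to 1$, so sites with intermediate methylation are the rarest under this prior.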

Your probability histograms look completely normal to me. As an aside, more recent versions of modkit will produce HTML plots for you when you use --hist.

flofloflo-cell commented 1 month ago

Hello @ArtRand,

Thank you for the reply. Running a simulation is definitely a great idea. The circular patterns/strange semi-circles on the left of the plot are indeed an artifact of the few discrete values percent_modified can take when valid_coverage is low (at a coverage of 5, for example, percent_modified can only be 0, 20, 40, 60, 80, or 100).

Nevertheless, I feel that the patterns in the simulation and the observed data are not exactly the same. I placed two boxes on each plot, centered around the mean of valid_coverage, which in my opinion show that the simulated pattern is more symmetrical than the observed one.

[image: simulated vs. observed heatmaps with comparison boxes around the mean valid_coverage]

I also reproduced the simulation in order to compare it to the observed data more accurately (except that I resampled from the observed valid_coverage instead of the Poisson distribution, to get a more comparable spread on the x-axis).

[image: simulation resampled from the observed valid_coverage, next to the observed data]

Intermediate methylation levels are indeed rare in my dataset, and the simulation plot definitely shows that the joint probability of intermediate $p_x$ and high/low $n_x$ is low. In the observed data, however, the joint probability of intermediate $p_x$ seems higher at low $n_x$ than at high $n_x$, which does not feel intuitive at all. If intermediate methylation can already be observed at lower coverage, it is also counterintuitive that even higher sequencing depth would be required to observe intermediate methylation at higher coverage.

To provide another example, below is the data (from the same dataset) but for a specific 6mA target motif (~15,000 adenines). Positions with methylation < 80% represent ~5% of the dataset (>700 adenines), and the same asymmetry can be observed.

[image: heatmap for the specific 6mA target motif]

Thanks for the help!

Best, Flo

marcus1487 commented 1 month ago

You have to match both marginal distributions in your simulation. Could you also sample from the percent methylated distribution? This will give a good indication of whether there is a linked effect between coverage and percent methylation, or whether the simulated distribution matches the observed one.

flofloflo-cell commented 1 month ago

Hi @marcus1487,

That's a good point, since the distribution of the observed percent_modified does look quite different from a Beta(0.5, 0.5) distribution. I have reproduced the plot, sampling from both observed marginal distributions (I also added side histograms and the median coverage as a red dotted line for reference).
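A minimal sketch of that resampling (the file name is a placeholder, and the column positions assume modkit's standard bedMethyl layout, where column 10 is Nvalid_cov and column 11 is the fraction modified):

```python
import numpy as np
import pandas as pd

# modkit's bedMethyl mixes tab and space delimiters, so split on any whitespace.
bed = pd.read_csv("pileup.bed", sep=r"\s+", header=None)
obs_cov = bed[9].to_numpy()    # Nvalid_cov (bedMethyl column 10)
obs_pct = bed[10].to_numpy()   # fraction modified, as a percentage (column 11)

# Resample each marginal independently, then draw modified counts binomially.
rng = np.random.default_rng(0)
n = len(bed)
n_x = rng.choice(obs_cov, size=n, replace=True)
p_x = rng.choice(obs_pct, size=n, replace=True) / 100
sim_pct = 100 * rng.binomial(n_x, p_x) / np.maximum(n_x, 1)
```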

The simulated pattern got closer to the observed data, but the intermediate methylation region looks much more symmetrical between higher and lower coverage in the simulation than in the observed data, in particular when comparing 50-75x and 125-150x depth.

[image: both-marginals simulation vs. observed data, with side histograms and median coverage line]

Best, Flo

ArtRand commented 1 month ago

@flofloflo-cell,

Do these plots combine the 6mA and 5mC(G) bedMethyl records? The percent-modified distribution makes me think you're mostly looking at 6mA calls (there will obviously be many more 6mA calls than CpG calls).

Another basic question: does the pattern change when you use --no-filtering during modkit pileup? It's possible that the intermediate $p_x$ contexts are lower confidence and are more often filtered out.

I agree that the latest pair of plots don't look similar, but it does feel like we're looking for subtle differences in the tails of these distributions. It might be worth running the simulation with the same number of samples as the observed data; I can see that the highest count in the observed plot is ~3x higher than in the simulation. To be honest, however, I don't expect the overall pattern to change much. Maybe also drop the column of zero percent-methylation to see more of the gradation in the tail.
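Something like the following sketch (not modkit output; the simulated data stands in for a real bedMethyl table, and the log color scale is my suggestion for showing the full density range):

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

# Stand-in data, following the simulation described earlier in the thread.
rng = np.random.default_rng(1)
n_x = np.clip(rng.poisson(100, 30_000), 1, None)
pct = 100 * rng.binomial(n_x, rng.beta(0.5, 0.5, n_x.size)) / n_x

mask = pct > 0  # drop the zero percent-methylation column
plt.hist2d(n_x[mask], pct[mask], bins=(60, 50), norm=LogNorm())
plt.xlabel("valid_coverage")
plt.ylabel("percent_modified")
plt.colorbar(label="site count (log scale)")
plt.show()
```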

Let's look at just one modification at a time (6mA or 5mC), though maybe you already have this, and try --no-filtering to see if the pattern persists.
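Splitting the bedMethyl records by modification is straightforward; a sketch, assuming modkit's standard bedMethyl layout (column 4 is the modified-base code, "a" for 6mA and "m" for 5mC; the file name is a placeholder):

```python
import pandas as pd

# modkit writes bedMethyl with mixed tab/space delimiters, so split on whitespace.
bed = pd.read_csv("pileup.bed", sep=r"\s+", header=None)
m6a = bed[bed[3] == "a"]   # 6mA records
m5c = bed[bed[3] == "m"]   # 5mC records
```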

flofloflo-cell commented 1 month ago

@ArtRand

Thanks for all the suggestions!

I started with the separate plots for 6mA and 5mC:

[image: separate heatmaps for 6mA and 5mC]

Then I generated the same plots with the --no-filtering option, which look quite different and much more symmetrical:

[image: 6mA and 5mC heatmaps with --no-filtering]

I plotted 'with' against 'without' filtering for percent_modified and valid_coverage to understand where the shift is coming from. With the --no-filtering option, there is a shift toward both higher coverage and intermediate methylation values, which seems to fill in the gap. So, as you suggested, the intermediate $p_x$ contexts seem to have lower confidence and are filtered out more often, leading to an overall lower coverage.

In your opinion, considering how much data is getting filtered out, are the intermediate methylation values biologically meaningful, or are they a sampling artifact resulting from the uncertainty of the basecalling model in these specific contexts?

[image: 'with' vs. 'without' filtering for percent_modified and valid_coverage]

Best, Flo

ArtRand commented 3 weeks ago

Hello @flofloflo-cell,

Thanks for performing this analysis. What appears to be happening (as I'm sure you've also noticed by now) is that there are positions where the model is systematically less confident. As a result, calls at these positions are more likely to fall below the pass threshold, giving lower valid_coverage. These positions are also lower accuracy, so the calls tend to sit closer to 50% (i.e., the model is closer to guessing).
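A toy simulation of that mechanism (entirely my construction; the 10% "hard" fraction, the confidence ranges, and the 0.8 pass threshold are invented for illustration, not modkit's values):

```python
import numpy as np

rng = np.random.default_rng(7)
n_sites, depth, pass_thresh = 5_000, 100, 0.8

true_p = rng.beta(0.5, 0.5, n_sites)
hard = rng.random(n_sites) < 0.1  # assume 10% of contexts are "hard" for the model

# True modification state of each read at each site.
is_mod = rng.random((n_sites, depth)) < true_p[:, None]

# Per-call confidence: near-chance for hard contexts, high elsewhere.
conf = np.where(hard[:, None],
                rng.uniform(0.5, 0.9, (n_sites, depth)),
                rng.uniform(0.9, 1.0, (n_sites, depth)))

# Each call is correct with probability equal to its confidence.
correct = rng.random((n_sites, depth)) < conf
called_mod = np.where(correct, is_mod, ~is_mod)

# Filtering: only calls at or above the pass threshold are counted.
passed = conf >= pass_thresh
valid_cov = passed.sum(axis=1)
with np.errstate(invalid="ignore"):
    pct_mod = 100 * (called_mod & passed).sum(axis=1) / valid_cov

# Hard contexts end up with lower valid_coverage and pct_mod pulled toward 50%.
print("mean valid_cov  easy:", valid_cov[~hard].mean(), " hard:", valid_cov[hard].mean())
```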

> In your opinion, considering how much data is getting filtered out, are the intermediate methylation values biologically meaningful, or are they a sampling artifact resulting from the uncertainty of the basecalling model in these specific contexts?

From your analysis I tend to think the latter. We can use this information to try and improve the models.