nanoporetech / rerio

Research release basecalling models and configurations
https://nanoporetech.com/
Other
102 stars 9 forks source link

Many methylation probabilities close to 0.5 using all-context model #12

Closed koenvandenberge closed 4 months ago

koenvandenberge commented 3 years ago

We are using megalodon for methylation calling and it's been working well for us. We have been exploring the results, and noticed that in the per_read_modified_base_calls.db file, the majority of methylation probabilities are close to 0.5, such as the figure below. image

It seems to us like these are noise, i.e., derived from nucleotides where the methylation calling model is unsure whether the base is methylated or not, and therefore sets a probability close to 50%. It would be good to know whether this intuition is correct and, if possible, if we can avoid these. Due to the high number of nucleotides with 50% methylation probability, our background methylation probability is now around 30-40%, while we would biologically expect it to be quite a bit lower.

marcus1487 commented 3 years ago

I have not observed this phenomenon. A little bit of information would be helpful to diagnose this issue.

The first would be the model used. Each model has a unique distribution of modified base probabilities. You can see the distributions used in model calibration for all released models in the Megalodon repo (example here).

The second question I have is exactly how the above probabilities were extracted from the per-read database. Could you provide a minimal bit of code that was used to extract these modified base probabilities?

mollybrothers commented 3 years ago

Hi Marcus,

I can provide some context (Koen and I are working together).

  1. We used the all-context rerio model res_dna_r941_min_modbases-all-context_v001.cfg. Looking at that model's distribution of mod probs (thanks for pointing us in the right direction), it doesn't seem to have a bump at 50% like we're seeing. We're working with budding yeast, which doesn't have any endogenous DNA methyltransferases, which is why we expected a very low background. (Luckily the signal we are looking for in our experiment does stick out very well, we are just curious about the 'high' background).

  2. I don't think we actually used the .db file, but rather the per_read_modified_base_calls.txt file and plotted the mod_log_prob frequencies. Koen can provide more info if I got this wrong.

marcus1487 commented 3 years ago

One point to double check here is that these log probabilities are natural logs (and not log 10). This might cause such a shift away from more confident calls.

For the high background, the all-context model is a bit more error prone than other models (though this does seem a bit high still for a 100% canonical control). The 5mC model has higher accuracy (if you are looking at 5mC and not 6mA) and a new version with better all-context performance should be released soon. The 6mA all-context model is not currently being updated, but this is on the medium term targets. Hopefully an updated model might resolve some of this high background.

marcus1487 commented 3 years ago

Ah. Also looking back at the plots for the all-context model, the chunk of calls at 50% does make sense. The plots are a bit tricky to interpret (and the all-context model has a bit of an irregular spread due to some unreleased algorithm tricks to train this model). So the top facet on these plots are the density of the ground truth calibration data. The bottom facet shows the calibrated log likelihood ratio (so 0 == 50% probability). If you take the region of the x-axis where the log likelihood ratio is near 0 (50% probability) and project this onto the top facet, you get a lot of data near 50%.

In the legend of the plot you can also see the fraction of data filtered out at each threshold level. Note that at the 75% probability threshold 82% of the data is filtered out (compare to 9% for res_dna_r941_min_modbases_5mC_CpG_v001.cfg). In other words, this value indicates that 82% of calls fall between 25% and 75% probability and would be filtered out in aggregation.

It seems you have used the model appropriately if you are finding the ground true signal coming out with more confident calls. I found that there was good signal from this model, but that there are many sites where the model is not confident one way or the other. I can confirm that this is due to the model and that an updated model should resolve this issue.

Given that this is a model issue, I will transfer this issue over to the Rerio repo.

mollybrothers commented 3 years ago

Thanks for the great explanation, Marcus. We are looking for m6A DNA methylation, so we'll keep an eye out for any updates to the rerio models.

I want to make sure I understand one point you made – so the per_read_modified_base_calls.txt file contains all data, nothing filtered out, correct? And sending it through aggregation tosses anything between 25%-75% probability and calls <25% as unmethylated and >75% methylated to give the information in the bedMethyl file? This might explain one other question I've had looking through the .bed file – the coverage of each adenine seems to vary quite a bit, even for adenines close together. Could this be due to the 82% of calls being thrown out in the aggregation step?

marcus1487 commented 3 years ago

Yes, the database and per-read text files contain all data. Nothing is filtered from these files (except for any —mod-motif s specified during initial processing). Yes, anything between 20% and 80% (default threshold is 80% not 75%) is tossed for aggregation using the default method. And this is indeed the reason for the difference in coverage in the bedmethyl file. Ive been meaning to fix the bedmethyl file so that the score field holds the total coverage and the second to last field holds the coverage contributing to the reported percent methylation.

One additional note, the megalodon threshold method ignores any call where the called base (canonical or methylated) is below the threshold level (80% by default). This doesn’t make a difference for this model but does for models like the 5mC+5hmC model with more than one alternative to a single canonical base.

amaslan commented 2 years ago

Hi Marcus, This thread was very helpful for us. We're seeing the same 0.5 bump with Megalodon using the rerio all-context model for calling mA. However, we do not see it with Guppy using the same model (see comparison below, you can ignore the barcodes). Is this difference due to how the reported probabilities are calculated for Megalodon vs. Guppy? Is there a way to convert the reported probabilities to an equivalent p(mA) for both basecallers?

guppy_meg

We see for both basecallers that the FPR is extremely low for calling mA. Even with a threshold of 1/255, we calculate FPR ~1% using an unmethylated control. We are trying to justify probability threshold choice and this low FPR would suggest that all mA calls should be included. How was the 0.8 default threshold chosen?

image

davidnewman02 commented 4 months ago

Closing this issue due to inactivity. Please checkout the latest developments in Dorado and Remora and feel free to open a new issue if there are still unresolved issues in the latest toolchains.