nanoporetech / megalodon

Megalodon is a research command line tool to extract high accuracy modified base and sequence variant calls from raw nanopore reads by anchoring the information rich basecalling neural network output to a reference genome/transriptome.
Other
197 stars 30 forks source link

confidence of identified CpG methylation #39

Closed michieitel closed 4 years ago

michieitel commented 4 years ago

Hi!

I was running megalodon v.2.1.1 (installed via conda) on my R9.4 reads from 1 MinION runs as well as 1 PromethION run (total roughly 100x coverage of my target genome).

I ran this command:

megalodon \
--guppy-config dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg \
--guppy-server-path=/home/ubuntu/tools/ont-guppy/bin/guppy_basecall_server \
--devices 0 1 \
--processes $NUMTH \
--guppy-params "--num_callers 8 --ipc_threads 12" \
--mod-motif Z CG 0 \
--outputs "mods" \
--mod-output-formats "bedmethyl"  \
--mod-binary-threshold 0.5 \
--guppy-timeout 10.0 \
--output-directory ${REF%.fasta}_${SET1}_megalodon_5MC \
--reference $REF \
--overwrite \
$FAST5 

This resulted in ~99.1% of all CpG positions being methylated when filtering for sites that occur in both runs (min & prom) even when being super stringent (--mod-binary-threshold 0.5).

This number seemed very high to me so I decided to test megalodon using a published Drosophila 1D PCR-free dataset (NCBI accession SRR8627923). Using the same settings as above resulted in 96% modified CpG, a number far off the published methylation % for Drosophila melanogaster. Since the authors of the Drosophila study also performed a MinION run on PCR amplified DNA I was able to compare the signals from both runs. The PCR library, which theoretically should show <1% methylation, however, showed 79% methylated CpG sites. The pure, noise free signal, accounts for roughly 17% methylated CpGs, a number closer but still way higher to what has been published for other adult Drosophila samples.

I am now wondering if (1) the noise in nanopore methylation calls is always that high that I HAVE TO run a PCR sample as negative control? (2) I have run a different model for the calls? (note: I got the exact same methylated CpGc for Drosophila when using 'res_dna_r941_min_modbases-all-context_v001.cfg' provided with the latest rerio repo)

Looking forward to some insights and advice on how to proceed with my own data.

Michael

marcus1487 commented 4 years ago

The first point here is that the --mod-binary-threshold value of 0.5 is less stringent than the default of 0.8. This value represents the minimum probability in order to include a modified base call in the aggregated output. Thus setting this value to 0.5 forces every site to be called for every read. For a model with only one alternative base per canonical base, 0.5 is equivalent to 0. Setting this value to 0.8 ignores uncertain calls in the range 0.5-0.8. Thus setting the value higher, but less than 1, would enforce a more stringent filter. See docs for the option here. Also see example cutoffs in the calibration plots for each model in this code base (see example here). You can create this plot with ground truth data using the megalodon_extras calibrate modified_bases (see docs here).

The second question would be how these percentages of methylation are being computed. Is this using a single threshold value on the percent methylation? What is this threshold value? I'm not sure exactly what is meant by "The pure, noise free signal, accounts for roughly 17% methylated CpGs". Could you elaborate on how this computation is different from the first set of computations?

Finally, for the highest accuracy CpG methylation calls, I would recommend the models from Rerio. These models show significant improvement over the guppy released modbase models and the all-context model for 5mC CpG methylation detection. We are working on a technical document to describe the advantages and recommended use cases for each model, but this is not currently released.

michieitel commented 4 years ago

Hey Marcus,

thanks for your help

1) Maybe I misunderstood your previous explanation (issue #22) where you mentioned to lower the threshold to reduce the number of methylated sites. I will run again with 0.8 (0.75 seems default according to the manual btw). 2) I ran megalodon on the Drosophila set as I did with mine using a single --mod-binary-threshold of 0.5. I then calculated the total number of CpG occurrences. The % I was referring to was then the ratio of mCpG/CpG. 3) So which model should I be using for the Drosophila and/or my sponge dataset? res_dna_r941_min/prom_modbases_5mC_CpG_v001.cfg?

thanks Michael

marcus1487 commented 4 years ago
  1. Issue #22 mentions that lowering the binary threshold would increase the number of per-read calls made and thus included in the aggregated results, but that this would increase the noise in each per-site call. The docs incorrectly have 0.75 as default. The command line help is the most reliable place to find the current defaults, but I will update the docs with this default.

  2. What exactly is the process to "calculated the total number of CpG occurrences"? The bedmethyl file contains an entry for each site covered by a valid modified base call. The last field contains the percentage of reads covering that position which are identified as modified/methylated. Thus counting all sites found in the file does not convert to the number of methylated sites. In order to count the number of completely methylated versus the number of completely canonical sites one might compare the number of site with percent methylation >95% with the number of site with percent methylation <5% for example. This level of aggregation depends a lot on users end goals.

  3. This will be included in a future technical document that I will try to remember to link here, but for 5mC CpG methylation I would recommend the res_dna_r941_min/prom_modbases_5mC_CpG_v001.cfg models. For 5mC or 6mA outside of CpG contexts I would recommend the all-context model in Rerio. The model in guppy is likely not optimal for any purpose, but may show better calls in E. coli 5mC and 6mA contexts, though the all-context model may show better performance here as well.

marcus1487 commented 4 years ago

Have the issues here been resolved?

michieitel commented 4 years ago

For now yes. Thanks. I will get back in case I have further questions.

Michael

daggoo commented 3 years ago

Hello! I am sorry if this is a bit off topic. @michieitel, you mentioned in your first post that you analyzed with megalodon amplified Drosophila DNA from the published dataset (SRR8627923) . I have checked the NCBI archive and see only PCR-free experiments. I was wondering whether you could provide a link to the amplified experiment? I am working with some Drosophila samples but do not have negative control yet.

michieitel commented 3 years ago

Hey @daggooo,

all reads of the study are associated to Bioproject PRJNA515844 in NCBI. The study can be found at http://dx.doi.org/10.1016/j.cell.2015.04.021

Best Michael

daggoo commented 3 years ago

I found the dataset. Thanks a lot, @michieitel !