nanoporetech / megalodon

Megalodon is a research command line tool to extract high accuracy modified base and sequence variant calls from raw nanopore reads by anchoring the information rich basecalling neural network output to a reference genome/transriptome.
Other
197 stars 30 forks source link

modified_bases.5mC.bed does not correlate with the mod_mappings.5mC.bam in IGV bisulfite mode #261

Closed simpsonbp closed 2 years ago

simpsonbp commented 2 years ago

The modified_bases.5mC.bed does not correlate at each base position with the mod_mappings.5mC.bam in IGV bisulfite mode with the commands below:

ont-guppy/bin/guppy_basecaller --device cuda:0 --gpu_runners_per_device 4 --input_path fast5_all --save_path output_folder/basecall_5mC_5hmC_3294 --fast5_out --post_out --config res_dna_r941_min_modbases_5mC_5hmC_v001.cfg

megalodon output_folder/basecall_5mC_3294/workspace --do-not-use-guppy-server --overwrite --outputs basecalls mappings mod_mappings mods --mod-motif Z CG 0 --mod-map-emulate-bisulfite --mod-map-base-conv C T --mod-map-base-conv Z C --mappings-format bam --reference mm38_72Q_166kb_px551_gRNA4+5_refig2.fasta --allow-supplementary-alignments --output-directory megalodon_results_supplementary --disable-mod-calibration --devices cuda:all:100% --processes 20

The guppy basecall server did not work on the backend with guppy 6.0.1 so I basecalled before running megalodon.

Why the differences between the two outputs?

marcus1487 commented 2 years ago

Could you elaborate on exactly how the two outputs do not correlate? Note that ambiguous calls (by default calls between 20% and 80% confidence) are filtered in the aggregation step, so some calls from the bam files will not be accounted for in the bed file.

simpsonbp commented 2 years ago

Sorry for the confusion--correlate to mean the two outputs do not match at each position.

I'm using the .bam in IGV to present the overall 5mC methylation broadly and the .bed for the % methylation quantification at each position.

For example: At position 175 in the bam in IGV there are 91 total counts: 40 are unmethylated and 13 are methylated. At position 175 in the bed there are 15 counts (column 10) and 0 methylation (column 11).

So by default the bed file is more accurate? Is there a way to change the default settings so the bam more closely matches the bed?

marcus1487 commented 2 years ago

The IGV visualization of bam methylation is not necessarily the most accurate representation of methylation. I believe IGV has a threshold over which a site is shown as methylated, but I believe the default is ~20% probability. This means that a site with 20% probability of being methylated will be shown as methylated. The better use for this bam files is to consider the shading of each call (representing the confidence in each call).

For the bed file ambiguous calls (calls < 80% confident by default) are ignored. This results in a better estimate of the ground truth fraction modified at each reference site and generally a relatively low proportion of filtered sites (<15-20%). The megalodon option --mod-binary-threshold can be set to 0 to force all reads at a site to contribute to the fraction modified estimate. This will allow the bam and bed to agree more closely, but likely be less representative of the ground truth.

simpsonbp commented 2 years ago

Very helpful! Thanks!