nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/
Other
488 stars 59 forks source link

Modified basecalls seem to aggregate at the start of each chromosome #305

Closed SophieValk closed 5 months ago

SophieValk commented 1 year ago

Hi all,

I have been using megalodon so far for calling 5mC modified bases. Since the new release of Dorado I have been exited to try and compare it to my earlier called modification results. As I have data from an R10.4 flowcell in 4khz I have first converted the fast5 files to pod5 files, after which I used the following commands for calling 5mC:

dorado basecaller  --recursive  dna_r10.4.1_e8.2_400bps_sup@v4.1.0 pod5_pass/ --modified-bases-models dna_r10.4.1_e8.2_400bps_sup@v4.1.0_5mCG_5hmCG@v2  > data.unmapped.bam

After dorado basecalling I have mapped the data.unmapped.bam reads to the reference genome, sorted and indexed using samtools 1.17. After this I converted the mapped_sorted.bam file to bed files with bam2bed. I created three bed files, one containing all modified basecalls, one containing all modified bases with a percentage >80% (hypermethylated), and one containing all modified bases with a percentage <20% (hypomethylated).

Now, when I visualize these .bed files in IGV and compare them with the Megalodon modified.bed files output, the dorado modified bases seem to aggregate at the start of each chromosome.

Screenshot_Coverage_ModifiedbasecallesDoradoandMegalodon_21-07-2023 The first track shows the Megalodon modified basecalls, the second track shows the hypermethylated dorado basecalls, the third track shows the hypomethylated dorado basecalls, and the fourth track shows all modified basecalls from dorado.

In the picture below you can see the content of one such hypermethylated file:

Content_dorado_hyptermethylated_5mc_cpg_bedfile

Maybe there is someone that can tell me what I am doing wrong? I cannot figure it out.

All the best, Sophie

marcus1487 commented 1 year ago

It would be good to understand the initial results first before getting into the reads separated by methylation status. It seems that the BED file "containing all modified basecalls" is just all reads? Or is there some filtering here? Or conversion to extract positions from reads? Or something else? Have you visualized just the BAM files themselves to remove any issues before the bed file conversion and methylation fraction?

Dorado now has reference mapping built in. It would be good to test this out to simplify your workflow.

marcus1487 commented 1 year ago

I would also note here that the basecalls and mappings with Remora models (built into Dorado) do not change the canonical bases at all. So the mappings you are getting are not dependent upon the modified base calls. This is a divergence from Megalodon where modified base models were incorporated into the basecaller and thus changed the canonical basecalls for a read. Thus (unless I am misunderstanding your analysis) it seems that your issue is related to the canonical basecalling and mapping and not the modified base calls. Or at least that these should be confirmed before we investigate the modified base observations you've made.

SophieValk commented 1 year ago

Dear @marcus1487, thanks for your reply! It indeed makes sense to first understand the initial results before moving on to selecting on methylation status. I have tried visualizing the .bam files right after mapping with minimap2, however, this file shows no data in IGV. I am now trying calling only the canonical bases with dorado (using also the builtin alignment tool) to see how this file aligns in IGV. I will get back with the results as soon as possible.

SophieValk commented 1 year ago

Hi, I seem to have some problems with visualizing .bam files in IGV. I tried their online version and the dorado output file (mapped with builtin minimap2) shows alignments. I will try to figure out what problems I have with IGV, maybe I need to try to recover some older versions (am now using 2.16.2 for Linux). I will also try to figure out what happens to the modified bases now that the canonical basecalling seems right.

Screenshot_IGVOnline_24-07-2023