nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Large peak of methylated cytosine output from dorado modified base calls in R9.4.1 samples #1041

Closed by eesiribloom 1 month ago

eesiribloom commented 1 month ago

Issue Report

Please describe the issue:

I have samples basecalled with dorado software (v0.4.1) with detection of 5mCG_5hmCG modifications enabled using either:

- dna_r10.4.1_e8.2_400bps_hac@v4.2.0
- dna_r9.4.1_e8_hac@v3.3

...depending on the pore chemistry/flow cells used to sequence the data.

I've noticed the two "batches" have distinct distributions of base modification probabilities in the modkit sample-probs output. In particular, the R9.4.1 samples have a massive peak of C:m at the far right of the histogram (second plot). The difference shows up in every sample sequenced and basecalled with the respective pore chemistry and basecalling model.

I'm wondering what the explanation for this would be. Is this expected with the difference in pore chemistries / models used for basecalling? The large peak in the R9.4.1 samples looks like some sort of artefact - I can't think of a biological explanation and this is not what I'd expect in such samples (human cancer samples).
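One way to put a number on the peak, rather than judging it by eye from the plots, is to bin the per-base modification probabilities and report the share of calls that land in the top bin. The sketch below is only an illustration and is not how modkit computes its histograms. It assumes you already have the raw 8-bit `ML` tag values for the C:m calls as plain Python integers (extracting them from the BAM is not shown), and it uses the SAM tag convention that an `ML` value `n` covers the probability range `n/256` to `(n+1)/256`. The function names `ml_to_prob`, `histogram` and `top_bin_fraction` are hypothetical.

```python
def ml_to_prob(ml_value: int) -> float:
    """Map an 8-bit ML value (0-255) to the midpoint of its probability bucket."""
    if not 0 <= ml_value <= 255:
        raise ValueError(f"ML value out of range: {ml_value}")
    return (ml_value + 0.5) / 256


def histogram(ml_values, n_bins: int = 10):
    """Count calls per equal-width probability bin over [0, 1)."""
    counts = [0] * n_bins
    for value in ml_values:
        prob = ml_to_prob(value)
        # prob is always below 1.0, but clamp defensively to the last bin.
        index = min(int(prob * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


def top_bin_fraction(ml_values, n_bins: int = 10) -> float:
    """Fraction of calls falling in the highest-probability bin."""
    counts = histogram(ml_values, n_bins)
    total = sum(counts)
    return counts[-1] / total if total else 0.0


if __name__ == "__main__":
    # Made-up ML values purely for illustration, not real data.
    example = [0, 128, 255, 255]
    print(histogram(example, n_bins=4))
    print(top_bin_fraction(example, n_bins=4))
```

Running the same calculation on the C:m values from an R9.4.1 sample and an R10.4.1 sample would give two fractions that can be compared directly across the batches.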

[Images: two modkit sample-probs histograms, "Counts (1)" and "Counts"; the second plot shows the R9.4.1 samples]

Run environment:


- Operating system: 
Scientific Linux 7
- Hardware (CPUs, Memory, GPUs): 
Basecalling was performed using 4 NVIDIA A100 GPUs and max 256 GB system RAM 
- Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): 
R9.4.1 data was converted to a single .pod5 file before basecalling. R10.4.1 was basecalled directly from .pod5 files
- Source data location (on device or networked drive - NFS, etc.): 
Network file storage
- Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): 
Data was prepared using SQK-LSK110 and sequenced using a PromethION flow cell (R9.4)
OR prepared using SQK-LSK114 and sequenced using a PromethION flow cell (R10.4.1)
Human cancer data ranging from 25 Gb to 180 Gb per sample

HalfPhoton commented 1 month ago

Hi @eesiribloom, Closing this ticket as it's a duplicate of your ticket in ModKit@259 which has been picked up by the mods team.

Kind regards, Rich