Issue Report

Please describe the issue:

I have samples basecalled with dorado (v0.4.1) with 5mCG_5hmCG modified-base detection enabled, using one of the following models:

- dna_r10.4.1_e8.2_400bps_hac@v4.2.0
- dna_r9.4.1_e8_hac@v3.3

...depending on the pore chemistry/flow cells used to sequence the data.

I've noticed that the two "batches" have distinct distributions of base modification probabilities when inspected with modkit sample-probs. In particular, the R9.4.1 samples have a massive C:m peak at the far right of the histogram (second plot). This difference appears in all samples sequenced and basecalled with the two respective pore chemistries and basecalling models.
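As a minimal sketch, the size of the peak could be expressed as the fraction of C:m calls that fall in the highest-probability bins, which makes the two batches easier to compare than by eye. The helper `top_bin_fraction` below is hypothetical, and it assumes a headerless, tab-separated file with three columns (modification code, bin start, count). That layout is an assumption for illustration and is not modkit's actual output format, so real output would need reshaping first.

```shell
# Hypothetical helper: fraction of calls for one modification code whose
# probability bin starts at or above a threshold.
# Assumed input (NOT modkit's real format): headerless TSV with columns
#   1: modification code (e.g. m or h)
#   2: bin start (probability, 0 to 1)
#   3: count of calls in that bin
# Usage: top_bin_fraction <file.tsv> <mod_code> <threshold>
top_bin_fraction() {
    awk -F'\t' -v code="$2" -v thr="$3" '
        # Accumulate the total and the above-threshold counts for this code.
        $1 == code {
            total += $3
            if ($2 + 0 >= thr + 0) top += $3
        }
        END {
            # Print NA when the code does not appear in the file at all.
            if (total == 0) { print "NA"; exit }
            printf "%.3f\n", top / total
        }
    ' "$1"
}
```

Calling `top_bin_fraction sample1_hist.tsv m 0.95` on one file per batch (the file name is a placeholder) would give a single number per sample. A value much closer to 1 for the R9.4.1 samples than for the R10.4.1 samples would correspond to the peak described above.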

I'm wondering what the explanation for this would be. Is this expected with the difference in pore chemistries / models used for basecalling? The large peak in the R9.4.1 samples looks like some sort of artefact - I can't think of a biological explanation and this is not what I'd expect in such samples (human cancer samples).

[Images: two modkit sample-probs histograms of counts; the second plot shows the R9.4.1 samples]

Run environment:

Dorado version: 0.4.1

Dorado command:

dorado basecaller \
dna_r9.4.1_e8_hac@v3.3 \
sample1_converted.pod5 --modified-bases 5mCG_5hmCG --recursive > sample1_mod.bam


dorado basecaller \
dna_r10.4.1_e8.2_400bps_hac@v4.2.0 \
path/to/sample2 --modified-bases 5mCG_5hmCG --recursive > sample2_mod.bam


- Operating system: 
Scientific Linux 7
- Hardware (CPUs, Memory, GPUs): 
Basecalling was performed using 4 NVIDIA A100 GPUs and a maximum of 256 GB system RAM
- Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): 
R9.4.1 data was converted to a single .pod5 file before basecalling. R10.4.1 data was basecalled directly from .pod5 files.
- Source data location (on device or networked drive - NFS, etc.): 
Network file storage
- Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): 
Data was prepared using SQK-LSK110 and sequenced using a PromethION flow cell (R9.4.1),
or prepared using SQK-LSK114 and sequenced using a PromethION flow cell (R10.4.1).
Human cancer data, ranging from 25 Gb to 180 Gb per sample


Large peak of methylated cytosine output from dorado modified base calls in R9.4.1 samples #1041
