alexyfyf commented 1 week ago

Issue Report

Please describe the issue:

Dorado basecaller identified the barcodes without trimming them, but the subsequent demux step did not trim the reads either. Is this the expected behaviour? Is it possible to keep the full sequence during basecalling, but remove the barcodes and adapters in demux?

Steps to reproduce the issue:

I ran dorado 0.7.0 to basecall and demux pod5 files. I used the following command for basecalling:

dorado basecaller sup $pod5 --kit-name SQK-NBD114-24 --no-trim > ${dir}/basecalled_reads.bam

The BAM file contains the basecalled reads with Nanopore adapter and barcode information. Then I ran demux:

dorado demux --threads 16 --output-dir ${dir}/demux --no-classify  ${dir}/basecalled_reads.bam

This time I did not pass --no-trim, so I assumed the barcodes and primers would be removed, but the reads are exactly the same as in the previous BAM; essentially, demux just split them into separate files.

Run environment:

Dorado version: 0.7.0
Dorado command: see above
Operating system: Linux 3.10.0-1160.99.1.el7.x86_64 x86_64
Hardware (CPUs, Memory, GPUs):
Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): pod5
Source data location (on device or networked drive - NFS, etc.): on device
Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): amplicon data with a custom PCR primer instead of a polyT primer, otherwise the same read structure.
Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):

Logs

Basecall logs

[2024-06-10 14:17:57.810] [info] Running: "basecaller" "sup" "/vast/scratch/users/yan.a/vast_scratch/20240326_stVincent_Max_MitocDNAPool/MitocDNA/20240326_0408_2G_PAW21555_6af78782/merged.pod5" "--kit-name" "SQK-NBD114-24" "--no-trim"
[2024-06-10 14:17:57.925] [info] Assuming cert location is /etc/ssl/certs/ca-bundle.crt
[2024-06-10 14:17:57.929] [info]  - downloading dna_r10.4.1_e8.2_400bps_sup@v5.0.0 with httplib
[2024-06-10 14:18:05.855] [info] > Creating basecall pipeline
[2024-06-10 14:18:17.850] [info] cuda:1 using chunk size 12288, batch size 288
[2024-06-10 14:18:17.850] [info] cuda:3 using chunk size 12288, batch size 256
[2024-06-10 14:18:17.850] [info] cuda:0 using chunk size 12288, batch size 288
[2024-06-10 14:18:17.850] [info] cuda:2 using chunk size 12288, batch size 288
[2024-06-10 14:18:18.636] [info] cuda:1 using chunk size 6144, batch size 288
[2024-06-10 14:18:18.639] [info] cuda:2 using chunk size 6144, batch size 512
[2024-06-10 14:18:18.639] [info] cuda:0 using chunk size 6144, batch size 448
[2024-06-10 14:18:18.639] [info] cuda:3 using chunk size 6144, batch size 512
[2024-06-12 02:49:19.138] [info] > Simplex reads basecalled: 64370266
[2024-06-12 02:49:19.139] [info] > Simplex reads filtered: 2266
[2024-06-12 02:49:19.139] [info] > Basecalled @ Samples/s: 1.326986e+07
[2024-06-12 02:49:19.139] [info] > 66273811 reads demuxed @ classifications/s: 5.041388e+02
[2024-06-12 02:49:27.816] [info] > Finished
[2024-06-12 02:49:30.074] [info] Running: "summary" "/vast/scratch/users/yan.a/vast_scratch/20240326_stVincent_Max_MitocDNAPool/MitocDNA/20240326_0408_2G_PAW21555_6af78782/basecalled_reads.bam"

malton-ont commented 1 week ago

Hi @alexyfyf,

Yes, this is expected behaviour. demux does not perform trimming unless it is also classifying. If you want to have untrimmed reads after basecalling, you will need to run the basecaller without --kit-name, and then classify during demux:

dorado basecaller sup $pod5 --no-trim > ${dir}/basecalled_reads.bam
dorado demux --threads 16 --output-dir ${dir}/demux  --kit-name SQK-NBD114-24  ${dir}/basecalled_reads.bam
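
One way to check whether trimming actually happened is to compare the total number of bases before and after demux. The following is only a sketch: `total_bases` is a hypothetical helper (not part of dorado) that sums the sequence lengths in SAM text read from stdin, and the usage lines assume samtools is installed and that the demux output directory contains one BAM per barcode.

```shell
# Hypothetical helper: sum the lengths of the SEQ field (column 10)
# over all non-header records of SAM text read from stdin.
total_bases() {
  awk -F'\t' '!/^@/ { total += length($10) } END { print total + 0 }'
}

# Usage (assumes samtools is on PATH; paths follow the commands above):
#   samtools view "${dir}/basecalled_reads.bam" | total_bases
#   for f in "${dir}"/demux/*.bam; do samtools view "$f"; done | total_bases
```

If the second total is noticeably smaller than the first, the reads were trimmed during demux; identical totals suggest the reads were only split into separate files.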

alexyfyf commented 1 week ago

@malton-ont thank you for your reply. I'm still a bit confused. Does your basecall command generate the same file as mine (perhaps differing only in some tags specifying barcode information, with identical read sequences)? One more question: from what I found in your GitHub issues, it seems that classifying in demux usually generates more usable reads. Is that what you observe as well?

Cheers,

malton-ont commented 1 week ago

@alexyfyf,

Yes, the only difference after the basecaller command would be whether the BC tags are present. The RG tags will also be more detailed and specific if barcoding is performed during basecalling: when barcoding during basecalling we can create read groups for the individual barcodes, while demux does not update the read tags.

There should be no real difference between the two methods regarding the sequences or other tags.
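
To see the tag difference described above, one could count how many records carry a BC tag in each BAM. This is a minimal sketch: `count_bc_tagged` is a hypothetical helper (not part of dorado) that reads SAM text from stdin, and the usage line assumes samtools is installed.

```shell
# Hypothetical helper: count non-header SAM records that have a BC:Z: tag.
# Optional tags start at column 12 in SAM.
count_bc_tagged() {
  awk -F'\t' '!/^@/ {
    for (i = 12; i <= NF; i++) if ($i ~ /^BC:Z:/) { n++; break }
  } END { print n + 0 }'
}

# Usage (assumes samtools is on PATH):
#   samtools view "${dir}/basecalled_reads.bam" | count_bc_tagged
```

A BAM basecalled without --kit-name would be expected to report 0 here, while one basecalled with --kit-name should report a count for the classified reads.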

ireneortega commented 2 days ago

Hi @alexyfyf,

Yes, this is expected behaviour. demux does not perform trimming unless it is also classifying. If you want to have untrimmed reads after basecalling, you will need to run the basecaller without --kit-name, and then classify during demux:
dorado basecaller sup $pod5 --no-trim > ${dir}/basecalled_reads.bam
dorado demux --threads 16 --output-dir ${dir}/demux  --kit-name SQK-NBD114-24  ${dir}/basecalled_reads.bam

In the manual, in the section Barcode Classification > Classifying existing datasets, it says: "As with the in-line mode, --no-trim and --barcode-both-ends are also available as additional options." Does this mean that dorado demux performs trimming of barcodes, adapters and primers by default? I am confused by your comment: "demux does not perform trimming unless it is also classifying".

nanoporetech / dorado

Dorado basecall identified barcode but not trimmed in demux #898
