nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

barcoding_summary.txt issue #872

Open SalvadorGJ opened 3 weeks ago

SalvadorGJ commented 3 weeks ago

Hello,

I found an issue where the barcoding_summary.txt generated by dorado demux is incorrect when the --no-classify flag is used. The barcoding was already done in the dorado basecaller step, which is why I don't want to repeat it in demux. Although the reads appear to be split correctly into separate per-barcode files, barcoding_summary.txt tags every read as unclassified. I'm attaching the first part of the summary, which also shows that the file name is not reported (in my case the input is a single BAM file output by dorado basecaller):

filename    read_id barcode
    ce4036f3-848b-4eff-b1ed-758ff1ac9ece    unclassified
    91f29081-24dd-4b2c-a9f7-6ddc7f8c8197    unclassified
    00b321c2-1b31-4b4d-86df-18339897cf85    unclassified
    9a518706-b88d-49b4-8287-db02dc3daf91    unclassified
    299e0e6d-167d-4e6a-903d-c2ac05c97460    unclassified
    b5b75cfb-02e5-46f5-84d6-dbb439d705de    unclassified
    24d56d8e-e04b-4bb8-97f9-6f55cf0ca080    unclassified
    22d929d5-332b-4ac2-b91a-d5ead40c552c    unclassified
    cede8bc3-7b6e-4bcf-b65a-69c76e277849    unclassified
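
To check whether this holds for the whole file and not just the first rows, the barcode column can be tallied. This is only a minimal sketch: it assumes the summary is tab-separated with a header row and the barcode in the last column (as the header above suggests), and `tally_barcodes` is a hypothetical helper name, not part of dorado.

```shell
# Count reads per barcode label in a demux summary file.
# Assumption: tab-separated, one header row, barcode in the last column.
tally_barcodes() {
    awk -F'\t' 'NR > 1 { count[$NF]++ }
                END { for (b in count) print b "\t" count[b] }' "$1" | LC_ALL=C sort
}
```

Running `tally_barcodes barcoding_summary.txt` prints one line per label. If `unclassified` is the only label while the output directory contains per-barcode files, the summary and the split disagree.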

Thanks a lot!

Salvador

Run environment:

dorado basecaller -r --kit-name SQK-NBD114-96 dna_r10.4.1_e8.2_400bps_sup@v5.0.0 chunk_pod5/ > FC-01.basecalls.bam
dorado demux -t 4 -o dorado_0.7.0_SQK-NBD114-96_demultiplexed/ --no-classify --emit-summary --emit-fastq FC-01.basecalls.bam

Here is the output of the basecaller: https://www.dropbox.com/scl/fi/d7tog6e9v685y6l4nmvgf/FC-01.basecalls.bam?rlkey=33gyxmciwba0je5letzlupobx&st=b8c2sloc&dl=0

Let me know if you need the pod5 files to reproduce it from the basecaller step.

Logs

[2024-05-28 01:54:21.048] [info] Running: "basecaller" "-r" "--kit-name" "SQK-NBD114-96" "dna_r10.4.1_e8.2_400bps_sup@v5.0.0" "chunk_pod5/"
[2024-05-28 01:54:21.190] [info] > Creating basecall pipeline
[2024-05-28 01:54:29.727] [info] cuda:0 using chunk size 12288, batch size 416
[2024-05-28 01:54:30.350] [info] cuda:0 using chunk size 6144, batch size 512
[2024-05-28 03:04:06.735] [info] > Simplex reads basecalled: 1362141
[2024-05-28 03:04:06.751] [info] > Simplex reads filtered: 51
[2024-05-28 03:04:06.751] [info] > Basecalled @ Samples/s: 5.357987e+06
[2024-05-28 03:04:06.751] [info] > 1379293 reads demuxed @ classifications/s: 3.303233e+02
[2024-05-28 03:04:07.556] [info] > Finished
[2024-05-28 03:04:39.313] [info] Running: "demux" "-t" "4" "-o" "dorado_0.7.0_SQK-NBD114-96_demultiplexed/" "--no-classify" "--emit-summary" "--emit-fastq" "FC-01.basecalls.bam"
[2024-05-28 03:04:39.318] [info] num input files: 1
[2024-05-28 03:04:39.321] [info] > starting barcode demuxing
[2024-05-28 03:05:04.280] [info] > Simplex reads basecalled: 1379293
[2024-05-28 03:05:04.280] [info] > finished barcode demuxing
[2024-05-28 03:05:04.280] [info] > generating summary file
[2024-05-28 03:05:11.188] [info] > summary file complete.

malton-ont commented 2 weeks ago

Hi @SalvadorGJ,

Thanks for raising this issue. I'm able to reproduce it with the data you provided. It appears to be related to the --emit-fastq flag: when the output is written in BAM format there is no issue. We'll try to get a fix out in a future version of dorado. In the meantime, you could drop the --emit-fastq flag to generate BAM files instead, and then use samtools to convert these to FASTQ if you require that format.
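
A rough sketch of that conversion step, assuming samtools is on the PATH and that demux has written its per-barcode .bam files directly into the output directory. The `bam_dir_to_fastq` helper name is hypothetical and only for illustration.

```shell
# Convert every BAM file in a demux output directory to FASTQ.
# Assumptions: samtools is installed, and the directory holds the
# per-barcode .bam files written by demux without --emit-fastq.
bam_dir_to_fastq() {
    dir=$1
    for bam in "$dir"/*.bam; do
        [ -e "$bam" ] || continue   # skip if the glob matched nothing
        # Write the FASTQ next to the BAM, reusing its base name.
        samtools fastq "$bam" > "${bam%.bam}.fastq"
    done
}
```

For the run above, that would mean calling demux with the same arguments minus --emit-fastq, then running `bam_dir_to_fastq dorado_0.7.0_SQK-NBD114-96_demultiplexed`.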