nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Demultiplexing reads from Rapid Barcoding kit failed #987

Closed desmodus1984 closed 2 months ago

desmodus1984 commented 2 months ago

Hello,

I sequenced four samples with the Rapid Barcoding kit and, of course, generated very little data (3 GB); an extra run generated 120 MB. I basecalled the extra run using this command:

export OMP_NUM_THREADS=40
/home/juaguila/appz/dorado-0.7.3-linux-x64/bin/dorado basecaller --min-qscore 5 \
    --emit-fastq -x "cpu" --kit-name SQK-RBK114-24 \
    dna_r10.4.1_e8.2_400bps_sup@v4.3.0 /home/juaguila/Ju760-basecalling/RBK-xtra > RBK-xtra.fastq

DON'T INSIST - I DON'T HAVE ACCESS TO A GPU NODE, OR A GPU THAT IS COMPATIBLE WITH dorado.

I thought that --kit-name would classify reads by barcode and create a separate file for each barcode, as guppy used to do, but it didn't. I got a useless single file. Furthermore, I don't understand the BAM output when most software uses FASTQ files, so I used --emit-fastq. So now I have a problem: I have a single file, and I need to split it into my four samples.

My first option for demultiplexing was porechop, which is deprecated. I was shocked to see that it properly found the barcodes (01/06/07/21) but left many reads unclassified, so the "none" FASTQ was the biggest file and the per-barcode files were very small.

Then, after searching for a solution, I found that dorado can do it. So I mapped the reads to the C. elegans genome, converted the SAM to BAM, and got a useless output, despite many sequences carrying the tags of barcodes 01/06/07/21, which are the ones I used:

552794b3-6baa-4cd5-987d-a44e2051799a st:Z:2024-08-11T05:25:19.178+00:00 RG:Z:16b7818be41d06b1e531609ca967b751f1902912_dna_r10.4.1_e8.2_400bps_sup@v4.3.0SQK-RBK114-24barcode21
2a173ef9-c792-4e00-8822-7be2970d5c17 st:Z:2024-08-12T10:06:01.902+00:00 RG:Z:16b7818be41d06b1e531609ca967b751f1902912_dna_r10.4.1_e8.2_400bps_sup@v4.3.0
841e59d3-92a2-416a-b2e7-9014717256bc st:Z:2024-08-11T05:18:57.456+00:00 RG:Z:16b7818be41d06b1e531609ca967b751f1902912_dna_r10.4.1_e8.2_400bps_sup@v4.3.0SQK-RBK114-24barcode21
fdcd2624-e417-4946-ba2f-33064f64bef1 st:Z:2024-08-11T05:18:33.152+00:00 RG:Z:16b7818be41d06b1e531609ca967b751f1902912_dna_r10.4.1_e8.2_400bps_sup@v4.3.0
5c17f7dd-19f6-4d0c-bd4e-2c2d9f7de46b st:Z:2024-08-11T05:18:41.182+00:00 RG:Z:16b7818be41d06b1e531609ca967b751f1902912_dna_r10.4.1_e8.2_400bps_sup@v4.3.0SQK-RBK114-24barcode06
27db76c2-c344-4a1d-9486-5ccb5bd939bd st:Z:2024-08-11T05:18:52.532+00:00 RG:Z:16b7818be41d06b1e531609ca967b751f1902912_dna_r10.4.1_e8.2_400bps_sup@v4.3.0
ac6f37ee-9604-4af4-ae4a-8940411ab652 st:Z:2024-08-11T05:18:33.067+00:00 RG:Z:16b7818be41d06b1e531609ca967b751f1902912_dna_r10.4.1_e8.2_400bps_sup@v4.3.0
2f9236ed-8cae-4704-9c16-892c0b00e3fc st:Z:2024-08-11T05:19:10.533+00:00 RG:Z:16b7818be41d06b1e531609ca967b751f1902912_dna_r10.4.1_e8.2_400bps_sup@v4.3.0
aeaed141-84c9-4636-bbdb-7256684bda35 st:Z:2024-08-11T05:19:03.441+00:00 RG:Z:16b7818be41d06b1e531609ca967b751f1902912_dna_r10.4.1_e8.2_400bps_sup@v4.3.0SQK-RBK114-24barcode01
1c3546fa-99cf-4eb9-b9d9-83486707ea58 st:Z:2024-08-11T05:18:17.934+00:00 RG:Z:16b7818be41d06b1e531609ca967b751f1902912_dna_r10.4.1_e8.2_400bps_sup@v4.3.0SQK-RBK114-24barcode07
a3e1e263-b48c-4738-8eb3-e3d1505f1c51 st:Z:2024-08-11T05:19:11.925+00:00 RG:Z:16b7818be41d06b1e531609ca967b751f1902912_dna_r10.4.1_e8.2_400bps_sup@v4.3.0SQK-RBK114-24barcode06
ad0a0970-c995-4544-b0aa-563b5ddcb711 st:Z:2024-08-11T05:19:06.363+00:00 RG:Z:16b7818be41d06b1e531609ca967b751f1902912_dna_r10.4.1_e8.2_400bps_sup@v4.3.0SQK-RBK114-24barcode07
ca12dd1f-cdd3-485e-8900-fdd24c559c5c st:Z:2024-08-11T05:18:49.290+00:00 RG:Z:16b7818be41d06b1e531609ca967b751f1902912_dna_r10.4.1_e8.2_400bps_sup@v4.3.0SQK-RBK114-24barcode07
4f437f9c-1eaf-40c0-83b0-aa8ab250e718 st:Z:2024-08-11T05:19:24.661+00:00 RG:Z:16b7818be41d06b1e531609ca967b751f1902912_dna_r10.4.1_e8.2_400bps_sup@v4.3.0SQK-RBK114-24barcode21

I only got a single file, "unclassified.bam".

The command I used was this:

dorado-0.7.3-linux-x64/bin/dorado demux --output-dir RBK-xtra --no-classify RBK-xtra.bam

What is surprising and embarrassing is that porechop, which is deprecated, was able to properly identify the barcodes and demultiplex the FASTQ file.

Is there any way to demultiplex my reads into the barcodes that I actually used?
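
For reference, the barcode assignments visible in the RG tags quoted in this post can be tallied with a small shell helper. This is only a sketch: the function name `count_barcodes` is hypothetical, and it assumes the barcode, when present, is the trailing `barcodeNN` of the RG value, exactly as it appears in the listing. Lines whose RG value has no such suffix are counted as "unclassified".

```shell
# Tally reads per barcode from text lines that contain an RG:Z: tag (read on stdin).
# Assumption: a classified read has an RG value ending in "barcodeNN";
# anything else is reported as "unclassified".
count_barcodes() {
    awk '
    NF == 0 { next }                      # skip blank lines
    {
        bc = "unclassified"
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^RG:Z:/ && match($i, /barcode[0-9]+$/)) {
                bc = substr($i, RSTART, RLENGTH)
            }
        }
        n[bc]++
    }
    END { for (b in n) print b "\t" n[b] }' | sort
}
```

Assuming samtools is installed, something like `samtools view RBK-xtra.bam | count_barcodes` would print one line per barcode with its read count.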

malton-ont commented 2 months ago

Hi @desmodus1984,

dorado demux --no-classify does not attempt to classify the barcodes; it simply re-uses the barcode classifications from the input data. Since fastq files do not store the barcode information, they cannot be directly demultiplexed in this manner. Assuming you want fastq files at the end, your options are to either:

  1. Basecall+barcode to BAM format, then demux and output to fastq
    dorado basecaller --kit-name SQK-RBK114-24 dna_r10.4.1_e8.2_400bps_sup@v4.3.0 RBK-xtra > RBK-xtra.bam
    dorado demux --output-dir RBK-xtra --no-classify --emit-fastq RBK-xtra.bam 
  2. Basecall without barcoding, then classify and demux the resulting file in a separate step
    dorado basecaller --no-trim dna_r10.4.1_e8.2_400bps_sup@v4.3.0 RBK-xtra > RBK-xtra.bam
    dorado demux --output-dir RBK-xtra-demuxed --kit-name SQK-RBK114-24 --emit-fastq RBK-xtra.bam

    Note the --no-trim option during basecalling in option 2! This prevents adapter trimming from interfering with the classification in the next step.
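
Once either option has produced per-barcode fastq files, a quick sanity check is to count the reads in each one. The helper below is a minimal sketch, not part of dorado: the name `count_fastq_reads` is hypothetical, and it assumes standard four-line fastq records and files ending in `.fastq` inside the demux output directory.

```shell
# Print "<file name>\t<read count>" for every .fastq file in a directory.
# Assumption: each read occupies exactly four lines (no wrapped sequences).
count_fastq_reads() {
    dir="$1"
    for f in "$dir"/*.fastq; do
        [ -e "$f" ] || continue           # directory has no .fastq files
        lines=$(wc -l < "$f")
        printf '%s\t%d\n' "$(basename "$f")" "$((lines / 4))"
    done
}
```

For example, `count_fastq_reads RBK-xtra-demuxed` would list each demultiplexed file next to its read count, which makes it easy to see how many reads ended up unclassified.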

desmodus1984 commented 2 months ago

As usual, it failed again. I am using my Windows laptop, which has an NVIDIA GeForce RTX 4060, to do the basecalling. It worked fine with dorado 0.6, but with 0.7.3 it failed. I ran:

dorado.exe basecaller --kit-name SQK-RBK114-24 sup C:\Data\RBK-xtra > RBK-xtra.bam

[2024-08-14 13:31:21.051] [info] Running: "basecaller" "--kit-name" "SQK-RBK114-24" "sup" "C:\Data\RBK-xtra"
[2024-08-14 13:31:21.231] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v5.0.0 with httplib
[2024-08-14 13:32:43.267] [info] > Creating basecall pipeline
[2024-08-14 13:32:49.097] [info] cuda:0 using chunk size 12288, batch size 64
[2024-08-14 13:32:49.987] [info] cuda:0 using chunk size 6144, batch size 128
[2024-08-14 13:33:39.805] [warning] Caught Torch error 'CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
', clearing CUDA cache and retrying.

And it failed. That is the only GPU I have access to, and it should be compatible, yet now I can't basecall a ridiculously small dataset (120 MB).

malton-ont commented 2 months ago

@desmodus1984,

Dorado 0.7.3 uses the v5 sup models, which have a different architecture to the v4.3 one in 0.6.0. Please try reducing the batch size by adding -b 32 to your command.
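
If -b 32 is still too large for the GPU, the same idea can be tried with progressively smaller batch sizes. The wrapper below is a rough sketch only: the function name `basecall_with_fallback` and the `DORADO` variable are hypothetical, and it assumes dorado exits with a non-zero status when basecalling fails, which the thread does not confirm.

```shell
# Path to the dorado executable (placeholder; override as needed).
DORADO="${DORADO:-dorado}"

# Try each batch size in turn until one basecalling run succeeds.
# Usage: basecall_with_fallback <input_dir> <out_bam> <batch sizes...>
# Assumption: a failed run returns a non-zero exit status.
basecall_with_fallback() {
    input_dir="$1"; out_bam="$2"; shift 2
    for b in "$@"; do
        echo "Trying batch size $b" >&2
        if "$DORADO" basecaller -b "$b" --kit-name SQK-RBK114-24 sup "$input_dir" > "$out_bam"; then
            echo "$b"                     # report the batch size that worked
            return 0
        fi
    done
    echo "All batch sizes failed" >&2
    return 1
}
```

A call such as `basecall_with_fallback RBK-xtra RBK-xtra.bam 32 16 8` would stop at the first batch size that completes and print it. Each failed attempt overwrites the output file, so only the successful run's BAM is kept.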