nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Dorado 0.8.0 lost lots of reads after rebasecalling #1047

Closed SimonChen1997 closed 1 month ago

SimonChen1997 commented 1 month ago

Issue Report

Please describe the issue:

The output fastq is expected to contain over 500M bases, which was the case with Dorado 0.6.0. With Dorado 0.8.0, however, the largest fastq file contained only 2M bases.

Steps to reproduce the issue:

$dorado basecaller --recursive $model $input_scratch --modified-bases 6mA --kit-name SQK-NBD114-24 > $output_scratch_bam/ecoli_dna_exp_sta_6mA_sup.bam

$dorado demux --output-dir $output_scratch_demultiplex --kit-name SQK-NBD114-24 $output_scratch_bam/ecoli_dna_exp_sta_6mA_sup.bam

Run environment:

Logs

[2024-09-28 07:11:16.955] [info] Running: "basecaller" "--recursive" "/scratch/project/genoepic_rumen/dorado-0.8.0-linux-x64/data/dna_r10.4.1_e8.2_400bps_sup@v5.0.0" "/scratch/project/genoepic_rumen/ecoli_dna_methyl/pod5" "--modified-bases" "6mA" "--kit-name" "SQK-NBD114-24"
[2024-09-28 07:11:17.807] [warning] Unknown certs location for current distribution. If you hit download issues, use the envvar SSL_CERT_FILE to specify the location manually.
[2024-09-28 07:11:17.813] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v5.0.0_6mA@v2 with httplib
[2024-09-28 07:11:17.877] [error] Failed to download dna_r10.4.1_e8.2_400bps_sup@v5.0.0_6mA@v2: SSL server verification failed
[2024-09-28 07:11:17.877] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v5.0.0_6mA@v2 with curl
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
23 18.4M 23 4375k 0 0 71.6M 0 --:--:-- --:--:-- --:--:-- 71.2M
100 18.4M 100 18.4M 0 0 170M 0 --:--:-- --:--:-- --:--:-- 169M
[2024-09-28 07:11:18.226] [info] > Creating basecall pipeline
[2024-09-28 07:12:07.562] [warning] Unable to find chunk benchmarks for GPU "NVIDIA H100 PCIe", model /scratch/project/genoepic_rumen/dorado-0.8.0-linux-x64/data/dna_r10.4.1_e8.2_400bps_sup@v5.0.0 and chunk size 1728. Full benchmarking will run for this device, which may take some time.
[2024-09-28 07:12:07.562] [warning] Unable to find chunk benchmarks for GPU "NVIDIA H100 PCIe", model /scratch/project/genoepic_rumen/dorado-0.8.0-linux-x64/data/dna_r10.4.1_e8.2_400bps_sup@v5.0.0 and chunk size 1728. Full benchmarking will run for this device, which may take some time.
[2024-09-28 07:12:08.922] [info] cuda:0 using chunk size 12288, batch size 96
[2024-09-28 07:12:08.922] [info] cuda:1 using chunk size 12288, batch size 96
[2024-09-28 07:12:09.008] [info] cuda:0 using chunk size 6144, batch size 96
[2024-09-28 07:12:09.013] [info] cuda:1 using chunk size 6144, batch size 96
terminate called after throwing an instance of 'std::runtime_error'
what(): Empty sequence and qstring provided for read id 39d5fcd5-ac11-48f5-acea-169a2736a9f0
/var/spool/slurmd/job10990506/slurm_script: line 33: 2837796 Aborted (core dumped) $dorado basecaller --recursive $model $input_scratch --modified-bases 6mA --kit-name SQK-NBD114-24 > $output_scratch_bam/ecoli_dna_exp_sta_6mA_sup.bam
[2024-09-28 07:42:06.080] [info] Running: "demux" "--output-dir" "/scratch/project/genoepic_rumen/ecoli_dna_methyl_dorado_0_8/demultiplex_sup" "--kit-name" "SQK-NBD114-24" "/scratch/project/genoepic_rumen/ecoli_dna_methyl_dorado_0_8/bam_sup/ecoli_dna_exp_sta_6mA_sup.bam"
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[2024-09-28 07:42:06.119] [info] num input files: 1
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[2024-09-28 07:42:06.382] [info] > starting barcode demuxing

HalfPhoton commented 1 month ago

Hi @SimonChen1997, it looks like the original basecalling job crashed, which is why you have very little output.

terminate called after throwing an instance of 'std::runtime_error'
what(): Empty sequence and qstring provided for read id 39d5fcd5-ac11-48f5-acea-169a2736a9f0

It looks like you have a problematic read.
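When the log is long, the offending read ID can be pulled out of the error text rather than searched for by eye. A minimal sketch, assuming the job output was saved to a file; the function name `failing_read_ids` is hypothetical, and the pattern only matches the exact error wording shown above:

```shell
# Print each read ID named in an "Empty sequence and qstring" error, once.
# $1: path to a saved dorado/Slurm log file (assumed to exist).
failing_read_ids() {
    grep -oE 'Empty sequence and qstring provided for read id [0-9a-f-]+' "$1" \
        | awk '{ print $NF }' \
        | sort -u
}
```

Calling `failing_read_ids slurm.log` (the file name is only an example) prints nothing if the log contains no such error.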

The demux job is also telling you there's something wrong with the basecalling output:

[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
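That warning refers to the 28-byte empty BGZF block that a completely written BAM file ends with. A rough way to test for it before starting demux is sketched here; the function name `bam_has_eof_marker` is hypothetical, and `samtools quickcheck` performs a more thorough check if samtools is available:

```shell
# Succeed (exit 0) only if the file ends with the standard BGZF EOF block.
# $1: path to the BAM file written by the basecalling step.
bam_has_eof_marker() {
    # Hex dump of the 28-byte empty BGZF block defined in the SAM/BAM specification.
    expected_eof="1f8b08040000000000ff0600424302001b0003000000000000000000"
    # Dump the last 28 bytes of the file as one continuous hex string.
    actual_eof="$(tail -c 28 "$1" | od -An -v -tx1 | tr -d '[:space:]')"
    [ "$actual_eof" = "$expected_eof" ]
}
```

A job script could then run demux only when `bam_has_eof_marker "$bam"` succeeds, where `$bam` is whatever path the basecaller wrote to.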

Best regards, Rich

SimonChen1997 commented 1 month ago

The demux job is also telling you there's something wrong with the basecalling output

Hi,

Thanks for your reply. However, all the pod5 files can be rebasecalled successfully with Dorado 0.6.0. Could you explain why 0.8.0 fails on them?

Cheers, Ziming

malton-ont commented 1 month ago

This is presumably a variant on https://github.com/nanoporetech/dorado/issues/1020.

Also note: you are performing barcoding twice. You only need to specify --kit-name to either dorado basecaller or dorado demux, not both; your current command will lead to many unclassified reads because the barcodes are trimmed after the first step. Since you are seeing this error, I suggest dropping it from the basecaller command (and possibly adding --no-trim), then letting dorado demux handle the barcoding and trimming.
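The suggested ordering can be sketched as a small wrapper. This is only an illustration built from the flags already used in this thread: the function name `basecall_then_demux` and its argument order are hypothetical, and the `&&` is an added safeguard so that demux is skipped whenever basecalling exits with an error:

```shell
# $1: dorado executable   $2: model path        $3: pod5 input directory
# $4: output BAM path     $5: demux output directory
basecall_then_demux() {
    dorado="$1"; model="$2"; input_dir="$3"; bam="$4"; demux_dir="$5"
    # Basecall without --kit-name, and with --no-trim, so barcodes stay intact.
    "$dorado" basecaller --recursive "$model" "$input_dir" \
        --modified-bases 6mA --no-trim > "$bam" &&
    # Barcode classification and trimming happen once, in demux.
    "$dorado" demux --output-dir "$demux_dir" --kit-name SQK-NBD114-24 "$bam"
}
```

With the variables from the original report this would be called as `basecall_then_demux "$dorado" "$model" "$input_scratch" "$output_scratch_bam/ecoli_dna_exp_sta_6mA_sup.bam" "$output_scratch_demultiplex"`.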

SimonChen1997 commented 1 month ago

This is presumably a variant on #1020.

Also note: you are performing barcoding twice. You only need to specify --kit-name to either dorado basecaller or dorado demux, not both; your current command will lead to many unclassified reads because the barcodes are trimmed after the first step. Since you are seeing this error, I suggest dropping it from the basecaller command (and possibly adding --no-trim), then letting dorado demux handle the barcoding and trimming.

Hi,

Thanks. I tried --no-trim after I posted the issue, and it worked. However, the same command worked fine without the --no-trim flag in version 0.6.0. Anyway, thanks for your reply. 😊

malton-ont commented 1 month ago

This issue should be resolved in dorado 0.8.1, which has just been released.