Segmentation fault (core dumped) in basecalling, alignment, and barcode tagging simultaneously

alimelhakim commented 4 months ago

Dear Nanopore team, I ran rebasecalling, alignment, and barcode tagging simultaneously with the following command on a PC PromethION 24 and encountered a segmentation fault (core dumped) at 51%.

dorado basecaller sup@v5.0.0,5mC_5hmC@v1.0.0,6mA@v1.0.0 --mm2-preset map-ont --reference /home/0.0reference/hg38.mmi --kit-name SQK-NBD114-24 /home/10.140524_LibraryD/ > /rebasecalled/10.140524_LibraryD/libD-full-barcoded.bam

[2024-07-01 17:59:09.170] [info] Running: "basecaller" "sup@v5.0.0,5mC_5hmC@v1.0.0,6mA@v1.0.0" "--mm2-preset" "map-ont" "--reference" "/home/0.0reference/hg38.mmi" "--kit-name" "SQK-NBD114-24" "/home/10.140524_LibraryD/"
[2024-07-01 17:59:09.728] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v5.0.0_5mC_5hmC@v1 with httplib
[2024-07-01 17:59:13.113] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v5.0.0_6mA@v1 with httplib
[2024-07-01 17:59:16.229] [info] > Creating basecall pipeline
[2024-07-01 17:59:27.259] [info] cuda:3 using chunk size 12288, batch size 512
[2024-07-01 17:59:27.261] [info] cuda:2 using chunk size 12288, batch size 480
[2024-07-01 17:59:27.268] [info] cuda:1 using chunk size 12288, batch size 512
[2024-07-01 17:59:27.268] [info] cuda:0 using chunk size 12288, batch size 512
[2024-07-01 17:59:28.005] [info] cuda:3 using chunk size 6144, batch size 512
[2024-07-01 17:59:28.023] [info] cuda:0 using chunk size 6144, batch size 512
[2024-07-01 17:59:28.025] [info] cuda:2 using chunk size 6144, batch size 480
[2024-07-01 17:59:28.043] [info] cuda:1 using chunk size 6144, batch size 512
Segmentation fault (core dumped) 51% [12h:30m:55s<11h:55m:05s] Basecalling

The Library D folder consists of 72 POD5 files, totaling 800 GB. I have tried to split them into four batches (200 GB each), and 2 out of 4 batches were terminated while the others finished successfully. I used Dorado 0.7.0+71cc744 and the model dna_r10.4.1_e8.2_400bps_sup@v5.0.0.

Do you know what the issue might be?

Best regards, Alim

HalfPhoton commented 4 months ago

Hi @alimelhakim,

Can you share some information about your system as outlined in the issue template?

We have seen a number of reports now where v5 models combined with multiple mods models cause problems. We're actively working on this to improve the situation. I would expect that problem to manifest as a CUDA error though, so the segmentation fault here could instead be caused by excessive memory consumption.

Are you able to run dorado on smaller subsets of your data to potentially locate a problematic read?
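One way to narrow this down is to basecall each POD5 file separately and record which ones fail. The following is only a sketch: `find_failing_files`, `basecall_one`, and the `BASECALL` variable are hypothetical names, not part of dorado, and the wrapper uses a simplified command (the `sup@v5.0.0` model from the report, without the mods models, alignment, or barcoding options), so adjust it to match the run being debugged.

```shell
# Assumed wrapper: basecall one input file ($1) into one output BAM ($2).
# Simplified on purpose; add the original options back as needed.
basecall_one() {
  dorado basecaller sup@v5.0.0 "$1" > "$2"
}

# Hypothetical helper: run the wrapper once per .pod5 file in $1, write
# outputs and per-file stderr logs to $2, and print the files that failed.
find_failing_files() {
  input_dir="$1"
  out_dir="$2"
  basecall="${BASECALL:-basecall_one}"
  mkdir -p "$out_dir"
  for f in "$input_dir"/*.pod5; do
    [ -e "$f" ] || continue            # directory contains no .pod5 files
    name=$(basename "$f" .pod5)
    # A non-zero exit status (a segfault usually reports 139) marks a failure.
    if ! "$basecall" "$f" "$out_dir/$name.bam" 2> "$out_dir/$name.log"; then
      echo "$f"
    fi
  done
}
```

For example, `find_failing_files /home/10.140524_LibraryD/ per_file_out > failed.txt` would leave a list of the files that did not finish in `failed.txt`, with the matching stderr logs next to the per-file outputs.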

And are you able to use --resume-from to make further progress?

WARNING: Do not reuse the filenames for --resume-from and the new output. If they are the same then the interrupted file will be deleted when dorado is launched and the previous work will be lost.
# WARNING: This will overwrite the existing `resume.bam` file before it is used.
dorado basecaller hac pod5/ --resume-from resume.bam > resume.bam
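For contrast, a safer pattern keeps the interrupted file and the new output under different names. The sketch below assumes the interrupted output is called `incomplete.bam` and the new output `calls.bam`; both names, and the helper functions `check_resume_names` and `resume_basecalling`, are illustrative and not part of dorado.

```shell
# Hypothetical helper (not part of dorado): succeed only when the
# --resume-from input ($1) and the new output ($2) are different paths.
check_resume_names() {
  if [ "$1" = "$2" ]; then
    echo "refusing to resume: '$1' is both the --resume-from input and the output" >&2
    return 1
  fi
}

# Assumed file names: incomplete.bam is the interrupted run, calls.bam the new output.
# The redirection belongs to the dorado command only, so calls.bam is not
# touched when the check fails.
resume_basecalling() {
  check_resume_names incomplete.bam calls.bam &&
    dorado basecaller hac pod5/ --resume-from incomplete.bam > calls.bam
}
```

The check is a plain string comparison, so it will not notice two different spellings of the same path (for example `resume.bam` and `./resume.bam`).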

Kind regards, Rich

malton-ont commented 4 months ago

Possibly related to https://github.com/nanoporetech/dorado/issues/860. Can you try with dorado-0.7.2?

alimelhakim commented 4 months ago

Thanks a lot, @HalfPhoton and @malton-ont, for your support. I apologize for the delayed response.

I attempted partial basecalling of each pod5 file. Some pod5 files resulted in a Segmentation fault (core dumped) during basecalling. Here is the log file for one of the failed basecalling attempts: log.dorado.0.7.0_31.txt

Run environment:

Dorado version: 0.7.0+71cc744
model: dna_r10.4.1_e8.2_400bps_sup@v5.0.0.
Operating system: Ubuntu 20.04
Hardware (CPUs, Memory, GPUs): Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz 160 threads, 512GB RAM, 1xT100 8GB, 4xA100 80GB
Source data type : POD5
Source data location : on device
Details about data (flow cell: FLO-PRO114M, kit: SQK-NBD114-24, estimated N50: 700b, number of reads 75M, total dataset size 800 GB POD5 separated in 72 files; one file per hour running):

I have also tried using --resume-from, but the second basecalling attempt did not show further progress and ended with the same error.

I reran the command from the log file above with dorado 0.7.2, and the basecalling completed successfully. Now I am attempting to basecall all files with the initial scheme.

alimelhakim commented 4 months ago

dorado 0.7.2 works for me, thanks

nanoporetech / dorado

Segmentation fault (core dumped) in basecalling, alignment, and barcode tagging simultaneously #925
