nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

basecalling failing with v5 model on some pod5s with 3090 GPU (24GB) but not A6000 GPU (48GB) #1013

Closed mulderdt closed 3 days ago

mulderdt commented 2 months ago

Issue Report

Please describe the issue:

Basecalling of HG002 pod5s with dorado v0.7.2, as part of wf-basecalling, terminates for some pod5s. This run has previously been basecalled successfully multiple times using dorado and wf-basecalling with v4 models. When I cancelled the run, 188/728 pod5s had been basecalled, with 27 failures.

Individual pod5s were submitted for basecalling on A6000 and 3090 GPUs on a Slurm cluster. Failures were only observed on 3090 GPUs. They occurred across multiple nodes and were not associated with any particular device or node, nor with any particular point in the run (e.g. failing pod5s include numbers 45, 194, 340, 478, 609 and 611). Our pod5s contain ~15,000 reads each, based on the resulting BAMs.
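
(For reference, the ~15,000 figure is the kind of per-file count you get from the output BAMs, e.g. with samtools; the BAM name below is a placeholder, and -F 0x900 excludes secondary/supplementary records so each read is counted once:)

# Count primary records in one per-pod5 output BAM (~15,000 expected here).
samtools view -c -F 0x900 <per_pod5_calls>.bam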

Checking the batch size selected by dorado, a passing example shows:

[info] cuda:5 using chunk size 12288, batch size 480
[info] cuda:5 using chunk size 6144, batch size 480

and a failing example:

[info] cuda:3 using chunk size 12288, batch size 256
[info] cuda:3 using chunk size 6144, batch size 480

I resolved the issue by manually setting the batch size parameter to 256.

This suggests to me that the automatic selection of the optimal batch size for the v5 model needs a tweak for the 24 GB 3090 GPUs.
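
For anyone wanting to apply the workaround outside the workflow, here is a minimal sketch of an equivalent standalone dorado call with the batch size capped, reusing the flags from the log further down plus --batchsize (the input pod5 and output names are placeholders; the model names are the ones used in this run, and you can swap in local paths if the models were already fetched with dorado download):

# Hypothetical standalone re-run of one failing pod5 on a 3090, with the
# batch size fixed at 256 instead of relying on auto-selection.
dorado basecaller \
    dna_r10.4.1_e8.2_400bps_sup@v5.0.0 \
    pod5/<failing_file>.pod5 \
    --modified-bases-models dna_r10.4.1_e8.2_400bps_sup@v5.0.0_5mCG_5hmCG@v1 \
    --device cuda:0 \
    --batchsize 256 \
    > calls.bam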

Steps to reproduce the issue:

Ran wf-basecalling version v1.3.0, which wraps dorado v0.7.2:

wf-basecalling

SAMPLENAME=HG002-hg38-v5
INPUT=/projects/sbs_primary1/promethion1/data/F134812/PAS57600/20231107_1024_5A_PAS57600_bc8ee63d/pod5
WORK=/projects/dmulder_prj/scratch/steve/nextflow/work
##
MODEL=dna_r10.4.1_e8.2_400bps_sup@v5.0.0
REMORA=dna_r10.4.1_e8.2_400bps_sup@v5.0.0_5mCG_5hmCG@v1
OUTPUT=/projects/dmulder_prj/scratch/steve/wf-basecalling/$SAMPLENAME
##
GENOME=/projects/alignment_references/Homo_sapiens/hg38_no_alt/genome/fasta/hg38_no_alt.fa
##
CONFIG=/projects/dmulder_prj/scratch/nextflow/production-workflows/configs/wf-basecalling.config
##export to fix libtorch issue with multiple libraries
export KMP_DUPLICATE_LIB_OK=TRUE
##set ulimit otherwise nextflow fails on the snvpileup
ulimit -u 100000
##
mkdir -p $OUTPUT
##
nextflow run epi2me-labs/wf-basecalling \
    -r v1.3.0 \
    -profile slurmgpu \
    -c $CONFIG \
    -w $WORK \
    --input $INPUT \
    --basecaller_cfg $MODEL \
    --remora_cfg $REMORA \
    --dorado_ext pod5 \
    --ref $GENOME \
    --sample_name $SAMPLENAME \
    --out_dir $OUTPUT \
    -resume 

with config:

params {
    basecaller_chunk_size = 1
    cuda_device = "cuda:\${SLURM_JOB_GPUS}"
}

profiles {
    slurmgpu {
        singularity {
            enabled = true
            autoMounts = true
            envWhitelist = "SLURM_JOB_GPUS,KMP_DUPLICATE_LIB_OK"
        }
        process.executor = 'slurm'
        shell = ['/bin/bash', '-eo', 'pipefail']
        process."withLabel:gpu".queue = "gpu3090,gpuA6000"
        process."withLabel:gpu".memory = { 48.GB * task.attempt }
        process."withLabel:gpu".errorStrategy = { task.exitStatus in 137..140 ? 'retry' : 'ignore' }
        process."withLabel:gpu".maxRetries = 3
        process."withLabel:gpu".clusterOptions = "--gres=gpu:1"
        process."withLabel:gpu".maxForks = null
        process."withLabel:gpu".containerOptions = "--nv"
        process."withLabel:wf_basecalling".queue = "cputesting"
        process."withLabel:wf_basecalling".memory = "64 GB"
        process."withLabel:wf_basecalling".cpus = 8
        process."withLabel:wf_common".queue = "cputesting"
        process."withLabel:wf_common".memory = "64 GB"
        process."withLabel:wf_common".cpus = 8
    }
}
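
As a side note, here is a hypothetical sanity check for the cuda_device mapping above, confirming on a GPU node that SLURM_JOB_GPUS holds the index that gets interpolated into "cuda:${SLURM_JOB_GPUS}" (the partition name is taken from the queues in the config; adjust for your cluster):

# Ask Slurm for one GPU and print the index it exposes, alongside the
# devices nvidia-smi can see on that node.
srun --partition=gpu3090 --gres=gpu:1 --pty \
    bash -c 'echo "SLURM_JOB_GPUS=${SLURM_JOB_GPUS}"; nvidia-smi --query-gpu=index,name,memory.total --format=csv'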

Run environment:

instrument=PC48B254
position=5A
flow_cell_id=PAS57600
sample_id=PAS57600
protocol_group_id=F134812
protocol=sequencing/sequencing_PRO114_DNA_e8_2_400K:FLO-PRO114M:SQK-LSK114-XL:400
protocol_run_id=bc8ee63d-5b88-45f2-adb5-fa15eb568da6
acquisition_run_id=3f415733bdb768fe6f02c05c359ff9341e981df5
started=2023-11-07T10:25:59.076080-08:00
acquisition_stopped=2023-11-11T10:26:01.328013-08:00
processing_stopped=2023-11-11T10:26:03.362334-08:00
basecalling_enabled=0
sequencing_summary_file=sequencing_summary_PAS57600_bc8ee63d_3f415733.txt
fast5_files_in_final_dest=0
fast5_files_in_fallback=0
pod5_files_in_final_dest=728
pod5_files_in_fallback=0
fastq_files_in_final_dest=0
fastq_files_in_fallback=0

Logs

from .command.log

INFO:    Converting SIF file to temporary sandbox...
WARNING: underlay of /etc/localtime required more than 50 (83) bind mounts
WARNING: underlay of /usr/bin/nvidia-smi required more than 50 (399) bind mounts
[2024-09-03 17:14:49.773] [info] Running: "basecaller" "/models/dna_r10.4.1_e8.2_400bps_sup@v5.0.0" "." "--modified-bases-models" "/models/dna_r10.4.1_e8.2_400bps_sup@v5.0.0_5mCG_5hmCG@v1" "--device" "cuda:3"
[2024-09-03 17:14:49.842] [info] > Creating basecall pipeline
[2024-09-03 17:15:03.211] [info] cuda:3 using chunk size 12288, batch size 256
[2024-09-03 17:15:03.933] [info] cuda:3 using chunk size 6144, batch size 480
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
Exception raised from createCublasHandle at /pytorch/pyold/aten/src/ATen/cuda/CublasHandlePool.cpp:18 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x14c2e616c9b7 in /home/epi2melabs/dorado/lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x14c2df6f1115 in /home/epi2melabs/dorado/lib/libdorado_torch_lib.so)
frame #2: <unknown function> + 0xa90879b (0x14c2e603c79b in /home/epi2melabs/dorado/lib/libdorado_torch_lib.so)
frame #3: at::cuda::getCurrentCUDABlasHandle() + 0xb65 (0x14c2e603e2b5 in /home/epi2melabs/dorado/lib/libdorado_torch_lib.so)
frame #4: <unknown function> + 0xa903be4 (0x14c2e6037be4 in /home/epi2melabs/dorado/lib/libdorado_torch_lib.so)
frame #5: <unknown function> + 0xa90dbf8 (0x14c2e6041bf8 in /home/epi2melabs/dorado/lib/libdorado_torch_lib.so)
frame #6: <unknown function> + 0xa915102 (0x14c2e6049102 in /home/epi2melabs/dorado/lib/libdorado_torch_lib.so)
frame #7: <unknown function> + 0xa617dd4 (0x14c2e5d4bdd4 in /home/epi2melabs/dorado/lib/libdorado_torch_lib.so)
frame #8: <unknown function> + 0xa617e6d (0x14c2e5d4be6d in /home/epi2melabs/dorado/lib/libdorado_torch_lib.so)
frame #9: at::_ops::addmm::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) + 0x1a1 (0x14c2e0822951 in /home/epi2melabs/dorado/lib/libdorado_torch_lib.so)
frame #10: torch::nn::LinearImpl::forward(at::Tensor const&) + 0xa3 (0x14c2e379af33 in /home/epi2melabs/dorado/lib/libdorado_torch_lib.so)
frame #11: dorado() [0xa77cca]
frame #12: dorado() [0xa7c9d8]
frame #13: dorado() [0xa6a2b0]
frame #14: dorado() [0xa6a418]
frame #15: dorado() [0xa66efb]
frame #16: <unknown function> + 0x1196e380 (0x14c2ed0a2380 in /home/epi2melabs/dorado/lib/libdorado_torch_lib.so)
frame #17: <unknown function> + 0x8609 (0x14c2db2cd609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #18: clone + 0x43 (0x14c2da725353 in /lib/x86_64-linux-gnu/libc.so.6)
HalfPhoton commented 2 months ago

Hi @mulderdt, we're working on tuning the auto batch size calculation to deliver both stability and performance, which is challenging across many hardware variants.

You found the appropriate solution, though, which is to tune the batch size yourself.

Does this mean your issue is resolved?

Best regards, Rich

liuyang2006 commented 1 week ago

I have the same issue when running on only four POD5 files.

command:

dorado basecaller hac,5mCG_5hmCG \
    /project/sli68423_1316/users/yang/aml_nanopore/results/2024_10_30_guppy_run_test/run_script_PT01/input_pod5/ \
    -x auto --verbose -r \
    --reference /project/sli68423_1316/users/yang/reference/chm13v2.0/chm13v2.0.fa \
    --output-dir /project/sli68423_1316/users/yang/aml_nanopore/results/2024_10_30_guppy_run_test/run_script_PT01/dorado_calls_PT01

error message:

[2024-10-30 09:54:39.902] [info] Running: "basecaller" "hac,5mCG_5hmCG" "/project/sli68423_1316/users/yang/aml_nanopore/results/2024_10_30_guppy_run_test/run_script_PT01/input_pod5/" "-x" "auto" "--verbose" "-r" "--reference" "/project/sli68423_1316/users/yang/reference/chm13v2.0/chm13v2.0.fa" "--output-dir" "/project/sli68423_1316/users/yang/aml_nanopore/results/2024_10_30_guppy_run_test/run_script_PT01/dorado_calls_PT01"
[2024-10-30 09:54:43.515] [info] Assuming cert location is /etc/ssl/certs/ca-bundle.crt
[2024-10-30 09:54:43.517] [info]  - downloading dna_r10.4.1_e8.2_400bps_hac@v5.0.0 with httplib
[2024-10-30 09:54:44.514] [info]  - downloading dna_r10.4.1_e8.2_400bps_hac@v5.0.0_5mCG_5hmCG@v2 with httplib
[2024-10-30 09:54:45.484] [info] Normalised: chunksize 10000 -> 9996
[2024-10-30 09:54:45.484] [info] Normalised: overlap 500 -> 498
[2024-10-30 09:54:45.484] [info] > Creating basecall pipeline
[2024-10-30 09:54:45.485] [debug] CRFModelConfig { qscale:1.050000 qbias:-0.600000 stride:6 bias:0 clamp:1 out_features:-1 state_len:4 outsize:1024 blank_score:2.000000 scale:1.000000 num_features:1 sample_rate:5000 sample_type:DNA mean_qscore_start_pos:60 SignalNormalisationParams { strategy:pa StandardisationScalingParams { standardise:1 mean:93.692398 stdev:23.506744}} BasecallerParams { chunk_size:9996 overlap:498 batch_size:0} convs: { 0: ConvParams { insize:1 size:16 winlen:5 stride:1 activation:swish} 1: ConvParams { insize:16 size:16 winlen:5 stride:1 activation:swish} 2: ConvParams { insize:16 size:384 winlen:19 stride:6 activation:tanh}} model_type: lstm { bias:0 outsize:1024 blank_score:2.000000 scale:1.000000}}
[2024-10-30 09:54:49.136] [debug] cuda:0 memory available: 41.59GB
[2024-10-30 09:54:49.136] [debug] cuda:0 memory limit 39.78GB
[2024-10-30 09:54:49.136] [debug] cuda:0 maximum safe estimated batch size at chunk size 9996 is 5952
[2024-10-30 09:54:49.136] [debug] cuda:0 maximum safe estimated batch size at chunk size 4998 is 11968
[2024-10-30 09:54:49.136] [debug] Auto batchsize cuda:0: testing up to 10240 in steps of 64
[2024-10-30 09:54:49.136] [debug] Largest batch size for cuda:0: 6336, time per chunk 0.041739 ms
[2024-10-30 09:54:49.136] [debug] Final batch size for cuda:0[0]: 5760
[2024-10-30 09:54:49.136] [debug] Final batch size for cuda:0[1]: 6336
[2024-10-30 09:54:49.136] [info] cuda:0 using chunk size 9996, batch size 5760
[2024-10-30 09:54:49.136] [debug] cuda:0 Model memory 27.02GB
[2024-10-30 09:54:49.136] [debug] cuda:0 Decode memory 11.15GB
[2024-10-30 09:54:50.524] [info] cuda:0 using chunk size 4998, batch size 6336
[2024-10-30 09:54:50.524] [debug] cuda:0 Model memory 14.86GB
[2024-10-30 09:54:50.524] [debug] cuda:0 Decode memory 6.13GB
[2024-10-30 09:55:57.031] [debug] Loaded index with 25 target seqs
[2024-10-30 09:56:00.069] [debug] BasecallerNode chunk size 9996
[2024-10-30 09:56:00.069] [debug] BasecallerNode chunk size 4998
[2024-10-30 09:56:00.097] [debug] Load reads from file /project/sli68423_1316/users/yang/aml_nanopore/results/2024_10_30_guppy_run_test/run_script_PT01/input_pod5/pod5_five/PAO20051_pass_f9438981_ac3ed4cb_1001.pod5
[2024-10-30 09:56:01.317] [debug] Load reads from file /project/sli68423_1316/users/yang/aml_nanopore/results/2024_10_30_guppy_run_test/run_script_PT01/input_pod5/pod5_five/PAO20051_pass_f9438981_ac3ed4cb_1000.pod5
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
Exception raised from createCublasHandle at /pytorch/pyold/aten/src/ATen/cuda/CublasHandlePool.cpp:18 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f468aa3e9b7 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f4683fc3115 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: <unknown function> + 0xa90879b (0x7f468a90e79b in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: at::cuda::getCurrentCUDABlasHandle() + 0xb65 (0x7f468a9102b5 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: <unknown function> + 0xa903be4 (0x7f468a909be4 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: <unknown function> + 0xa90dbf8 (0x7f468a913bf8 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #6: <unknown function> + 0xa915102 (0x7f468a91b102 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: <unknown function> + 0xa617dd4 (0x7f468a61ddd4 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #8: <unknown function> + 0xa617e6d (0x7f468a61de6d in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #9: at::_ops::addmm::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) + 0x1a1 (0x7f46850f4951 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #10: torch::nn::LinearImpl::forward(at::Tensor const&) + 0xa3 (0x7f468806cf33 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #11: dorado() [0xad007a]
frame #12: dorado() [0xad4398]
frame #13: dorado() [0xac0370]
frame #14: dorado() [0xac04d8]
frame #15: dorado() [0xabc963]
frame #16: <unknown function> + 0x1196e380 (0x7f4691974380 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #17: <unknown function> + 0x7ea5 (0x7f467f046ea5 in /lib64/libpthread.so.0)
frame #18: clone + 0x6d (0x7f467e2bbb0d in /lib64/libc.so.6)

Aborted
HalfPhoton commented 1 week ago

@liuyang2006, reduce the batch size:

The auto batch size selection used 6336; try 5120:

dorado basecaller <model> <reads> --batchsize 5120 ... > calls.bam
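
Applied to your earlier command, that would look something like the sketch below (long paths replaced with placeholders):

# Same basecalling run, but with the batch size capped at 5120 instead of
# the auto-selected 6336 (<input_pod5_dir>, <ref.fa> and <out_dir> stand in
# for the full paths from the command above).
dorado basecaller hac,5mCG_5hmCG <input_pod5_dir> \
    -x auto -r --verbose \
    --reference <ref.fa> \
    --batchsize 5120 \
    --output-dir <out_dir>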