mulderdt closed this issue 3 days ago
Hi @mulderdt, we're working on tuning the auto batch size calculation to deliver both stability and performance, which is challenging across many hardware variants.
You found the appropriate workaround, though, which is to tune the batch size yourself, for example:
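A minimal sketch of what that looks like (the model complex, input path, and batch size value below are placeholders, not taken from your run):

# Cap the batch size explicitly instead of relying on auto selection;
# a value somewhat below the auto-selected one, in multiples of 64, is a reasonable starting point.
dorado basecaller hac,5mCG_5hmCG /path/to/pod5_dir/ --batchsize 1024 > calls.bam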
Does this mean your issue is resolved?
Best regards, Rich
I have the same issue when running on only four POD5 files.
command:
dorado basecaller hac,5mCG_5hmCG /project/sli68423_1316/users/yang/aml_nanopore/results/2024_10_30_guppy_run_test/run_script_PT01/input_pod5/ -x auto --verbose -r --reference /project/sli68423_1316/users/yang/reference/chm13v2.0/chm13v2.0.fa --output-dir /project/sli68423_1316/users/yang/aml_nanopore/results/2024_10_30_guppy_run_test/run_script_PT01/dorado_calls_PT01
error message:
[2024-10-30 09:54:39.902] [info] Running: "basecaller" "hac,5mCG_5hmCG" "/project/sli68423_1316/users/yang/aml_nanopore/results/2024_10_30_guppy_run_test/run_script_PT01/input_pod5/" "-x" "auto" "--verbose" "-r" "--reference" "/project/sli68423_1316/users/yang/reference/chm13v2.0/chm13v2.0.fa" "--output-dir" "/project/sli68423_1316/users/yang/aml_nanopore/results/2024_10_30_guppy_run_test/run_script_PT01/dorado_calls_PT01"
[2024-10-30 09:54:43.515] [info] Assuming cert location is /etc/ssl/certs/ca-bundle.crt
[2024-10-30 09:54:43.517] [info] - downloading dna_r10.4.1_e8.2_400bps_hac@v5.0.0 with httplib
[2024-10-30 09:54:44.514] [info] - downloading dna_r10.4.1_e8.2_400bps_hac@v5.0.0_5mCG_5hmCG@v2 with httplib
[2024-10-30 09:54:45.484] [info] Normalised: chunksize 10000 -> 9996
[2024-10-30 09:54:45.484] [info] Normalised: overlap 500 -> 498
[2024-10-30 09:54:45.484] [info] > Creating basecall pipeline
[2024-10-30 09:54:45.485] [debug] CRFModelConfig { qscale:1.050000 qbias:-0.600000 stride:6 bias:0 clamp:1 out_features:-1 state_len:4 outsize:1024 blank_score:2.000000 scale:1.000000 num_features:1 sample_rate:5000 sample_type:DNA mean_qscore_start_pos:60 SignalNormalisationParams { strategy:pa StandardisationScalingParams { standardise:1 mean:93.692398 stdev:23.506744}} BasecallerParams { chunk_size:9996 overlap:498 batch_size:0} convs: { 0: ConvParams { insize:1 size:16 winlen:5 stride:1 activation:swish} 1: ConvParams { insize:16 size:16 winlen:5 stride:1 activation:swish} 2: ConvParams { insize:16 size:384 winlen:19 stride:6 activation:tanh}} model_type: lstm { bias:0 outsize:1024 blank_score:2.000000 scale:1.000000}}
[2024-10-30 09:54:49.136] [debug] cuda:0 memory available: 41.59GB
[2024-10-30 09:54:49.136] [debug] cuda:0 memory limit 39.78GB
[2024-10-30 09:54:49.136] [debug] cuda:0 maximum safe estimated batch size at chunk size 9996 is 5952
[2024-10-30 09:54:49.136] [debug] cuda:0 maximum safe estimated batch size at chunk size 4998 is 11968
[2024-10-30 09:54:49.136] [debug] Auto batchsize cuda:0: testing up to 10240 in steps of 64
[2024-10-30 09:54:49.136] [debug] Largest batch size for cuda:0: 6336, time per chunk 0.041739 ms
[2024-10-30 09:54:49.136] [debug] Final batch size for cuda:0[0]: 5760
[2024-10-30 09:54:49.136] [debug] Final batch size for cuda:0[1]: 6336
[2024-10-30 09:54:49.136] [info] cuda:0 using chunk size 9996, batch size 5760
[2024-10-30 09:54:49.136] [debug] cuda:0 Model memory 27.02GB
[2024-10-30 09:54:49.136] [debug] cuda:0 Decode memory 11.15GB
[2024-10-30 09:54:50.524] [info] cuda:0 using chunk size 4998, batch size 6336
[2024-10-30 09:54:50.524] [debug] cuda:0 Model memory 14.86GB
[2024-10-30 09:54:50.524] [debug] cuda:0 Decode memory 6.13GB
[2024-10-30 09:55:57.031] [debug] Loaded index with 25 target seqs
[2024-10-30 09:56:00.069] [debug] BasecallerNode chunk size 9996
[2024-10-30 09:56:00.069] [debug] BasecallerNode chunk size 4998
[2024-10-30 09:56:00.097] [debug] Load reads from file /project/sli68423_1316/users/yang/aml_nanopore/results/2024_10_30_guppy_run_test/run_script_PT01/input_pod5/pod5_five/PAO20051_pass_f9438981_ac3ed4cb_1001.pod5
[2024-10-30 09:56:01.317] [debug] Load reads from file /project/sli68423_1316/users/yang/aml_nanopore/results/2024_10_30_guppy_run_test/run_script_PT01/input_pod5/pod5_five/PAO20051_pass_f9438981_ac3ed4cb_1000.pod5
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
Exception raised from createCublasHandle at /pytorch/pyold/aten/src/ATen/cuda/CublasHandlePool.cpp:18 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f468aa3e9b7 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f4683fc3115 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: <unknown function> + 0xa90879b (0x7f468a90e79b in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: at::cuda::getCurrentCUDABlasHandle() + 0xb65 (0x7f468a9102b5 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: <unknown function> + 0xa903be4 (0x7f468a909be4 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: <unknown function> + 0xa90dbf8 (0x7f468a913bf8 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #6: <unknown function> + 0xa915102 (0x7f468a91b102 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: <unknown function> + 0xa617dd4 (0x7f468a61ddd4 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #8: <unknown function> + 0xa617e6d (0x7f468a61de6d in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #9: at::_ops::addmm::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) + 0x1a1 (0x7f46850f4951 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #10: torch::nn::LinearImpl::forward(at::Tensor const&) + 0xa3 (0x7f468806cf33 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #11: dorado() [0xad007a]
frame #12: dorado() [0xad4398]
frame #13: dorado() [0xac0370]
frame #14: dorado() [0xac04d8]
frame #15: dorado() [0xabc963]
frame #16: <unknown function> + 0x1196e380 (0x7f4691974380 in /project/sli68423_1316/users/yang/tools/dorado-0.8.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #17: <unknown function> + 0x7ea5 (0x7f467f046ea5 in /lib64/libpthread.so.0)
frame #18: clone + 0x6d (0x7f467e2bbb0d in /lib64/libc.so.6)
Aborted
@liuyang2006, reduce the batch size. The auto batch size selection used 6336; try 5120:
dorado basecaller <model> <reads> --batchsize 5120 ... > calls.bam
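Your log above reports 27.02 GB model memory plus 11.15 GB decode memory at batch size 5760, which is already close to the 39.78 GB limit, so reducing the batch size should free headroom for the cuBLAS handle that failed to allocate. If 5120 still aborts, one approach is to step the value down until the run completes; a rough sketch (the model, input path, and candidate batch sizes here are illustrative, kept as multiples of 64 to match the auto-selection step size):

# Try progressively smaller batch sizes until dorado exits cleanly
# (i.e. without the CUBLAS_STATUS_ALLOC_FAILED abort).
model=hac,5mCG_5hmCG
reads=/path/to/pod5_dir
for bs in 5120 4096 3072 2048; do
    dorado basecaller "$model" "$reads" --batchsize "$bs" > calls.bam && break
done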
Issue Report
Please describe the issue:
Basecalling of HG002 pod5s using dorado v0.7.2 as part of wf-basecalling results in termination of basecalling for some pod5s. This run has previously been basecalled successfully multiple times using dorado and wf-basecalling with v4 models. When I cancelled the run, 188/728 pod5s had been basecalled, with 27 failures.
Individual pod5s were submitted for basecalling using A6000 and 3090 GPUs on a Slurm cluster. Failures were only observed on 3090 GPUs. Failures were observed across multiple nodes and were not associated with any particular device or node. Failed pod5s were not associated with any particular time in the run (e.g. failing pod5s include numbers 45, 194, 340, 478, 609, 611). Our pod5s contain ~15,000 reads each, judging from the resulting BAMs.
Checking the batch size selected by dorado, I observed the following. Example passing:
Example failing:
I resolved the issue by manually setting the batchsize parameter to 256.
This suggests to me that the automatic selection of the optimal batch size for the v5 model needs a tweak for the 3090 24 GB GPU.
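For reference, the direct dorado equivalent of that workaround looks roughly like this (model name and path are placeholders; when running through wf-basecalling the value is passed via the workflow's corresponding batch-size option rather than on the dorado command line):

# Manual batch size cap that avoided the crashes on the 24 GB 3090 cards.
dorado basecaller hac /path/to/pod5_dir/ --batchsize 256 > calls.bam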
Steps to reproduce the issue:
Ran wf-basecalling version v1.3.0, which wraps dorado v0.7.2
wf-basecalling
with config:
Run environment:
Logs
from .command.log