nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Unknown Function Error in libdorado_torch_lib #606

Closed ddubocan closed 5 months ago

ddubocan commented 9 months ago

Hi,

I generally parallelize dorado basecalling by generating a list of read_ids via pod5 view, then splitting the read_ids and feeding each subset to dorado via a batch script. This has worked for the last 20 or so sequencing runs over the past few weeks, but since last week about 10% of my dorado runs fail with the same error. The other 90% of runs work fine.
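The splitting step described above can be sketched as follows. This is a minimal illustration, not the poster's actual script: it assumes the read IDs have already been written one per line to a plain text file with no header, and the helper name `split_read_ids`, the chunk file naming, and the choice of chunk count are all hypothetical.

```python
from pathlib import Path


def split_read_ids(ids_file, n_chunks, out_dir):
    """Split a one-ID-per-line file into up to n_chunks files of near-equal size.

    Returns the list of chunk file paths that were written.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    # Keep non-empty lines only; assumes the file has no header row.
    ids = [line.strip() for line in Path(ids_file).read_text().splitlines() if line.strip()]
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    chunk_paths = []
    base, extra = divmod(len(ids), n_chunks)
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional ID each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # fewer IDs than chunks: stop instead of writing empty files
        chunk_path = out_dir / f"read_ids_{i:03d}.txt"  # hypothetical naming
        chunk_path.write_text("\n".join(ids[start:start + size]) + "\n")
        chunk_paths.append(chunk_path)
        start += size
    return chunk_paths
```

Each resulting file could then be handed to its own dorado job through the `-l` option, as in the command at the end of the error output below.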

Here is the error:

[2024-01-30 11:51:05.694] [info] > Creating basecall pipeline
[2024-01-30 11:51:12.790] [info]  - set batch size for cuda:0 to 576
[2024-01-30 11:51:13.457] [info] Barcode for SQK-RBK114-24
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
Exception raised from createCublasHandle at /pytorch/pyold/aten/src/ATen/cuda/CublasHandlePool.cpp:18 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7efbd0e889b7 in /oak/stanford/groups/altemose/tools/dorado-0.5.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7efbca40d115 in /oak/stanford/groups/altemose/tools/dorado-0.5.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: <unknown function> + 0xa90879b (0x7efbd0d5879b in /oak/stanford/groups/altemose/tools/dorado-0.5.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: at::cuda::getCurrentCUDABlasHandle() + 0x881 (0x7efbd0d59fd1 in /oak/stanford/groups/altemose/tools/dorado-0.5.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: <unknown function> + 0xa903be4 (0x7efbd0d53be4 in /oak/stanford/groups/altemose/tools/dorado-0.5.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: <unknown function> + 0xa90dbf8 (0x7efbd0d5dbf8 in /oak/stanford/groups/altemose/tools/dorado-0.5.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #6: <unknown function> + 0xa915102 (0x7efbd0d65102 in /oak/stanford/groups/altemose/tools/dorado-0.5.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: <unknown function> + 0xa617dd4 (0x7efbd0a67dd4 in /oak/stanford/groups/altemose/tools/dorado-0.5.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #8: <unknown function> + 0xa617e6d (0x7efbd0a67e6d in /oak/stanford/groups/altemose/tools/dorado-0.5.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #9: at::_ops::addmm::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) + 0x1a1 (0x7efbcb53e951 in /oak/stanford/groups/altemose/tools/dorado-0.5.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #10: torch::nn::LinearImpl::forward(at::Tensor const&) + 0xa3 (0x7efbce4b6f33 in /oak/stanford/groups/altemose/tools/dorado-0.5.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #11: /oak/stanford/groups/altemose/tools/dorado-0.5.1-linux-x64/bin/dorado() [0x9f4e7a]
frame #12: /oak/stanford/groups/altemose/tools/dorado-0.5.1-linux-x64/bin/dorado() [0x9f98f8]
frame #13: /oak/stanford/groups/altemose/tools/dorado-0.5.1-linux-x64/bin/dorado() [0x9e9bf0]
frame #14: /oak/stanford/groups/altemose/tools/dorado-0.5.1-linux-x64/bin/dorado() [0x9e9d58]
frame #15: /oak/stanford/groups/altemose/tools/dorado-0.5.1-linux-x64/bin/dorado() [0x9ea388]
frame #16: <unknown function> + 0x1196e380 (0x7efbd7dbe380 in /oak/stanford/groups/altemose/tools/dorado-0.5.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #17: <unknown function> + 0x7ea5 (0x7efbc5bf1ea5 in /lib64/libpthread.so.0)
frame #18: clone + 0x6d (0x7efbc4a3eb0d in /lib64/libc.so.6)

/var/spool/slurmd/job40232545/slurm_script: line 29: 18433 Aborted                 /oak/stanford/groups/altemose/tools/dorado-0.5.1-linux-x64/bin/dorado basecaller --modified-bases 6mA 5mC_5hmC --kit-name SQK-RBK114-24 -c 7500 -l $read_ids --reference $reference /oak/stanford/groups/altemose/tools/dorado-0.5.1-linux-x64/models/dna_r10.4.1_e8.2_400bps_sup@v4.3.0 $pod5 > ${read_ids}.dorado_v5.1_mA_mC.bam

ddubocan commented 9 months ago

Additionally, this only happens when performing barcoding with the --kit-name flag.

tijyojwad commented 9 months ago

Hmm, interesting. That's quite unexpected, as barcoding doesn't use any torch functionality.

What's the dorado version and what's the command line? What GPU are you using, and what's your CUDA version?

When you say this only happens when performing barcoding, do you mean the same file was basecalled without the kit name and it worked?
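One way to gather the GPU details asked for above is to query `nvidia-smi` and parse its CSV output. The sketch below is an assumption about how that could be scripted, not something from the thread: the function names are hypothetical, and the fields chosen report the driver version and memory use, not the CUDA toolkit version, which would need to be checked separately.

```python
import subprocess

# Fields requested from nvidia-smi, in the order they appear in each output row.
QUERY_FIELDS = ["name", "driver_version", "memory.total", "memory.used"]


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader` output into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(QUERY_FIELDS):
            continue  # skip rows that do not match the requested fields
        gpus.append(dict(zip(QUERY_FIELDS, values)))
    return gpus


def query_gpus():
    """Run nvidia-smi on the current node and return one dict per GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

Calling `query_gpus()` at the start of a batch job and logging the result would record which GPU each run landed on and how much of its memory was already in use, which may help when comparing failing and successful runs.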

HalfPhoton commented 5 months ago

Closing as there's been no response. Please re-open if needed.