nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

CUDA device requested but no devices found #251

Closed krpcem closed 1 year ago

krpcem commented 1 year ago

Hello,

I downloaded dorado today from GitHub and installed it using the instructions there. I was able to download a model with dorado. When I attempted basecalling, however, no CUDA devices could be found. They are found using PyTorch.

When I attempt to run basecalling, it exits almost immediately:

dorado basecaller --emit-fastq dna_r10.4.1_e8.2_260bps_fast@v4.1.0 -x all /reads_volume/test
[05:02:31.076] [info] > Creating basecall pipeline
[05:02:31.078] [error] CUDA device requested but no devices found.

I am able to see the GPU using Python's torch module:

>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> print(torch.version.cuda)
11.7

System details:

dorado version: 0.3.0+e2ba869
Build: cuda_11.5.r11.5/compiler.30672275_0
AWS image: ubuntu-pro-server/images/hvm-ssd/ubuntu-jammy-22.04-amd64-pro-server-20230531
lspci: 00:1e.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)

Thank you for any help you can provide.

iiSeymour commented 1 year ago

@krpcem what do you see if you replace -x all with -x cuda:0?

krpcem commented 1 year ago

Thanks @iiSeymour. Is cuda:0 the first GPU? As you can see below, it fails with a different issue.

ubuntu@ip-172-31-15-219:~$ dorado basecaller --emit-fastq  dna_r10.4.1_e8.2_260bps_fast@v4.1.0 -x cuda:0  /reads_volume/test
[2023-06-20 11:54:12.647] [info] > Creating basecall pipeline
[2023-06-20 11:54:28.051] [error] CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
Exception raised from gemm<c10::Half> at ../aten/src/ATen/cuda/CUDABlas.cpp:446 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7fc13ae5a6bb in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xbf (0x7fc13ae555ef in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libc10.so)
frame #2: <unknown function> + 0x30aa92b (0x7fc13e6aa92b in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0x30de70b (0x7fc13e6de70b in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cuda.so)
frame #4: at::native::structured_mm_out_cuda::impl(at::Tensor const&, at::Tensor const&, at::Tensor const&) + 0x56 (0x7fc13e6df1d6 in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x2dc04cc (0x7fc13e3c04cc in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x2dc0583 (0x7fc13e3c0583 in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cuda.so)
frame #7: at::_ops::mm::call(at::Tensor const&, at::Tensor const&) + 0xdb (0x7fc1944400ab in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x187f7df (0x7fc19387f7df in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cpu.so)
frame #9: at::native::matmul(at::Tensor const&, at::Tensor const&) + 0x58 (0x7fc19387fd48 in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x29ab4d3 (0x7fc1949ab4d3 in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cpu.so)
frame #11: at::_ops::matmul::call(at::Tensor const&, at::Tensor const&) + 0xdb (0x7fc1945504bb in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x286cfe (0x5640cac2acfe in dorado)
frame #13: <unknown function> + 0x192721 (0x5640cab36721 in dorado)
frame #14: <unknown function> + 0x204d5f (0x5640caba8d5f in dorado)
frame #15: <unknown function> + 0x2051a9 (0x5640caba91a9 in dorado)
frame #16: <unknown function> + 0x1c7efb (0x5640cab6befb in dorado)
frame #17: <unknown function> + 0x1c802e (0x5640cab6c02e in dorado)
frame #18: <unknown function> + 0x2063f9 (0x5640cabaa3f9 in dorado)
frame #19: <unknown function> + 0x2069c9 (0x5640cabaa9c9 in dorado)
frame #20: <unknown function> + 0x1c7d2b (0x5640cab6bd2b in dorado)
frame #21: <unknown function> + 0x28761f (0x5640cac2b61f in dorado)
frame #22: <unknown function> + 0x280153 (0x5640cac24153 in dorado)
frame #23: <unknown function> + 0x27f8ca (0x5640cac238ca in dorado)
frame #24: <unknown function> + 0x13626d (0x5640caada26d in dorado)
frame #25: <unknown function> + 0x13adb8 (0x5640caadedb8 in dorado)
frame #26: <unknown function> + 0xdfc49 (0x5640caa83c49 in dorado)
frame #27: <unknown function> + 0x29d90 (0x7fc12f829d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #28: __libc_start_main + 0x80 (0x7fc12f829e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #29: <unknown function> + 0xe6a95 (0x5640caa8aa95 in dorado)
krpcem commented 1 year ago

In the end I returned to guppy. I hope that by the next time I have a dataset to run, Dorado is ready out of the box.

iiSeymour commented 1 year ago

Sorry @krpcem, the issue above is because dorado is picking up a libcublas.so.11 from somewhere else on your system and not the dorado lib directory. You can confirm that with ldd ./dorado-0.3.0/bin/dorado and the fix is to put ./dorado-0.3.0/lib first on your $LD_LIBRARY_PATH.

We are moving to static linking to avoid this problem very soon.
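For anyone hitting the same symptom, a check along these lines should confirm which libcublas is being picked up (paths assume the release tarball was extracted to ./dorado-0.3.0; adjust to your install location):

# show which shared libraries the dorado binary resolves;
# libcublas.so.11 should point into the dorado lib directory
ldd ./dorado-0.3.0/bin/dorado | grep cublas

# put dorado's bundled libraries first on the search path, then retry the basecaller
export LD_LIBRARY_PATH="$PWD/dorado-0.3.0/lib:$LD_LIBRARY_PATH"
./dorado-0.3.0/bin/dorado basecaller --emit-fastq dna_r10.4.1_e8.2_260bps_fast@v4.1.0 -x cuda:0 /reads_volume/test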

buhanfeng commented 1 year ago

Hello @iiSeymour, I have encountered the same problem but in a different situation. Here is the error message from dorado:

[screenshot: dorado error message]

My GPU and CUDA info:

[screenshots: GPU and CUDA version information]

Could you offer me some advice on why dorado can't find my device? Could this be a version incompatibility problem? Thanks.

SwapnilDoijad commented 1 year ago

dorado v0.3.1, same issue: [error] CUDA device requested but no devices found.

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

$ dorado basecaller --emit-fastq -v dna_r10.4.1_e8.2_400bps_hac@v4.2.0 pod5/ > duplex.bam
[2023-07-07 10:51:55.585] [info] > No duplex pairs file provided, pairing will be performed automatically
[2023-07-07 10:51:57.124] [debug] > Reads to process: 3000000
[2023-07-07 10:52:00.162] [debug] Written 0 records.
[2023-07-07 10:52:00.162] [error] CUDA device requested but no devices found.
$

warthmann commented 1 year ago

Hello, I am encountering the same issue. I am attempting to use dorado through EPI2ME and Docker. Other EPI2ME workflows work for me, as does basecalling with guppy using CUDA from the command line. I am on a ThinkPad T580 running Ubuntu 20.04. Below are parts of the EPI2ME log file and a screenshot of Docker invoking nvidia-smi. Any help is greatly appreciated. In any case, the workflow would benefit from a command to test and troubleshoot the setup. Thanks a lot!

This is epi2me-labs/wf-basecalling v0.7.2.

[34/d9c940] Submitted process > getParams
[d8/26b1e8] Submitted process > getVersions
[ea/7aa6db] Submitted process > wf_dorado:dorado (3)
[ec/d81b14] Submitted process > wf_dorado:dorado (1)
ERROR ~ Error executing process > 'wf_dorado:dorado (3)'
Caused by:
  Process wf_dorado:dorado (3) terminated with an error exit status (1)
Command executed:
  set +e
  source /opt/nvidia/entrypoint.d/*-gpu-driver-check.sh # runtime driver check msg
  set -e
  dorado basecaller ${DRD_MODELS_PATH}/dna_r9.4.1_e8_hac@v3.3 . --device cuda:all | samtools view --no-PG -b -o 2.ubam -
Command exit status:
  1
Command output:
  (empty)
Command error:
  [2023-07-08 14:43:32.251] [info] > Creating basecall pipeline
  [W CUDAFunctions.cpp:109] Warning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (function operator())
  [2023-07-08 14:43:32.258] [error] CUDA device requested but no devices found.
  [main_samview] fail to read the header from "-".
Work dir:
  /home/pbgl/epi2melabs/instances/wf-basecalling_bc8d8970-d522-48c1-83f4-507f9f3ccab1/work/ea/7aa6dbaac4484b161e5b896a7de3f5
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named .command.sh
-- Check '/home/pbgl/epi2melabs/instances/wf-basecalling_bc8d8970-d522-48c1-83f4-507f9f3ccab1/nextflow.log' file for details
WARN: Killing running tasks (1)

Screenshot from 2023-07-08 16-53-02

SamStudio8 commented 1 year ago

@warthmann I've opened your comment as an issue on the wf-basecalling repository: https://github.com/epi2me-labs/wf-basecalling/issues/12

tijyojwad commented 1 year ago

Hi @buhanfeng - it looks like you're running dorado as root but running nvidia-smi as your own user. Are you running dorado in a Docker container or something similar? Can you check whether nvidia-smi within that same environment shows the GPUs?
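For reference, a quick way to verify GPU visibility from inside a container is something like the following (a sketch assuming the NVIDIA Container Toolkit is installed; the CUDA image tag is just an example):

# should print the same GPU table as on the host;
# if this fails, the container runtime setup is the problem rather than dorado
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi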

wangguiqian commented 5 months ago

I get the same issue. Could you give me some advice?

[2024-05-21 01:10:14.073] [info] Running: "basecaller" "../../Data/rna002_70bps_hac@v3/" "../../sx-50h/pass_data/" "--min-qscore" "1" "--resume-from" "in
[2024-05-21 01:10:14.108] [info] > Creating basecall pipeline
[2024-05-21 01:13:05.497] [info] - BAM format does not support U, so RNA output files will include T instead of U for all file types.
[2024-05-21 01:13:08.491] [error] CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x2b32c25159b7 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/li
frame #1: c10::detail::torchCheckFail(char const, char const, unsigned int, std::string const&) + 0x64 (0x2b32bba9a115 in /public/home/software/opt/bio
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const, char const, int, bool) + 0x118 (0x2b32c24df958 in /public/home/software/opt/bio/sof
frame #3: void at::native::gpu_kernel_impl<at::native::FillFunctor >(at::TensorIteratorBase&, at::native::FillFunctor const&) + 0x9
frame #4: void at::native::gpu_kernel<at::native::FillFunctor >(at::TensorIteratorBase&, at::native::FillFunctor const&) + 0x33b (0
frame #5: + 0x9216dd5 (0x2b32c0cf3dd5 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_lib.so)
frame #6: at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar const&) + 0x20 (0x2b32c0cf4f00 in /public/home/software/opt/bio/software/dorado/0
frame #7: + 0x49823a3 (0x2b32bc45f3a3 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_lib.so)
frame #8: + 0xa61c4b3 (0x2b32c20f94b3 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_lib.so)
frame #9: at::_ops::fill_Scalar::call(at::Tensor&, c10::Scalar const&) + 0x12c (0x2b32bcbb092c in /public/home/software/opt/bio/software/dorado/0.6.1/bi
frame #10: at::native::zero(at::Tensor&) + 0xa7 (0x2b32bc45fa67 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_lib.so
frame #11: + 0xa61b80d (0x2b32c20f880d in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_lib.so)
frame #12: at::ops::zero::call(at::Tensor&) + 0x129 (0x2b32bcfed499 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_l
frame #13: at::native::zeros_symint(c10::ArrayRef, c10::optional, c10::optional, c10::optional, c
frame #14: + 0x588d645 (0x2b32bd36a645 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_lib.so)
frame #15: at::_ops::zeros::redispatch(c10::DispatchKeySet, c10::ArrayRef, c10::optional, c10::optional, c10::
frame #16: + 0x56c4835 (0x2b32bd1a1835 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_lib.so)
frame #17: at::_ops::zeros::call(c10::ArrayRef, c10::optional, c10::optional, c10::optional, c10:
frame #18: at::native::cudnn_rnn::copy_weights_to_flat_buf_views(c10::ArrayRef, long, long, long, long, long, long, bool, bool, cudnnDataType
frame #19: at::native::_cudnn_rnn_flatten_weight(c10::ArrayRef, long, long, long, long, long, long, bool, bool) + 0x90 (0x2b32c04c8410 in /pu
frame #20: + 0xa630fe9 (0x2b32c210dfe9 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_lib.so)
frame #21: + 0xa66700f (0x2b32c214400f in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_lib.so)
frame #22: + 0x52c4fc4 (0x2b32bcda1fc4 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_lib.so)
frame #23: at::_ops::_cudnn_rnn_flatten_weight::call(c10::ArrayRef, long, c10::SymInt, long, c10::SymInt, c10::SymInt, long, bool, bool) + 0x
frame #24: + 0x80b3c06 (0x2b32bfb90c06 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_lib.so)
frame #25: torch::nn::detail::RNNImplBase::flatten_parameters() + 0x346 (0x2b32bfb99d26 in /public/home/software/opt/bio/software/do
frame #26: void torch::nn::Module::to_impl<c10::Device&, bool&>(c10::Device&, bool&) + 0xd0 (0x2b32bfac3030 in /public/home/software/opt/bio/software/dor
frame #27: torch::nn::Module::to(c10::Device, bool) + 0x1c (0x2b32bfabc21c in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_to
frame #28: void torch::nn::Module::to_impl<c10::Device&, bool&>(c10::Device&, bool&) + 0xd0 (0x2b32bfac3030 in /public/home/software/opt/bio/software/dor
frame #29: torch::nn::Module::to(c10::Device, bool) + 0x1c (0x2b32bfabc21c in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_to
frame #30: dorado() [0x9bfd9e]
frame #31: dorado() [0x9eacd1]
frame #32: dorado() [0x939815]
frame #33: dorado() [0x85898f]
frame #34: dorado() [0x85873b]
frame #35: pthread_once + 0x50 (0x2b33247e7e70 in /lib64/libpthread.so.0)
frame #36: dorado() [0x858daf]
frame #37: dorado() [0x85b180]
frame #38: + 0x1196e380 (0x2b32c944b380 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_lib.so)
frame #39: + 0x7e25 (0x2b33247e2e25 in /lib64/libpthread.so.0)
frame #40: clone + 0x6d (0x2b3325c8cbad in /lib64/libc.so.6)

tijyojwad commented 5 months ago

Hi @wangguiqian - my guess is you're running an older GPU that's unsupported by dorado. Can you post a screenshot of nvidia-smi?

wangguiqian commented 5 months ago

My CUDA version is 11.4.

wangguiqian commented 5 months ago

gpu01: P100, gpu02: K40m

tijyojwad commented 5 months ago

Unfortunately, neither of those architectures is supported by dorado. Dorado is built for acceleration on Volta and newer GPUs.
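As a rough guide, Volta corresponds to CUDA compute capability 7.0, while the P100 (Pascal) is 6.0 and the K40m (Kepler) is 3.5. If your driver is recent enough to support the compute_cap query field, something like this reports it directly:

# print each GPU's name and compute capability; dorado needs 7.0 (Volta) or newer
nvidia-smi --query-gpu=name,compute_cap --format=csv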

wangguiqian commented 5 months ago

I can only reach the Linux machines through the LSF scheduler, and for security reasons they cannot connect to the network. How can I check the characteristics of my GPU from within the LSF scheduling system?

wangguiqian commented 5 months ago

@tijyojwad thank you very much for your help

tijyojwad commented 5 months ago

You should be able to run nvidia-smi in an LSF-scheduled job to get the same output listing the GPUs and the driver/CUDA version.
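For example, something along these lines (the queue name and GPU resource string are site-specific guesses; your cluster admins will know the exact options):

# submit an interactive job that requests one GPU and prints the GPU inventory
bsub -q gpu -gpu "num=1" -I nvidia-smi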

wangguiqian commented 5 months ago
[screenshot]
wangguiqian commented 5 months ago

I ran nvidia-smi in a GPU LSF-scheduled job; here is the result: [screenshot: nvidia-smi output]

wangguiqian commented 5 months ago
[screenshot]

Thank you very much, @tijyojwad.

tijyojwad commented 5 months ago

Looks like this is a P100 too, which isn't supported.

wangguiqian commented 5 months ago

Is there any other way for me to run basecalling? Is it possible with guppy? My data is direct RNA sequencing generated with the RNA002 chemistry. Thank you so much.