nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

CUDA device requested but no devices found #251

Closed krpcem closed 1 year ago

krpcem commented 1 year ago

Hello,

I downloaded dorado today from GitHub and installed it using the instructions there. I was able to download a model with dorado. When I attempted basecalling, however, no CUDA devices could be found. They are found using PyTorch.

When I attempt to run basecalling, it exits almost immediately:

dorado basecaller --emit-fastq dna_r10.4.1_e8.2_260bps_fast@v4.1.0 -x all /reads_volume/test
[05:02:31.076] [info] > Creating basecall pipeline
[05:02:31.078] [error] CUDA device requested but no devices found.

I am able to see the GPU using Python's torch module:

>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> print(torch.version.cuda)
11.7

System details:

dorado version: 0.3.0+e2ba869
Build: cuda_11.5.r11.5/compiler.30672275_0
AWS image: ubuntu-pro-server/images/hvm-ssd/ubuntu-jammy-22.04-amd64-pro-server-20230531
lspci: 00:1e.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)

Thank you for any help you can provide.

iiSeymour commented 1 year ago

@krpcem what do you see if you replace -x all with -x cuda:0?

krpcem commented 1 year ago

Thanks @iiSeymour. Is cuda:0 the first GPU? As you can see below, it fails with a different issue.

ubuntu@ip-172-31-15-219:~$ dorado basecaller --emit-fastq  dna_r10.4.1_e8.2_260bps_fast@v4.1.0 -x cuda:0  /reads_volume/test
[2023-06-20 11:54:12.647] [info] > Creating basecall pipeline
[2023-06-20 11:54:28.051] [error] CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
Exception raised from gemm<c10::Half> at ../aten/src/ATen/cuda/CUDABlas.cpp:446 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7fc13ae5a6bb in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xbf (0x7fc13ae555ef in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libc10.so)
frame #2: <unknown function> + 0x30aa92b (0x7fc13e6aa92b in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0x30de70b (0x7fc13e6de70b in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cuda.so)
frame #4: at::native::structured_mm_out_cuda::impl(at::Tensor const&, at::Tensor const&, at::Tensor const&) + 0x56 (0x7fc13e6df1d6 in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x2dc04cc (0x7fc13e3c04cc in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x2dc0583 (0x7fc13e3c0583 in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cuda.so)
frame #7: at::_ops::mm::call(at::Tensor const&, at::Tensor const&) + 0xdb (0x7fc1944400ab in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x187f7df (0x7fc19387f7df in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cpu.so)
frame #9: at::native::matmul(at::Tensor const&, at::Tensor const&) + 0x58 (0x7fc19387fd48 in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x29ab4d3 (0x7fc1949ab4d3 in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cpu.so)
frame #11: at::_ops::matmul::call(at::Tensor const&, at::Tensor const&) + 0xdb (0x7fc1945504bb in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x286cfe (0x5640cac2acfe in dorado)
frame #13: <unknown function> + 0x192721 (0x5640cab36721 in dorado)
frame #14: <unknown function> + 0x204d5f (0x5640caba8d5f in dorado)
frame #15: <unknown function> + 0x2051a9 (0x5640caba91a9 in dorado)
frame #16: <unknown function> + 0x1c7efb (0x5640cab6befb in dorado)
frame #17: <unknown function> + 0x1c802e (0x5640cab6c02e in dorado)
frame #18: <unknown function> + 0x2063f9 (0x5640cabaa3f9 in dorado)
frame #19: <unknown function> + 0x2069c9 (0x5640cabaa9c9 in dorado)
frame #20: <unknown function> + 0x1c7d2b (0x5640cab6bd2b in dorado)
frame #21: <unknown function> + 0x28761f (0x5640cac2b61f in dorado)
frame #22: <unknown function> + 0x280153 (0x5640cac24153 in dorado)
frame #23: <unknown function> + 0x27f8ca (0x5640cac238ca in dorado)
frame #24: <unknown function> + 0x13626d (0x5640caada26d in dorado)
frame #25: <unknown function> + 0x13adb8 (0x5640caadedb8 in dorado)
frame #26: <unknown function> + 0xdfc49 (0x5640caa83c49 in dorado)
frame #27: <unknown function> + 0x29d90 (0x7fc12f829d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #28: __libc_start_main + 0x80 (0x7fc12f829e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #29: <unknown function> + 0xe6a95 (0x5640caa8aa95 in dorado)
krpcem commented 1 year ago

In the end I returned to guppy. I hope that by the next time I have a dataset to run, Dorado is ready out of the box.

iiSeymour commented 1 year ago

Sorry @krpcem, the issue above is because dorado is picking up a libcublas.so.11 from somewhere else on your system and not the dorado lib directory. You can confirm that with ldd ./dorado-0.3.0/bin/dorado and the fix is to put ./dorado-0.3.0/lib first on your $LD_LIBRARY_PATH.

We are moving to static linking to avoid this problem very soon.
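For anyone hitting the same symptom, a check along these lines should confirm which libcublas is being picked up (paths assume the release tarball was extracted to ./dorado-0.3.0; adjust to your install location):

# show which shared libraries the dorado binary resolves;
# libcublas.so.11 should point into the dorado lib directory
ldd ./dorado-0.3.0/bin/dorado | grep cublas

# put dorado's bundled libraries first on the search path, then retry the basecaller
export LD_LIBRARY_PATH="$PWD/dorado-0.3.0/lib:$LD_LIBRARY_PATH"
./dorado-0.3.0/bin/dorado basecaller --emit-fastq dna_r10.4.1_e8.2_260bps_fast@v4.1.0 -x cuda:0 /reads_volume/test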

buhanfeng commented 1 year ago

Hello @iiSeymour, I have encountered the same problem but in a different situation. Here is the error message from dorado:

[screenshot: dorado error message]

My GPU and CUDA info:

[screenshots: GPU and CUDA version information]

Could you offer me some advice on why dorado can't find my device? Could this be a version incompatibility problem? Thanks.

SwapnilDoijad commented 1 year ago

dorado v0.3.1, same issue: [error] CUDA device requested but no devices found.

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

$ dorado basecaller --emit-fastq -v dna_r10.4.1_e8.2_400bps_hac@v4.2.0 pod5/ > duplex.bam
[2023-07-07 10:51:55.585] [info] > No duplex pairs file provided, pairing will be performed automatically
[2023-07-07 10:51:57.124] [debug] > Reads to process: 3000000
[2023-07-07 10:52:00.162] [debug] Written 0 records.
[2023-07-07 10:52:00.162] [error] CUDA device requested but no devices found.
$

warthmann commented 1 year ago

Hello, I am encountering the same issue. I am attempting to use dorado through EPI2ME and Docker. Other EPI2ME workflows work for me, as does basecalling with guppy using CUDA from the command line. I am on a ThinkPad T580 running Ubuntu 20.04. Below are parts of the EPI2ME log file and a screenshot of Docker invoking nvidia-smi. Any help is greatly appreciated. In any case, the workflow would benefit from a command to test and troubleshoot the setup. Thanks a lot!

This is epi2me-labs/wf-basecalling v0.7.2.

[34/d9c940] Submitted process > getParams
[d8/26b1e8] Submitted process > getVersions
[ea/7aa6db] Submitted process > wf_dorado:dorado (3)
[ec/d81b14] Submitted process > wf_dorado:dorado (1)
ERROR ~ Error executing process > 'wf_dorado:dorado (3)'
Caused by:
  Process wf_dorado:dorado (3) terminated with an error exit status (1)
Command executed:
  set +e
  source /opt/nvidia/entrypoint.d/*-gpu-driver-check.sh # runtime driver check msg
  set -e
  dorado basecaller ${DRD_MODELS_PATH}/dna_r9.4.1_e8_hac@v3.3 . --device cuda:all | samtools view --no-PG -b -o 2.ubam -
Command exit status:
  1
Command output:
  (empty)
Command error:
  [2023-07-08 14:43:32.251] [info] > Creating basecall pipeline
  [W CUDAFunctions.cpp:109] Warning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (function operator())
  [2023-07-08 14:43:32.258] [error] CUDA device requested but no devices found.
  [main_samview] fail to read the header from "-".
Work dir:
  /home/pbgl/epi2melabs/instances/wf-basecalling_bc8d8970-d522-48c1-83f4-507f9f3ccab1/work/ea/7aa6dbaac4484b161e5b896a7de3f5
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named .command.sh
-- Check '/home/pbgl/epi2melabs/instances/wf-basecalling_bc8d8970-d522-48c1-83f4-507f9f3ccab1/nextflow.log' file for details
WARN: Killing running tasks (1)

Screenshot from 2023-07-08 16-53-02

SamStudio8 commented 1 year ago

@warthmann I've opened your comment as an issue on the wf-basecalling repository: https://github.com/epi2me-labs/wf-basecalling/issues/12

tijyojwad commented 1 year ago

Hi @buhanfeng - it looks like you're running dorado as root but running nvidia-smi as your own user. Are you running dorado in a Docker container or something similar? Can you check whether nvidia-smi within that same environment shows the GPUs?
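For reference, a quick way to verify GPU visibility from inside a container is something like the following (a sketch assuming the NVIDIA Container Toolkit is installed; the CUDA image tag is just an example):

# should print the same GPU table as on the host;
# if this fails, the container runtime setup is the problem rather than dorado
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi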

wangguiqian commented 5 months ago

I get the same issue. Could you give me some advice?

[2024-05-21 01:10:14.073] [info] Running: "basecaller" "../../Data/rna002_70bps_hac@v3/" "../../sx-50h/pass_data/" "--min-qscore" "1" "--resume-from" "in
[2024-05-21 01:10:14.108] [info] > Creating basecall pipeline
[2024-05-21 01:13:05.497] [info] - BAM format does not support U, so RNA output files will include T instead of U for all file types.
[2024-05-21 01:13:08.491] [error] CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x2b32c25159b7 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/li
frame #1: c10::detail::torchCheckFail(char const, char const, unsigned int, std::string const&) + 0x64 (0x2b32bba9a115 in /public/home/software/opt/bio
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const, char const, int, bool) + 0x118 (0x2b32c24df958 in /public/home/software/opt/bio/sof
frame #3: void at::native::gpu_kernel_impl<at::native::FillFunctor >(at::TensorIteratorBase&, at::native::FillFunctor const&) + 0x9
frame #4: void at::native::gpu_kernel<at::native::FillFunctor >(at::TensorIteratorBase&, at::native::FillFunctor const&) + 0x33b (0
frame #5: + 0x9216dd5 (0x2b32c0cf3dd5 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_lib.so)
frame #6: at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar const&) + 0x20 (0x2b32c0cf4f00 in /public/home/software/opt/bio/software/dorado/0
frame #7: + 0x49823a3 (0x2b32bc45f3a3 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_lib.so)
frame #8: + 0xa61c4b3 (0x2b32c20f94b3 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_lib.so)
frame #9: at::_ops::fill_Scalar::call(at::Tensor&, c10::Scalar const&) + 0x12c (0x2b32bcbb092c in /public/home/software/opt/bio/software/dorado/0.6.1/bi
frame #10: at::native::zero(at::Tensor&) + 0xa7 (0x2b32bc45fa67 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_lib.so
frame #11: + 0xa61b80d (0x2b32c20f880d in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_lib.so)
frame #12: at::ops::zero::call(at::Tensor&) + 0x129 (0x2b32bcfed499 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_l
frame #13: at::native::zeros_symint(c10::ArrayRef, c10::optional, c10::optional, c10::optional, c
frame #14: + 0x588d645 (0x2b32bd36a645 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_lib.so)
frame #15: at::_ops::zeros::redispatch(c10::DispatchKeySet, c10::ArrayRef, c10::optional, c10::optional, c10::
frame #16: + 0x56c4835 (0x2b32bd1a1835 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_lib.so)
frame #17: at::_ops::zeros::call(c10::ArrayRef, c10::optional, c10::optional, c10::optional, c10:
frame #18: at::native::cudnn_rnn::copy_weights_to_flat_buf_views(c10::ArrayRef, long, long, long, long, long, long, bool, bool, cudnnDataType
frame #19: at::native::_cudnn_rnn_flatten_weight(c10::ArrayRef, long, long, long, long, long, long, bool, bool) + 0x90 (0x2b32c04c8410 in /pu
frame #20: + 0xa630fe9 (0x2b32c210dfe9 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_lib.so)
frame #21: + 0xa66700f (0x2b32c214400f in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_lib.so)
frame #22: + 0x52c4fc4 (0x2b32bcda1fc4 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_lib.so)
frame #23: at::_ops::_cudnn_rnn_flatten_weight::call(c10::ArrayRef, long, c10::SymInt, long, c10::SymInt, c10::SymInt, long, bool, bool) + 0x
frame #24: + 0x80b3c06 (0x2b32bfb90c06 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_lib.so)
frame #25: torch::nn::detail::RNNImplBase::flatten_parameters() + 0x346 (0x2b32bfb99d26 in /public/home/software/opt/bio/software/do
frame #26: void torch::nn::Module::to_impl<c10::Device&, bool&>(c10::Device&, bool&) + 0xd0 (0x2b32bfac3030 in /public/home/software/opt/bio/software/dor
frame #27: torch::nn::Module::to(c10::Device, bool) + 0x1c (0x2b32bfabc21c in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_to
frame #28: void torch::nn::Module::to_impl<c10::Device&, bool&>(c10::Device&, bool&) + 0xd0 (0x2b32bfac3030 in /public/home/software/opt/bio/software/dor
frame #29: torch::nn::Module::to(c10::Device, bool) + 0x1c (0x2b32bfabc21c in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_to
frame #30: dorado() [0x9bfd9e]
frame #31: dorado() [0x9eacd1]
frame #32: dorado() [0x939815]
frame #33: dorado() [0x85898f]
frame #34: dorado() [0x85873b]
frame #35: pthread_once + 0x50 (0x2b33247e7e70 in /lib64/libpthread.so.0)
frame #36: dorado() [0x858daf]
frame #37: dorado() [0x85b180]
frame #38: + 0x1196e380 (0x2b32c944b380 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_lib.so)
frame #39: + 0x7e25 (0x2b33247e2e25 in /lib64/libpthread.so.0)
frame #40: clone + 0x6d (0x2b3325c8cbad in /lib64/libc.so.6)

tijyojwad commented 5 months ago

Hi @wangguiqian - my guess is you're running an older GPU that's unsupported by dorado. Can you post a screenshot of nvidia-smi?

wangguiqian commented 5 months ago

My CUDA version is 11.4.

wangguiqian commented 5 months ago

gpu01: P100, gpu02: K40m

tijyojwad commented 5 months ago

Unfortunately, neither of those architectures is supported by dorado. Dorado is built for acceleration on Volta and newer GPUs.
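As a rough guide, Volta corresponds to CUDA compute capability 7.0, while the P100 (Pascal) is 6.0 and the K40m (Kepler) is 3.5. If your driver is recent enough to support the compute_cap query field, something like this reports it directly:

# print each GPU's name and compute capability; dorado needs 7.0 (Volta) or newer
nvidia-smi --query-gpu=name,compute_cap --format=csv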

wangguiqian commented 5 months ago

I can only reach the Linux machines through the LSF scheduler, and for security reasons they cannot connect to the network. How can I check the characteristics of my GPU from within the LSF scheduling system?

wangguiqian commented 5 months ago

@tijyojwad thank you very much for your help

tijyojwad commented 5 months ago

You should be able to run nvidia-smi in an LSF-scheduled job to get the same output listing the GPUs and the driver/CUDA version.
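For example, something along these lines (the queue name and GPU resource string are site-specific guesses; your cluster admins will know the exact options):

# submit an interactive job that requests one GPU and prints the GPU inventory
bsub -q gpu -gpu "num=1" -I nvidia-smi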

wangguiqian commented 5 months ago
[screenshot]
wangguiqian commented 5 months ago

I ran nvidia-smi in a GPU LSF-scheduled job; here is the result: [screenshot: nvidia-smi output]

wangguiqian commented 5 months ago
[screenshot]

Thank you very much, @tijyojwad.

tijyojwad commented 5 months ago

Looks like this is a P100 too, which isn't supported.

wangguiqian commented 5 months ago

Is there any other way for me to run basecalling? Is it possible with guppy? My data is direct RNA sequencing generated with the RNA002 chemistry. Thank you so much.