@krpcem what do you see if you replace `-x all` with `-x cuda:0`?
Thanks @iiSeymour, is cuda:0 the first GPU? As you can see below, it runs into a different issue.
```
ubuntu@ip-172-31-15-219:~$ dorado basecaller --emit-fastq dna_r10.4.1_e8.2_260bps_fast@v4.1.0 -x cuda:0 /reads_volume/test
[2023-06-20 11:54:12.647] [info] > Creating basecall pipeline
[2023-06-20 11:54:28.051] [error] CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
Exception raised from gemm<c10::Half> at ../aten/src/ATen/cuda/CUDABlas.cpp:446 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7fc13ae5a6bb in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xbf (0x7fc13ae555ef in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libc10.so)
frame #2: <unknown function> + 0x30aa92b (0x7fc13e6aa92b in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0x30de70b (0x7fc13e6de70b in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cuda.so)
frame #4: at::native::structured_mm_out_cuda::impl(at::Tensor const&, at::Tensor const&, at::Tensor const&) + 0x56 (0x7fc13e6df1d6 in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x2dc04cc (0x7fc13e3c04cc in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x2dc0583 (0x7fc13e3c0583 in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cuda.so)
frame #7: at::_ops::mm::call(at::Tensor const&, at::Tensor const&) + 0xdb (0x7fc1944400ab in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x187f7df (0x7fc19387f7df in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cpu.so)
frame #9: at::native::matmul(at::Tensor const&, at::Tensor const&) + 0x58 (0x7fc19387fd48 in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x29ab4d3 (0x7fc1949ab4d3 in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cpu.so)
frame #11: at::_ops::matmul::call(at::Tensor const&, at::Tensor const&) + 0xdb (0x7fc1945504bb in /home/ubuntu/dorado/dorado/3rdparty/torch-2.0.0-Linux/libtorch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x286cfe (0x5640cac2acfe in dorado)
frame #13: <unknown function> + 0x192721 (0x5640cab36721 in dorado)
frame #14: <unknown function> + 0x204d5f (0x5640caba8d5f in dorado)
frame #15: <unknown function> + 0x2051a9 (0x5640caba91a9 in dorado)
frame #16: <unknown function> + 0x1c7efb (0x5640cab6befb in dorado)
frame #17: <unknown function> + 0x1c802e (0x5640cab6c02e in dorado)
frame #18: <unknown function> + 0x2063f9 (0x5640cabaa3f9 in dorado)
frame #19: <unknown function> + 0x2069c9 (0x5640cabaa9c9 in dorado)
frame #20: <unknown function> + 0x1c7d2b (0x5640cab6bd2b in dorado)
frame #21: <unknown function> + 0x28761f (0x5640cac2b61f in dorado)
frame #22: <unknown function> + 0x280153 (0x5640cac24153 in dorado)
frame #23: <unknown function> + 0x27f8ca (0x5640cac238ca in dorado)
frame #24: <unknown function> + 0x13626d (0x5640caada26d in dorado)
frame #25: <unknown function> + 0x13adb8 (0x5640caadedb8 in dorado)
frame #26: <unknown function> + 0xdfc49 (0x5640caa83c49 in dorado)
frame #27: <unknown function> + 0x29d90 (0x7fc12f829d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #28: __libc_start_main + 0x80 (0x7fc12f829e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #29: <unknown function> + 0xe6a95 (0x5640caa8aa95 in dorado)
```
In the end I returned to guppy. I hope that by the next time I have a dataset to run, Dorado is ready out of the box.
Sorry @krpcem, the issue above is because dorado is picking up a `libcublas.so.11` from somewhere else on your system rather than from the dorado `lib` directory. You can confirm that with `ldd ./dorado-0.3.0/bin/dorado`, and the fix is to put `./dorado-0.3.0/lib` first on your `$LD_LIBRARY_PATH`, as sketched below.
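For example, assuming dorado was unpacked to `./dorado-0.3.0` as above, something along these lines should confirm the mislinking and work around it:

```bash
# Check which libcublas the binary actually resolves; a path outside
# ./dorado-0.3.0/lib means a system copy is being picked up instead.
ldd ./dorado-0.3.0/bin/dorado | grep libcublas

# Put dorado's bundled libraries first so they win the lookup,
# then rerun the basecaller.
export LD_LIBRARY_PATH="$PWD/dorado-0.3.0/lib:$LD_LIBRARY_PATH"
./dorado-0.3.0/bin/dorado basecaller --emit-fastq dna_r10.4.1_e8.2_260bps_fast@v4.1.0 -x cuda:0 /reads_volume/test
```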
We are moving to static linking to avoid this problem very soon.
Hello @iiSeymour, I'm encountering the same problem but in a different situation. Here is the error message from dorado:
my GPU and CUDA info:
Could you offer some advice on why dorado can't find my device? Is it a version incompatibility problem? Thanks.
dorado v0.3.1, same issue: `[error] CUDA device requested but no devices found.`
```
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

$ dorado basecaller --emit-fastq -v dna_r10.4.1_e8.2_400bps_hac@v4.2.0 pod5/ > duplex.bam
[2023-07-07 10:51:55.585] [info] > No duplex pairs file provided, pairing will be performed automatically
[2023-07-07 10:51:57.124] [debug] > Reads to process: 3000000
[2023-07-07 10:52:00.162] [debug] Written 0 records.
[2023-07-07 10:52:00.162] [error] CUDA device requested but no devices found.
$
```
Hello, I am encountering the same issue. I am attempting to use dorado through epi2me and docker. Other epi2me workflows work for me, as does base calling through guppy using CUDA from the command line. I am on a ThinkPad T580 running Ubuntu 20.04. Below are parts of the epi2me log file and a screenshot of docker invoking nvidia-smi. Any help is greatly appreciated. In any case, the workflow would benefit from a command to test and troubleshoot the setup. Thanks a lot!
```
[34/d9c940] Submitted process > getParams
[d8/26b1e8] Submitted process > getVersions
[ea/7aa6db] Submitted process > wf_dorado:dorado (3)
[ec/d81b14] Submitted process > wf_dorado:dorado (1)
ERROR ~ Error executing process > 'wf_dorado:dorado (3)'
Caused by:
Process `wf_dorado:dorado (3)` terminated with an error exit status (1)
Command executed:
set +e
source /opt/nvidia/entrypoint.d/*-gpu-driver-check.sh # runtime driver check msg
set -e
dorado basecaller ${DRD_MODELS_PATH}/dna_r9.4.1_e8_hac@v3.3 . --device cuda:all | samtools view --no-PG -b -o 2.ubam -
Command exit status:
1
Command output:
(empty)
Command error:
[2023-07-08 14:43:32.251] [info] > Creating basecall pipeline
[W CUDAFunctions.cpp:109] Warning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (function operator())
[2023-07-08 14:43:32.258] [error] CUDA device requested but no devices found.
[main_samview] fail to read the header from "-".
Work dir:
/home/pbgl/epi2melabs/instances/wf-basecalling_bc8d8970-d522-48c1-83f4-507f9f3ccab1/work/ea/7aa6dbaac4484b161e5b896a7de3f5
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named .command.sh
-- Check '/home/pbgl/epi2melabs/instances/wf-basecalling_bc8d8970-d522-48c1-83f4-507f9f3ccab1/nextflow.log' file for details
WARN: Killing running tasks (1)
```
@warthmann I've opened your comment as an issue on the wf-basecalling repository: https://github.com/epi2me-labs/wf-basecalling/issues/12
Hi @buhanfeng - it looks like you're running dorado as root but running nvidia-smi as your own user. Are you running dorado within a docker container or something similar? Can you check if nvidia-smi within that same environment is showing the GPUs? See the sketch below.
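For example, a minimal check assuming a docker setup with the NVIDIA container runtime enabled (`<image>` is a placeholder for whatever image the workflow runs dorado in):

```bash
# Run nvidia-smi inside the same container environment dorado uses;
# the GPU table should match what the host shows.
docker run --rm --gpus all <image> nvidia-smi
```

If this fails while `nvidia-smi` works on the host, the container runtime, rather than dorado, is what needs fixing.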
I get the same issue. Could you give me some advice?
```
[2024-05-21 01:10:14.073] [info] Running: "basecaller" "../../Data/rna002_70bps_hac@v3/" "../../sx-50h/pass_data/" "--min-qscore" "1" "--resume-from" "in
[2024-05-21 01:10:14.108] [info] > Creating basecall pipeline
[2024-05-21 01:13:05.497] [info] - BAM format does not support U, so RNA output files will include T instead of U for all file types.
[2024-05-21 01:13:08.491] [error] CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x2b32c25159b7 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/li
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x2b32bba9a115 in /public/home/software/opt/bio
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x2b32c24df958 in /public/home/software/opt/bio/sof
frame #3: void at::native::gpu_kernel_impl<at::native::FillFunctor
frame #6: at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar const&) + 0x20 (0x2b32c0cf4f00 in /public/home/software/opt/bio/software/dorado/0
frame #7:
frame #8:
frame #9: at::_ops::fill_Scalar::call(at::Tensor&, c10::Scalar const&) + 0x12c (0x2b32bcbb092c in /public/home/software/opt/bio/software/dorado/0.6.1/bi
frame #10: at::native::zero(at::Tensor&) + 0xa7 (0x2b32bc45fa67 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_lib.so
frame #11:
frame #12: at::_ops::zero_::call(at::Tensor&) + 0x129 (0x2b32bcfed499 in /public/home/software/opt/bio/software/dorado/0.6.1/bin/../lib/libdorado_torch_l
frame #13: at::native::zeros_symint(c10::ArrayRef
frame #15: at::_ops::zeros::redispatch(c10::DispatchKeySet, c10::ArrayRef
frame #17: at::_ops::zeros::call(c10::ArrayRef
frame #21:
frame #22:
frame #23: at::_ops::_cudnn_rnn_flatten_weight::call(c10::ArrayRef
frame #25: torch::nn::detail::RNNImplBase
frame #31: dorado() [0x9eacd1]
frame #32: dorado() [0x939815]
frame #33: dorado() [0x85898f]
frame #34: dorado() [0x85873b]
frame #35: pthread_once + 0x50 (0x2b33247e7e70 in /lib64/libpthread.so.0)
frame #36: dorado() [0x858daf]
frame #37: dorado() [0x85b180]
frame #38:
frame #39:
frame #40: clone + 0x6d (0x2b3325c8cbad in /lib64/libc.so.6)
```
Hi @wangguiqian - my guess is you're running an older GPU that's unsupported by dorado. Can you post a screenshot of nvidia-smi?
My GPU setup: CUDA 11.4; gpu01 has a P100, gpu02 a K40m.
Unfortunately, neither of those architectures is supported by dorado. Dorado is built for acceleration on Volta and newer GPUs.
I can only use the LSF server to schedule jobs on the Linux system, and it cannot connect to the network for security reasons. How can I check the characteristics of my GPU in the LSF scheduling system?
@tijyojwad thank you very much for your help
You should be able to run nvidia-smi in an LSF-scheduled job to get the same output that lists GPUs and driver/CUDA versions; see the sketch below.
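For example (a minimal sketch; the exact bsub GPU options depend on how your cluster is configured, so treat `-gpu "num=1"` as an assumption):

```bash
# Submit an interactive job that requests one GPU and prints the GPU table,
# including the driver and CUDA versions in the header.
bsub -I -gpu "num=1" nvidia-smi

# On reasonably recent drivers, this prints the compute capability directly;
# dorado needs 7.0 (Volta) or newer.
bsub -I -gpu "num=1" nvidia-smi --query-gpu=name,compute_cap --format=csv
```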
I ran nvidia-smi in a GPU LSF-scheduled job; the result:
Thank you very much @tijyojwad
Looks like this is a P100 too, which isn't supported.
Is there any other way for me to do basecalling? Is it possible with guppy? My data is direct RNA sequencing using the RNA002 chip. Thank you so much.
Hello,
I downloaded dorado today from GitHub and installed it using the instructions there. I was able to download a model using dorado. When I attempted base calling, however, CUDA devices could not be found. They are found, however, using pytorch.
When I attempt to run base calling it ends quickly:

```
dorado basecaller --emit-fastq dna_r10.4.1_e8.2_260bps_fast@v4.1.0 -x all /reads_volume/test
[05:02:31.076] [info] > Creating basecall pipeline
[05:02:31.078] [error] CUDA device requested but no devices found.
```
I am able to see the GPU using python's torch module
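For reference, a torch-side check along these lines (a minimal sketch; this one-liner is my phrasing, not necessarily the exact command the poster ran):

```bash
# Ask pytorch whether it can see a CUDA device and name the first one.
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
```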
System details:

```
dorado Version: 0.3.0+e2ba869
Build cuda_11.5.r11.5/compiler.30672275_0
AWS: ubuntu-pro-server/images/hvm-ssd/ubuntu-jammy-22.04-amd64-pro-server-20230531
lspci: 00:1e.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
```
Thank you for any help you can provide.