nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

cuDNN error: CUDNN_STATUS_NOT_INITIALIZED when I run methylation calling #1129

Closed ymb943 closed 1 week ago

ymb943 commented 1 week ago

Issue Report

Please describe the issue:


Steps to reproduce the issue:

I used the following command for methylation calling:

/BiO/dorado-0.8.3-linux-x64/bin/dorado basecaller /BiO/UIPA_20241112/dna_r10.4.1_e8.2_400bps_sup@v5.0.0 /BiO/T2T_KOREF/20241105/2024-09-26_KOREF1_UL_PAY09654_PAY09817/KOREF1/20240926_1205_P2S-01506-A_PAY09654_90f482e1/pod5 --modified-bases-models /BiO/UIPA_20241113/dna_r10.4.1_e8.2_400bps_sup@v5.0.0_5mC_5hmC@v2.0.1 > ./20240926_1205_P2S-01506-A_PAY09654_90f482e1.5mC_5hmC.bam

I got a CUDNN_STATUS_NOT_INITIALIZED error, and basecalling did not start. We have confirmed that cuDNN is installed.

Run environment:

Logs

[2024-11-13 18:37:38.877] [info] Running: "basecaller" "/BiO/UIPA_20241112/dna_r10.4.1_e8.2_400bps_sup@v5.0.0" "/BiO/T2T_KOREF/20241105/2024-09-26_KOREF1_UL_PAY09654_PAY09817/KOREF1/20240926_1205_P2S-01506-A_PAY09654_90f482e1/pod5" "--modified-bases-models" "/BiO/UIPA_20241113/dna_r10.4.1_e8.2_400bps_sup@v5.0.0_5mC_5hmC@v2.0.1"
[2024-11-13 18:37:40.493] [info] > Creating basecall pipeline
[2024-11-13 18:38:25.891] [info] Calculating optimized batch size for GPU "NVIDIA A100-SXM4-40GB" and model /BiO/UIPA_20241112/dna_r10.4.1_e8.2_400bps_sup@v5.0.0. Full benchmarking will run for this device, which may take some time.
[2024-11-13 18:38:25.891] [info] Calculating optimized batch size for GPU "NVIDIA A100-SXM4-40GB" and model /BiO/UIPA_20241112/dna_r10.4.1_e8.2_400bps_sup@v5.0.0. Full benchmarking will run for this device, which may take some time.
[2024-11-13 18:38:25.891] [info] Calculating optimized batch size for GPU "NVIDIA A100-SXM4-40GB" and model /BiO/UIPA_20241112/dna_r10.4.1_e8.2_400bps_sup@v5.0.0. Full benchmarking will run for this device, which may take some time.
[2024-11-13 18:38:25.891] [info] Calculating optimized batch size for GPU "NVIDIA A100-SXM4-40GB" and model /BiO/UIPA_20241112/dna_r10.4.1_e8.2_400bps_sup@v5.0.0. Full benchmarking will run for this device, which may take some time.
[2024-11-13 18:38:25.891] [info] Calculating optimized batch size for GPU "NVIDIA A100-SXM4-40GB" and model /BiO/UIPA_20241112/dna_r10.4.1_e8.2_400bps_sup@v5.0.0. Full benchmarking will run for this device, which may take some time.
[2024-11-13 18:38:25.891] [info] Calculating optimized batch size for GPU "NVIDIA A100-SXM4-40GB" and model /BiO/UIPA_20241112/dna_r10.4.1_e8.2_400bps_sup@v5.0.0. Full benchmarking will run for this device, which may take some time.
[2024-11-13 18:38:25.891] [info] Calculating optimized batch size for GPU "NVIDIA A100-SXM4-40GB" and model /BiO/UIPA_20241112/dna_r10.4.1_e8.2_400bps_sup@v5.0.0. Full benchmarking will run for this device, which may take some time.
[2024-11-13 18:38:25.891] [info] Calculating optimized batch size for GPU "NVIDIA A100-SXM4-40GB" and model /BiO/UIPA_20241112/dna_r10.4.1_e8.2_400bps_sup@v5.0.0. Full benchmarking will run for this device, which may take some time.
[2024-11-13 18:38:31.913] [info] cuda:6 using chunk size 12288, batch size 416
[2024-11-13 18:38:31.914] [info] cuda:7 using chunk size 12288, batch size 384
[2024-11-13 18:38:31.914] [info] cuda:5 using chunk size 12288, batch size 384
[2024-11-13 18:38:31.914] [info] cuda:3 using chunk size 12288, batch size 416
[2024-11-13 18:38:32.082] [info] cuda:6 using chunk size 6144, batch size 896
[2024-11-13 18:38:32.121] [info] cuda:3 using chunk size 6144, batch size 864
[2024-11-13 18:38:32.129] [info] cuda:5 using chunk size 6144, batch size 864
[2024-11-13 18:38:32.131] [info] cuda:7 using chunk size 6144, batch size 864
[2024-11-13 18:38:33.521] [info] cuda:1 using chunk size 12288, batch size 384
[2024-11-13 18:38:33.521] [info] cuda:2 using chunk size 12288, batch size 416
[2024-11-13 18:38:33.521] [info] cuda:4 using chunk size 12288, batch size 384
[2024-11-13 18:38:33.521] [info] cuda:0 using chunk size 12288, batch size 416
[2024-11-13 18:38:33.656] [info] cuda:4 using chunk size 6144, batch size 864
[2024-11-13 18:38:33.675] [info] cuda:2 using chunk size 6144, batch size 768
[2024-11-13 18:38:33.676] [info] cuda:1 using chunk size 6144, batch size 864
[2024-11-13 18:38:33.678] [info] cuda:0 using chunk size 6144, batch size 864
terminate called after throwing an instance of 'c10::CuDNNError'
  what():  cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
Exception raised from createCuDNNHandle at /pytorch/pyold/aten/src/ATen/cudnn/Handle.cpp:9 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f71bd81a9b7 in /BiO/dorado-0.8.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: <unknown function> + 0x3f285b4 (0x7f71b6d0a5b4 in /BiO/dorado-0.8.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: at::native::getCudnnHandle() + 0x725 (0x7f71bb8061c5 in /BiO/dorado-0.8.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: <unknown function> + 0x89bf0f6 (0x7f71bb7a10f6 in /BiO/dorado-0.8.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: <unknown function> + 0x89c00db (0x7f71bb7a20db in /BiO/dorado-0.8.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: <unknown function> + 0x89a54ca (0x7f71bb7874ca in /BiO/dorado-0.8.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #6: at::native::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) + 0x96 (0x7f71bb787b16 in /BiO/dorado-0.8.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: <unknown function> + 0xa632127 (0x7f71bd414127 in /BiO/dorado-0.8.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #8: <unknown function> + 0xa6321e0 (0x7f71bd4141e0 in /BiO/dorado-0.8.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #9: at::_ops::cudnn_convolution::call(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) + 0x23d (0x7f71b8309e9d in /BiO/dorado-0.8.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #10: at::native::_convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long, bool, bool, bool, bool) + 0x1505 (0x7f71b76c8cf5 in /BiO/dorado-0.8.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #11: <unknown function> + 0x58a7496 (0x7f71b8689496 in /BiO/dorado-0.8.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #12: <unknown function> + 0x58a7517 (0x7f71b8689517 in /BiO/dorado-0.8.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #13: at::_ops::_convolution::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, bool, c10::ArrayRef<c10::SymInt>, long, bool, bool, bool, bool) + 0x29b (0x7f71b7eac0fb in /BiO/dorado-0.8.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #14: at::native::convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long) + 0x21d (0x7f71b76bcd3d in /BiO/dorado-0.8.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #15: <unknown function> + 0x58a6f55 (0x7f71b8688f55 in /BiO/dorado-0.8.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #16: <unknown function> + 0x58a6fbf (0x7f71b8688fbf in /BiO/dorado-0.8.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #17: at::_ops::convolution::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, bool, c10::ArrayRef<c10::SymInt>, long) + 0x223 (0x7f71b7eab443 in /BiO/dorado-0.8.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #18: at::native::conv1d(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long) + 0x1c5 (0x7f71b76bff35 in /BiO/dorado-0.8.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #19: <unknown function> + 0x5a57b31 (0x7f71b8839b31 in /BiO/dorado-0.8.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #20: at::_ops::conv1d::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long) + 0x20c (0x7f71b830798c in /BiO/dorado-0.8.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #21: torch::nn::Conv1dImpl::forward(at::Tensor const&) + 0x3a0 (0x7f71bae13a40 in /BiO/dorado-0.8.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #22: /BiO/dorado-0.8.3-linux-x64/bin/dorado() [0xacf7e8]
frame #23: /BiO/dorado-0.8.3-linux-x64/bin/dorado() [0xad4388]
frame #24: /BiO/dorado-0.8.3-linux-x64/bin/dorado() [0xac0360]
frame #25: /BiO/dorado-0.8.3-linux-x64/bin/dorado() [0xac04c8]
frame #26: /BiO/dorado-0.8.3-linux-x64/bin/dorado() [0xabc953]
frame #27: <unknown function> + 0x1196e380 (0x7f71c4750380 in /BiO/dorado-0.8.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #28: <unknown function> + 0x8609 (0x7f71b2218609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #29: clone + 0x43 (0x7f71b1dd5163 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

HalfPhoton commented 1 week ago

Hi @ymb943, can you try manually setting --batchsize 384? See also troubleshooting CUDA OOM.
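
For reference, a minimal sketch of what that suggestion might look like when applied to the command from the original report. The paths are copied from the report. The variable names and the RUN switch are illustrative assumptions, not part of dorado. The value 384 matches the smallest batch size that the auto-tuner selected at chunk size 12288 in the log above. Whether it resolves the error on this machine is not confirmed here.

```shell
#!/usr/bin/env bash
# Paths copied from the original report; adjust them to your installation.
DORADO=/BiO/dorado-0.8.3-linux-x64/bin/dorado
MODEL=/BiO/UIPA_20241112/dna_r10.4.1_e8.2_400bps_sup@v5.0.0
MODS=/BiO/UIPA_20241113/dna_r10.4.1_e8.2_400bps_sup@v5.0.0_5mC_5hmC@v2.0.1
POD5=/BiO/T2T_KOREF/20241105/2024-09-26_KOREF1_UL_PAY09654_PAY09817/KOREF1/20240926_1205_P2S-01506-A_PAY09654_90f482e1/pod5
OUT=./20240926_1205_P2S-01506-A_PAY09654_90f482e1.5mC_5hmC.bam

# Fixed batch size, as suggested in the reply, instead of auto-tuning.
BATCHSIZE=384

# Same command as in the report, with --batchsize appended.
cmd=("$DORADO" basecaller "$MODEL" "$POD5"
     --modified-bases-models "$MODS"
     --batchsize "$BATCHSIZE")

# Print the command for review. RUN is a hypothetical switch for this
# sketch: set RUN=1 to actually launch basecalling.
printf '%q ' "${cmd[@]}"
printf '> %q\n' "$OUT"

if [ "${RUN:-0}" = "1" ]; then
  "${cmd[@]}" > "$OUT"
fi
```

If the error persists at 384, lowering BATCHSIZE further is a reasonable next step, since a smaller batch reduces the GPU memory requested per device.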

Best regards, Rich