
dorado duplex: CUDA error `c10_cuda_check_implementation` #850

Open caspargross opened 1 month ago

caspargross commented 1 month ago

Issue Report

Please describe the issue:

I am trying to run dorado duplex with the recommended approach of splitting the pod5 into individual channels. The error occurs while running dorado duplex on a single-channel .pod5. I am using a 4 kHz HAC model and also want to call modifications.

All steps are run as part of a Snakemake pipeline. The first job runs fine and produces a BAM file, but it reports an error in the logfile and triggers soft lockup kernel errors. All subsequent jobs, using the same run command and profiles, stop with a CUDA error. The second CUDA error seems similar to the one reported by other people here (#849, #842). I don't know whether the first and second errors are related.

Steps to reproduce the issue:

Run multiple dorado duplex basecalling jobs on per-channel .pod5 files (see the sketch below).
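
For reference, the per-channel split recommended in the dorado README looks roughly like this (assuming the pod5 package CLI is installed; file and directory names are illustrative):

pod5 view reads.pod5 --include "read_id, channel" --output summary.tsv
pod5 subset reads.pod5 --summary summary.tsv --columns channel --output split_by_channel
# then run one dorado duplex job per file in split_by_channel/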

Run environment:

Logs

The first job in the pipeline reports the following error but still produces a BAM file:

[2024-05-29 09:20:36.874] [info] > Starting Stereo Duplex pipeline
[2024-05-29 09:20:36.877] [debug] BasecallerNode chunk size 10000
[2024-05-29 09:20:36.895] [debug] Using dorado duplex channel count of 10
[2024-05-29 09:20:36.910] [debug] BasecallerNode chunk size 10000
[2024-05-29 09:20:36.921] [info] > Reading read channel info
[2024-05-29 09:20:36.922] [info] > Processed read channel info
[2024-05-29 09:20:36.924] [debug] Sort channel 2216
[2024-05-29 09:20:36.924] [debug] Sorted channel 2216
[2024-05-29 09:20:40.262] [error] basic_string::substr: __pos (which is 21065) > this->size() (which is 1604)
[2024-05-29 09:20:40.262] [error] basic_string::substr: __pos (which is 11643) > this->size() (which is 9568)
[2024-05-29 09:20:40.347] [info] > Simplex reads basecalled: 328
[2024-05-29 09:20:40.347] [info] > Simplex reads filtered: 57
[2024-05-29 09:20:40.347] [info] > Duplex reads basecalled: 35
[2024-05-29 09:20:40.347] [info] > Duplex rate: 21.222548%
[2024-05-29 09:20:40.347] [info] > Basecalled @ Bases/s: 1.039055e+05

Along with the following kernel errors:

Message from syslogd@SRV010 at May 29 09:21:21 ...
kernel:[167914.357717] watchdog: BUG: soft lockup - CPU#117 stuck for 22s! [dorado:2186819]

Message from syslogd@SRV010 at May 29 09:21:49 ...
kernel:[167942.357288] watchdog: BUG: soft lockup - CPU#117 stuck for 22s! [dorado:2186819]

Message from syslogd@SRV010 at May 29 09:22:13 ...
kernel:[167966.352921] watchdog: BUG: soft lockup - CPU#115 stuck for 22s! [dorado:2186819]

All downstream jobs then fail with the following errors:

[2024-05-29 09:25:43.999] [debug] cuda:1 Decode memory 5.35GB
[2024-05-29 09:25:45.392] [error] finalise() not called on a HtsFile.
[2024-05-29 09:25:45.399] [error] CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8d113db9b7 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f8d0a960115 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f8d113a5958 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: <unknown function> + 0x897dd13 (0x7f8d0f320d13 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: <unknown function> + 0x4530bc1 (0x7f8d0aed3bc1 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, c10::optional<c10::MemoryFormat>) + 0x14 (0x7f8d0aecd604 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #6: at::detail::empty_cpu(c10::ArrayRef<long>, c10::ScalarType, bool, c10::optional<c10::MemoryFormat>) + 0x40 (0x7f8d0aecd650 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: at::detail::empty_cpu(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x34 (0x7f8d0aecd6a4 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #8: at::native::empty_cpu(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x1f (0x7f8d0b5c8e0f in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #9: <unknown function> + 0x5703e31 (0x7f8d0c0a6e31 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #10: <unknown function> + 0x5703e8b (0x7f8d0c0a6e8b in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #11: at::_ops::empty_memory_format::redispatch(c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0xe7 (0x7f8d0bd396e7 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #12: <unknown function> + 0x56c718f (0x7f8d0c06a18f in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #13: at::_ops::empty_memory_format::call(c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x1b2 (0x7f8d0bd79922 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #14: bin/dorado-0.7.0-linux-x64/bin/dorado() [0x8ab83f]
frame #15: bin/dorado-0.7.0-linux-x64/bin/dorado() [0xa2ec3f]
frame #16: bin/dorado-0.7.0-linux-x64/bin/dorado() [0xa2d4a6]
frame #17: bin/dorado-0.7.0-linux-x64/bin/dorado() [0x86bdfd]
frame #18: bin/dorado-0.7.0-linux-x64/bin/dorado() [0x829ee3]
frame #19: bin/dorado-0.7.0-linux-x64/bin/dorado() [0x4ccd9d]
frame #20: __libc_start_main + 0xf3 (0x7f8d0589b083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: bin/dorado-0.7.0-linux-x64/bin/dorado() [0x7f7b57]
HalfPhoton commented 1 month ago

Hi @caspargross,

Your duplex command is malformed:

dorado duplex ... --modified-bases-models <mods_model> <simplex_model> <data>

This arrangement results in the simplex model being consumed by the --modified-bases-models argument, which accepts multiple inputs. To resolve this, place the simplex model and data before the optional arguments:

dorado duplex <simplex_model> <data>  ... --modified-bases-models <mods_model>
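
For example, with illustrative names (the mods model here is an assumption, not taken from your command):

dorado duplex dna_r10.4.1_e8.2_400bps_hac@v4.1.0 channel-65.pod5 --modified-bases-models dna_r10.4.1_e8.2_400bps_hac@v4.1.0_5mCG_5hmCG@v2 > calls.bam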

As for the substr error, we'll take a look at this internally as it has come up in a couple of places.

Kind regards, Rich

caspargross commented 1 month ago

Hi @HalfPhoton ,

Thanks for your help. I changed the command as you suggested, but it did not solve the issue. However, I found that basecalling works fine using only one GPU. It seems there is a problem with CUDA resource allocation or deallocation when running duplex basecalling on multiple devices. For me, the issue can be worked around by parallelizing multiple jobs, each using only a single CUDA device, as in the sketch below.

Best regards, Caspar
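
A minimal sketch of that workaround (assuming two GPUs; file names are illustrative and <simplex_model> stands in for the actual model directory):

# One dorado duplex job per GPU, so jobs never share a device.
dorado duplex <simplex_model> channel-1.pod5 --device cuda:0 > channel-1.bam 2> channel-1.log &
dorado duplex <simplex_model> channel-2.pod5 --device cuda:1 > channel-2.bam 2> channel-2.log &
wait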

caspargross commented 3 weeks ago

I am reopening this issue because the problem is still not solved. I tried the following points:

Here is one of the failing pod5 files: https://cloud.imgag.de/s/bgC6CPg7ciTf9df

Logfile:

[2024-06-06 15:49:52.953] [info] Running: "duplex" "bin/dorado_models/dna_r10.4.1_e8.2_400bps_hac@v4.1.0" "/tmp/dorado_duplex.HG001_TUE_01/channel_mod-65.pod5" "--device" "cuda:1" "--min-qscore" "9" "--verbose" "--batchsize" "5000" "--chunksize" "9216"
[2024-06-06 15:49:52.988] [info] > No duplex pairs file provided, pairing will be performed automatically
[2024-06-06 15:49:52.989] [debug] > Reads to process: 255715
[2024-06-06 15:49:53.246] [info] Normalised: chunksize 9216 -> 9215
[2024-06-06 15:49:53.247] [info] Normalised: chunksize 9216 -> 9215
[2024-06-06 15:49:53.433] [debug] cuda:1 memory available: 84.15GB
[2024-06-06 15:49:53.433] [debug] cuda:1 memory limit 74.74GB
[2024-06-06 15:49:53.433] [debug] cuda:1 maximum safe estimated batch size at chunk size 9215 is 11648
[2024-06-06 15:49:53.433] [info] cuda:1 using chunk size 9215, batch size 5056
[2024-06-06 15:49:53.433] [debug] cuda:1 Model memory 21.54GB
[2024-06-06 15:49:53.433] [debug] cuda:1 Decode memory 10.83GB
[2024-06-06 15:49:54.170] [debug] cuda:1 memory available: 83.82GB
[2024-06-06 15:49:54.170] [debug] cuda:1 memory limit 40.91GB
[2024-06-06 15:49:54.170] [debug] cuda:1 maximum safe estimated batch size at chunk size 9215 is 13248
[2024-06-06 15:49:54.170] [info] cuda:1 using chunk size 9215, batch size 5056
[2024-06-06 15:49:54.170] [debug] cuda:1 Model memory 11.93GB
[2024-06-06 15:49:54.170] [debug] cuda:1 Decode memory 3.67GB
[2024-06-06 15:49:54.560] [error] finalise() not called on a HtsFile.
[2024-06-06 15:49:54.561] [error] CUDA error: invalid argument
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe0837aa9b7 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fe07cd2f115 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fe083774958 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: <unknown function> + 0x897dd13 (0x7fe0816efd13 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: <unknown function> + 0x4530bc1 (0x7fe07d2a2bc1 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, c10::optional<c10::MemoryFormat>) + 0x14 (0x7fe07d29c604 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #6: at::detail::empty_cpu(c10::ArrayRef<long>, c10::ScalarType, bool, c10::optional<c10::MemoryFormat>) + 0x40 (0x7fe07d29c650 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: at::detail::empty_cpu(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x34 (0x7fe07d29c6a4 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #8: at::native::empty_cpu(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x1f (0x7fe07d997e0f in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #9: <unknown function> + 0x5703e31 (0x7fe07e475e31 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #10: <unknown function> + 0x5703e8b (0x7fe07e475e8b in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #11: at::_ops::empty_memory_format::redispatch(c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0xe7 (0x7fe07e1086e7 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #12: <unknown function> + 0x56c718f (0x7fe07e43918f in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #13: at::_ops::empty_memory_format::call(c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x1b2 (0x7fe07e148922 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #14: bin/dorado-0.7.1-linux-x64/bin/dorado() [0x8b2b7f]
frame #15: bin/dorado-0.7.1-linux-x64/bin/dorado() [0xa3c03f]
frame #16: bin/dorado-0.7.1-linux-x64/bin/dorado() [0xa3aac6]
frame #17: bin/dorado-0.7.1-linux-x64/bin/dorado() [0x87492d]
frame #18: bin/dorado-0.7.1-linux-x64/bin/dorado() [0x82739a]
frame #19: bin/dorado-0.7.1-linux-x64/bin/dorado() [0x8310ab]
frame #20: <unknown function> + 0x114df (0x7fe0789144df in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #21: bin/dorado-0.7.1-linux-x64/bin/dorado() [0x826569]
frame #22: bin/dorado-0.7.1-linux-x64/bin/dorado() [0x837e00]
frame #23: <unknown function> + 0x1196e380 (0x7fe08a6e0380 in /mnt/storage3/projects/research/2024_lonGER/bin/dorado-0.7.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #24: <unknown function> + 0x8609 (0x7fe07890b609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #25: clone + 0x43 (0x7fe077d65353 in /lib/x86_64-linux-gnu/libc.so.6)

Execution command:

CUDA_LAUNCH_BLOCKING=1  \
bin/dorado-0.7.1-linux-x64/bin/dorado duplex \
bin/dorado_models/dna_r10.4.1_e8.2_400bps_hac@v4.1.0 \
/tmp/dorado_duplex.HG001_TUE_01/channel_mod-65.pod5 \
--device 'cuda:1' \
--min-qscore 9 \
--verbose \
--batchsize 5000 \
--chunksize 9216 \
> /tmp/dorado_duplex.HG001_TUE_01/channel_mod-65.bam  \
2> logs/dorado_duplex.HG001_TUE_01.channel_mod-65.log
caspargross commented 3 weeks ago

Update: it works with --chunksize 5000 and --batchsize 5000. However, I have seen it mentioned somewhere that changing the batch size has an impact on the basecalling results.

I needed to reduce the chunk size further and further to basecall the last channel subsets. For the last file I had to use --chunksize 2000 and --batchsize 5000; GPU memory usage is only at 16% with those settings. A retry loop like the one sketched below automates this.
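
A crude sketch of that trial-and-error (assuming a failed dorado run exits non-zero; the chunk-size ladder and paths are illustrative):

# Retry one channel with progressively smaller chunk sizes until dorado succeeds.
for chunk in 9216 5000 2000; do
    if dorado duplex <simplex_model> channel-65.pod5 \
        --device cuda:1 --batchsize 5000 --chunksize "$chunk" \
        > channel-65.bam 2>> channel-65.log; then
        echo "succeeded with chunksize $chunk" >&2
        break
    fi
done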