nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

CUDA-related segfault when attempting to invoke a second dorado process #653

Closed: sjaenick closed this issue 1 month ago

sjaenick commented 8 months ago

Issue Report

Please describe the issue:

With one dorado instance already running and processing data, I attempted to start a second one, assuming that either the GPU would be shared or that I would receive an error saying the CUDA device was busy. Instead, the second instance segfaulted, while the first one continued to process data.

$ dorado basecaller \
    --emit-fastq \
    --min-qscore 15 \
    --trim primers \
    --sample-sheet sample_sheet_FAX85449_20231220_1156_12b9e8fa.csv \
    sup pod5_pass \
    > basecalls.fastq2
[2024-02-27 12:39:14.796] [info]  - Note: FASTQ output is not recommended as not all data can be preserved.
[2024-02-27 12:39:14.870] [info]  - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0 with httplib
[2024-02-27 12:39:16.808] [info] > Creating basecall pipeline
[2024-02-27 12:39:17.352] [warning] Auto batchsize detection failed. Less than 1GB GPU memory available.
[2024-02-27 12:39:17.355] [info]  - set batch size for cuda:0 to 64
[2024-02-27 12:39:17.453] [warning] Caught Torch error 'CUDA out of memory. Tried to allocate 1.02 GiB (GPU 0; 11.76 GiB total capacity; 89.90 MiB already allocated; 612.12 MiB free; 132.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF', clearing CUDA cache and retrying.
terminate called after throwing an instance of 'c10::OutOfMemoryError'
  what():  CUDA out of memory. Tried to allocate 1.02 GiB (GPU 0; 11.76 GiB total capacity; 91.12 MiB already allocated; 592.12 MiB free; 152.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at /pytorch/pyold/c10/cuda/CUDACachingAllocator.cpp:913 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f15236229b7 in /home/jaes/dorado-0.5.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: <unknown function> + 0xa9f8645 (0x7f15235e2645 in /home/jaes/dorado-0.5.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: <unknown function> + 0xa9f893e (0x7f15235e293e in /home/jaes/dorado-0.5.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: <unknown function> + 0xa9f8cce (0x7f15235e2cce in /home/jaes/dorado-0.5.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: <unknown function> + 0x4530bc1 (0x7f151d11abc1 in /home/jaes/dorado-0.5.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, c10::optional<c10::MemoryFormat>) + 0x14 (0x7f151d114604 in /home/jaes/dorado-0.5.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #6: at::detail::empty_cuda(c10::ArrayRef<long>, c10::ScalarType, c10::optional<c10::Device>, c10::optional<c10::MemoryFormat>) + 0x111 (0x7f152156cf01 in /home/jaes/dorado-0.5.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: at::detail::empty_cuda(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x31 (0x7f152156d1d1 in /home/jaes/dorado-0.5.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #8: at::native::empty_cuda(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x1f (0x7f15216145af in /home/jaes/dorado-0.5.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #9: <unknown function> + 0xa61a339 (0x7f1523204339 in /home/jaes/dorado-0.5.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #10: <unknown function> + 0xa61a41b (0x7f152320441b in /home/jaes/dorado-0.5.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #11: at::_ops::empty_memory_format::redispatch(c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0xe7 (0x7f151df806e7 in /home/jaes/dorado-0.5.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #12: <unknown function> + 0x56c718f (0x7f151e2b118f in /home/jaes/dorado-0.5.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #13: at::_ops::empty_memory_format::call(c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x1b2 (0x7f151dfc0922 in /home/jaes/dorado-0.5.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #14: dorado() [0x9a2a8f]
frame #15: dorado() [0x9bf552]
frame #16: dorado() [0x9c3b91]
frame #17: dorado() [0x9a10c9]
frame #18: dorado() [0x9a11f8]
frame #19: dorado() [0x9a1450]
frame #20: dorado() [0x9a577b]
frame #21: <unknown function> + 0x1196e380 (0x7f152a558380 in /home/jaes/dorado-0.5.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #22: <unknown function> + 0x8609 (0x7f1518783609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #23: clone + 0x43 (0x7f1517e98353 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

Run environment:

malton-ont commented 8 months ago

Hi @sjaenick,

Thanks for your report. The relevant lines here are:

[2024-02-27 12:39:17.352] [warning] Auto batchsize detection failed. Less than 1GB GPU memory available.
[2024-02-27 12:39:17.355] [info]  - set batch size for cuda:0 to 64
[2024-02-27 12:39:17.453] [warning] Caught Torch error 'CUDA out of memory. Tried to allocate 1.02 GiB (GPU 0; 11.76 GiB total capacity; 89.90 MiB already allocated; 612.12 MiB free; 132.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF', clearing CUDA cache and retrying.
terminate called after throwing an instance of 'c10::OutOfMemoryError'

Dorado is telling you that less than 1 GB of GPU memory is available and that it has therefore set the batch size to 64, which is the minimum possible given the granularity of the CUDA kernels. Even that minimum batch still requires more than 1 GB, which is why the run fails. We can certainly improve this by checking that the selected batch size will fit in the available memory, and by emitting a less verbose error message than libtorch does in this instance.
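
To illustrate the kind of pre-flight check described above (a minimal sketch only, not dorado's actual code: estimate_batch_memory and its per-element constant are hypothetical placeholders), one could query free device memory with the CUDA runtime's cudaMemGetInfo and refuse to start if even the minimum batch size of 64 would not fit:

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical estimate of the memory a basecall batch would need.
// The real requirement depends on the model and chunk size; the constant
// below is only a placeholder for illustration.
static size_t estimate_batch_memory(int batch_size) {
    const size_t bytes_per_element = 16ull << 20;  // assume ~16 MiB per batch element
    return static_cast<size_t>(batch_size) * bytes_per_element;
}

int main() {
    size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) {
        std::fprintf(stderr, "Failed to query CUDA device memory.\n");
        return EXIT_FAILURE;
    }

    const int min_batch_size = 64;  // minimum imposed by CUDA kernel granularity
    if (estimate_batch_memory(min_batch_size) > free_bytes) {
        // Fail early with a clear message instead of letting an
        // out-of-memory error surface as an uncaught exception later.
        std::fprintf(stderr,
                     "Not enough free GPU memory (%.2f GiB free) for the minimum "
                     "batch size of %d; is another process using the GPU?\n",
                     free_bytes / (1024.0 * 1024.0 * 1024.0), min_batch_size);
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}

With a check along these lines, the second dorado instance in the report above would exit with a clear message rather than hitting an uncaught c10::OutOfMemoryError.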

sjaenick commented 8 months ago

Thanks for the fast reply; any kind of meaningful error message and a non-zero exit code (rather than a segfault) would be fine.
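
As a sketch of the graceful-failure path being requested (not dorado's actual code: run_basecall_pipeline is a hypothetical stand-in for the real pipeline entry point, and here it simply throws to keep the example self-contained), wrapping pipeline startup in a try/catch lets the process print a concise message and return a non-zero exit code instead of aborting with a core dump when an uncaught c10::OutOfMemoryError reaches std::terminate:

#include <cstdio>
#include <cstdlib>
#include <exception>
#include <stdexcept>

// Hypothetical stand-in for dorado's pipeline setup; it throws to simulate
// the out-of-memory failure path seen in the log above.
static void run_basecall_pipeline() {
    throw std::runtime_error("CUDA out of memory while creating basecall pipeline");
}

int main() {
    try {
        run_basecall_pipeline();
    } catch (const std::exception& e) {
        // c10::OutOfMemoryError derives from std::exception, so a catch like
        // this would also cover the real CUDA OOM case: report briefly and
        // exit non-zero instead of letting terminate() abort the process.
        std::fprintf(stderr, "Basecalling failed: %s\n", e.what());
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}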