nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

out of memory core dump with dna_r10.4.1_e8.2_400bps_sup@v4.0.0 #64

Closed osilander closed 1 year ago

osilander commented 1 year ago

I'm trying to basecall on Ubuntu 20.04.5 LTS with two NVIDIA GeForce RTX 3080 cards. Basecalling with the high-accuracy model dna_r10.4.1_e8.2_400bps_hac@v4.0.0 works fine, but the super-accuracy model dna_r10.4.1_e8.2_400bps_sup@v4.0.0 results in an immediate out-of-memory core dump.

[2022-12-20 21:50:59.003] [info] > Creating basecall pipeline
terminate called after throwing an instance of 'c10::CUDAOutOfMemoryError'
  what():  CUDA out of memory. Tried to allocate 2.93 GiB (GPU 0; 9.78 GiB total capacity; 4.53 GiB already allocated; 2.19 GiB free; 5.97 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at ../c10/cuda/CUDACachingAllocator.cpp:578 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7fdb8e26a20e in /home/olin/dorado-0.1.1+eb48766-Linux/bin/../lib/libc10.so)
frame #1: <unknown function> + 0x1667f (0x7fdb8e2d467f in /home/olin/dorado-0.1.1+eb48766-Linux/bin/../lib/libc10_cuda.so)
frame #2: <unknown function> + 0x46528 (0x7fdb8e304528 in /home/olin/dorado-0.1.1+eb48766-Linux/bin/../lib/libc10_cuda.so)
frame #3: <unknown function> + 0x46752 (0x7fdb8e304752 in /home/olin/dorado-0.1.1+eb48766-Linux/bin/../lib/libc10_cuda.so)
frame #4: at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, c10::optional<c10::MemoryFormat>) + 0x7bf (0x7fdb8fb156af in /home/olin/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cpu.so)
frame #5: at::detail::empty_cuda(c10::ArrayRef<long>, c10::ScalarType, c10::optional<c10::Device>, c10::optional<c10::MemoryFormat>) + 0x115 (0x7fdba84f6ca5 in /home/olin/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cuda_cpp.so)
frame #6: at::detail::empty_cuda(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x31 (0x7fdba84f6f01 in /home/olin/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cuda_cpp.so)
frame #7: at::detail::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&) + 0x10f (0x7fdba84f706f in /home/olin/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cuda_cpp.so)
frame #8: <unknown function> + 0x2d720db (0x7fdb60e670db in /home/olin/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cuda_cu.so)
frame #9: <unknown function> + 0x2e55626 (0x7fdb60f4a626 in /home/olin/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cuda_cu.so)
frame #10: at::TensorIteratorBase::fast_set_up(at::TensorIteratorConfig const&) + 0x191 (0x7fdb8fb58231 in /home/olin/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cpu.so)
frame #11: at::TensorIteratorBase::build(at::TensorIteratorConfig&) + 0x7a (0x7fdb8fb5c9ea in /home/olin/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cpu.so)
frame #12: at::TensorIteratorBase::build_unary_op(at::TensorBase const&, at::TensorBase const&) + 0x99 (0x7fdb8fb5dac9 in /home/olin/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cpu.so)
frame #13: at::meta::structured_clamp::meta(at::Tensor const&, c10::OptionalRef<c10::Scalar>, c10::OptionalRef<c10::Scalar>) + 0x88 (0x7fdb900102c8 in /home/olin/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x2dee815 (0x7fdb60ee3815 in /home/olin/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cuda_cu.so)
frame #15: <unknown function> + 0x2dee963 (0x7fdb60ee3963 in /home/olin/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cuda_cu.so)
frame #16: at::_ops::clamp::call(at::Tensor const&, c10::optional<c10::Scalar> const&, c10::optional<c10::Scalar> const&) + 0x15d (0x7fdb906131ed in /home/olin/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cpu.so)
frame #17: dorado() [0x4fac8c]
frame #18: dorado() [0x5221dd]
frame #19: dorado() [0x5895c4]
frame #20: dorado() [0x588325]
frame #21: dorado() [0x587445]
frame #22: dorado() [0x584170]
frame #23: dorado() [0x571bde]
frame #24: dorado() [0x56be03]
frame #25: dorado() [0x5635cc]
frame #26: dorado() [0x589d7e]
frame #27: dorado() [0x5886e7]
frame #28: dorado() [0x587a09]
frame #29: dorado() [0x585fea]
frame #30: dorado() [0x571a1a]
frame #31: dorado() [0x65a8d1]
frame #32: dorado() [0x65a18b]
frame #33: dorado() [0x65d23a]
frame #34: dorado() [0x65d17c]
frame #35: dorado() [0x65d0eb]
frame #36: dorado() [0x65d078]
frame #37: dorado() [0x65d01c]
frame #38: <unknown function> + 0x145a0 (0x7fdbb73e85a0 in /home/olin/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cuda.so)
frame #39: <unknown function> + 0x8609 (0x7fdb5d967609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #40: clone + 0x43 (0x7fdb5d4fc133 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

iiSeymour commented 1 year ago

Hey @osilander

We have presets for 8 GB and 12 GB cards but not 10 GB (i.e. the NVIDIA GeForce 3080). You can see the batch size presets for 8 GB cards here: https://github.com/nanoporetech/dorado/blob/master/dorado/utils/cuda_utils.cpp#L137
I will add support, but in the meantime you can specify the batch size manually like so:

 $ dorado basecaller dna_r10.4.1_e8.2_260bps_sup@v4.0.0 pod5s/ -b 128 > calls.sam
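If a manually chosen batch size still runs out of memory, one option is to walk down from a preset until a run fits. This is only a sketch, not a dorado feature: the halving loop, the floor of 64, and the `pod5s/`/`calls.sam` paths are assumptions mirroring the command above.

```shell
#!/bin/sh
# Print candidate batch sizes, halving from a starting value down to a floor of 64.
# (The halving strategy and the floor are assumptions for illustration.)
halve_batch() {
    b=$1
    while [ "$b" -ge 64 ]; do
        echo "$b"
        b=$((b / 2))
    done
}

# Try each candidate until a run fits in VRAM (model and paths as in the command above).
for b in $(halve_batch 512); do
    if dorado basecaller dna_r10.4.1_e8.2_260bps_sup@v4.0.0 pod5s/ -b "$b" > calls.sam 2>/dev/null; then
        echo "succeeded with -b $b"
        break
    fi
done
```

Each failed attempt costs startup time, so starting from the preset for the next-smaller VRAM tier is usually faster than starting high.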
incoherentian commented 1 year ago

> Hey @osilander
>
> We have presets for 8GB cards and 12GB cards but not 10GB (i.e. NVIDIA GeForce 3080). You can see the batch size presets for 8GB cards here https://github.com/nanoporetech/dorado/blob/master/dorado/utils/cuda_utils.cpp#L137 I will add support but in the meantime, you can specify manually like so:
>
>  $ dorado basecaller dna_r10.4.1_e8.2_260bps_sup@v4.0.0 pod5s/ -b 128 > calls.sam

That matrix must have taken a decent amount of benchmarking to optimize; thanks to the dorado team for doing this! It is super, super handy. Other issues aside, the few dorado runs I've tried with standard configs have been a breeze for efficiency. Guppy took a lot of work to keep from wasting GPU node time on inefficient VRAM+SM utilization. We already crossed that hurdle, but it is nice to see it will soon (usually) be a thing of the past :)

For ease of reference, the batch size presets for SUP basecalling, from https://github.com/nanoporetech/dorado/blob/master/dorado/utils/cuda_utils.cpp#L137:

VRAM  | Batch size
------|-----------
8 GB  | 128
12 GB | 192
16 GB | 256
24 GB | 512
32 GB | 640
40 GB | 1024
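Read as a lookup, the table maps a card's VRAM to the largest preset that fits. A minimal sketch of that mapping follows; the fallback of 64 for cards under 8 GB is my assumption, and dorado's actual selection logic lives in cuda_utils.cpp, not here.

```shell
#!/bin/sh
# Pick the largest SUP batch-size preset that fits the given VRAM in GB,
# per the preset table. The <8 GB fallback of 64 is an assumption.
preset_batch() {
    if   [ "$1" -ge 40 ]; then echo 1024
    elif [ "$1" -ge 32 ]; then echo 640
    elif [ "$1" -ge 24 ]; then echo 512
    elif [ "$1" -ge 16 ]; then echo 256
    elif [ "$1" -ge 12 ]; then echo 192
    elif [ "$1" -ge 8 ];  then echo 128
    else echo 64
    fi
}

preset_batch 10   # a 10 GB RTX 3080 rounds down to the 8 GB preset: 128
```

Rounding down to the nearest tier is the conservative choice: a 10 GB card gets the 8 GB preset rather than risking the 12 GB one.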

madhav-madhusoodanan commented 1 year ago

Hi

I have a CUDA device with 2048 MB of VRAM (NVIDIA MX350); would it be possible to make a preset for that?

Thank you in advance

incoherentian commented 1 year ago

> I have a 2048 mb vram CUDA device (NVIDIA MX350), would it be possible to make a preset for that?

Pretty sure the MX series is old enough that it would be a no-go for Dorado. Unfortunately, Dorado requires CUDA compute capability 7.0 or higher, so you'll have to stick with Guppy. (You can run FAST Guppy fine on your MX, right? To even think about SUP you might need to set --chunks_per_runner to something tiny, maybe 50 or less, and wait quite a while for even a Flongle's worth of data to basecall.)
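One way to check the compute capability floor before committing node time, as a hedged sketch: the `compute_cap` field of `nvidia-smi --query-gpu` needs a reasonably recent NVIDIA driver (on older drivers it is absent, and the script falls through to the unsupported branch), and the 7.0 floor is as stated above.

```shell
#!/bin/sh
# Return success if a "major.minor" compute capability string meets a
# major-version floor.
meets_floor() {
    major=${1%%.*}
    [ "${major:-0}" -ge "$2" ]
}

# Query the first GPU's compute capability; empty if the driver lacks the field.
cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n1)
if meets_floor "${cap:-0}" 7; then
    echo "compute capability $cap: dorado should run"
else
    echo "compute capability ${cap:-unknown}: below 7.0, stick with guppy"
fi
```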

Kirk3gaard commented 1 year ago

I see the same error on A10 cards, so it would be great to include presets for 24 GB cards.

terminate called after throwing an instance of 'c10::CUDAOutOfMemoryError' what(): CUDA out of memory. Tried to allocate 7.81 GiB (GPU 0; 22.20 GiB total capacity; 11.94 GiB already allocated; 4.84 GiB free; 15.74 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

Will reduce the batch size according to your table and try again.
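The error message itself also suggests an allocator hint that can be combined with an explicit batch size. A sketch only: the input directory and output name are placeholders, 512 is the 24 GB preset from the table, and since dorado bundles its own libtorch, whether `max_split_size_mb` actually helps depends on allocator fragmentation.

```shell
#!/bin/sh
# Allocator hint from the PyTorch OOM message, plus an explicit 24 GB preset
# batch size. Paths are placeholders; the env var may or may not take effect.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
dorado basecaller dna_r10.4.1_e8.2_400bps_sup@v4.0.0 data/ -b 512 --verbose > calls.sam 2>/dev/null || true
```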

iiSeymour commented 1 year ago

@Kirk3gaard there is already a 24GB preset, but I suspect we are just over the limit here (GPU 0; 22.20 GiB total capacity) - can you run dorado again with -v and post the full command and output please.

Kirk3gaard commented 1 year ago

Tried to run it on a node with multiple A10 cards.

Command:

dorado basecaller dna_r10.4.1_e8.2_400bps_sup@v4.0.0 data/ --verbose  > NAME.sam

Output:

[2023-01-09 19:39:36.350] [info] > Creating basecall pipeline
[2023-01-09 19:39:42.624] [debug] - available GPU memory 23GB
[2023-01-09 19:39:45.964] [debug] - selected batchsize 512
terminate called after throwing an instance of 'c10::CUDAOutOfMemoryError'
  what():  CUDA out of memory. Tried to allocate 7.81 GiB (GPU 0; 22.20 GiB total capacity; 11.94 GiB already allocated; 4.84 GiB free; 15.74 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at ../c10/cuda/CUDACachingAllocator.cpp:578 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7f7aafd2720e in /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/../lib/libc10.so)
frame #1: <unknown function> + 0x1667f (0x7f7aafd9167f in /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/../lib/libc10_cuda.so)
frame #2: <unknown function> + 0x46528 (0x7f7aafdc1528 in /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/../lib/libc10_cuda.so)
frame #3: <unknown function> + 0x46752 (0x7f7aafdc1752 in /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/../lib/libc10_cuda.so)
frame #4: at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, c10::optional<c10::MemoryFormat>) + 0x7bf (0x7f7ab15d26af in /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cpu.so)
frame #5: at::detail::empty_cuda(c10::ArrayRef<long>, c10::ScalarType, c10::optional<c10::Device>, c10::optional<c10::MemoryFormat>) + 0x115 (0x7f7ac9fb3ca5 in /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cuda_cpp.so)
frame #6: at::detail::empty_cuda(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x31 (0x7f7ac9fb3f01 in /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cuda_cpp.so)
frame #7: at::detail::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&) + 0x10f (0x7f7ac9fb406f in /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cuda_cpp.so)
frame #8: <unknown function> + 0x2d720db (0x7f7a829240db in /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cuda_cu.so)
frame #9: <unknown function> + 0x2e55626 (0x7f7a82a07626 in /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cuda_cu.so)
frame #10: at::TensorIteratorBase::fast_set_up(at::TensorIteratorConfig const&) + 0x191 (0x7f7ab1615231 in /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cpu.so)
frame #11: at::TensorIteratorBase::build(at::TensorIteratorConfig&) + 0x7a (0x7f7ab16199ea in /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cpu.so)
frame #12: at::TensorIteratorBase::build_unary_op(at::TensorBase const&, at::TensorBase const&) + 0x99 (0x7f7ab161aac9 in /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cpu.so)
frame #13: at::meta::structured_clamp::meta(at::Tensor const&, c10::OptionalRef<c10::Scalar>, c10::OptionalRef<c10::Scalar>) + 0x88 (0x7f7ab1acd2c8 in /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x2dee815 (0x7f7a829a0815 in /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cuda_cu.so)
frame #15: <unknown function> + 0x2dee963 (0x7f7a829a0963 in /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cuda_cu.so)
frame #16: at::_ops::clamp::call(at::Tensor const&, c10::optional<c10::Scalar> const&, c10::optional<c10::Scalar> const&) + 0x15d (0x7f7ab20d01ed in /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cpu.so)
frame #17: /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/dorado() [0x4fac8c]
frame #18: /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/dorado() [0x5221dd]
frame #19: /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/dorado() [0x5895c4]
frame #20: /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/dorado() [0x588325]
frame #21: /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/dorado() [0x587445]
frame #22: /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/dorado() [0x584170]
frame #23: /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/dorado() [0x571bde]
frame #24: /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/dorado() [0x56be03]
frame #25: /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/dorado() [0x5635cc]
frame #26: /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/dorado() [0x589d7e]
frame #27: /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/dorado() [0x5886e7]
frame #28: /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/dorado() [0x587a09]
frame #29: /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/dorado() [0x585fea]
frame #30: /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/dorado() [0x571a1a]
frame #31: /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/dorado() [0x65a8d1]
frame #32: /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/dorado() [0x65a18b]
frame #33: /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/dorado() [0x65d23a]
frame #34: /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/dorado() [0x65d17c]
frame #35: /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/dorado() [0x65d0eb]
frame #36: /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/dorado() [0x65d078]
frame #37: /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/dorado() [0x65d01c]
frame #38: <unknown function> + 0x145a0 (0x7f7ad8ea55a0 in /home/bio.aau.dk/ur36rv/software/dorado-0.1.1+eb48766-Linux/bin/../lib/libtorch_cuda.so)
frame #39: <unknown function> + 0x8609 (0x7f7a7f424609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #40: clone + 0x43 (0x7f7a7eff1133 in /lib/x86_64-linux-gnu/libc.so.6)

/var/spool/slurm/d/job17953/slurm_script: line 49: 65767 Aborted                 (core dumped) $DORADO basecaller $MODEL $POD5 --verbose > $TMPDIR/$NAME.sam

Kirk3gaard commented 1 year ago

It seems to select a different batch size each time I run it on an RTX 4090 card with 24 GB of RAM. Eventually it picked one that was low enough for it to complete.

iiSeymour commented 1 year ago

GPU memory allocation is much better in recent releases of dorado and is no longer based on supported presets.