nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Duplex basecalling returns 'CUDA out of memory' error in v0.5.x #594

Closed: fayora closed this issue 4 months ago

fayora commented 8 months ago

Hi, we have been running duplex basecalling without issues in v0.4.x, but every v0.5.x release we have tested (v0.5.0, v0.5.1 and v0.5.2) throws the same CUDA out-of-memory error (see below). The error occurs on the same V100 GPUs where duplex basecalling runs without errors on v0.4.x.

The command and the error output are:

dorado duplex --recursive --min-qscore 7 /models/dna_r10.4.1_e8.2_400bps_sup@v4.1.0 /input > doradoDuplexOut.bam

[2024-01-23 02:11:13.901] [info] > No duplex pairs file provided, pairing will be performed automatically
[2024-01-23 02:11:31.090] [info]  - set batch size for cuda:0 to 512
[2024-01-23 02:11:31.131] [info]  - set batch size for cuda:1 to 512
[2024-01-23 02:11:31.172] [info]  - set batch size for cuda:2 to 512
[2024-01-23 02:11:31.199] [info]  - set batch size for cuda:3 to 320
[2024-01-23 02:11:33.077] [info]  - set batch size for cuda:0 to 1920
[2024-01-23 02:11:33.337] [info]  - set batch size for cuda:1 to 640
[2024-01-23 02:11:33.848] [info]  - set batch size for cuda:2 to 1280
[2024-01-23 02:11:34.357] [info]  - set batch size for cuda:3 to 1920
[2024-01-23 02:11:34.357] [info] > Starting Stereo Duplex pipeline
[2024-01-23 02:11:34.494] [info] > Reading read channel info
[2024-01-23 02:11:36.731] [info] > Processed read channel info
[2024-01-23 02:12:54.939] [warning] Caught Torch error 'CUDA out of memory. Tried to allocate 4.03 GiB (GPU 2; 15.78 GiB total capacity; 8.50 GiB already allocated; 2.81 GiB free; 12.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF', clearing CUDA cache and retrying.
terminate called after throwing an instance of 'c10::OutOfMemoryError'
  what():  CUDA out of memory. Tried to allocate 4.03 GiB (GPU 2; 15.78 GiB total capacity; 8.50 GiB already allocated; 2.81 GiB free; 12.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at /pytorch/pyold/c10/cuda/CUDACachingAllocator.cpp:913 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f39346fe9b7 in /dorado/bin/../lib/libdorado_torch_lib.so)
frame #1: + 0xa9f8645 (0x7f39346be645 in /dorado/bin/../lib/libdorado_torch_lib.so)
frame #2: + 0xa9f893e (0x7f39346be93e in /dorado/bin/../lib/libdorado_torch_lib.so)
frame #3: + 0xa9f8cce (0x7f39346becce in /dorado/bin/../lib/libdorado_torch_lib.so)
frame #4: + 0x4530bc1 (0x7f392e1f6bc1 in /dorado/bin/../lib/libdorado_torch_lib.so)
frame #5: at::detail::empty_generic(c10::ArrayRef, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, c10::optional) + 0x14 (0x7f392e1f0604 in /dorado/bin/../lib/libdorado_torch_lib.so)
frame #6: at::detail::empty_cuda(c10::ArrayRef, c10::ScalarType, c10::optional, c10::optional) + 0x111 (0x7f3932648f01 in /dorado/bin/../lib/libdorado_torch_lib.so)
frame #7: at::detail::empty_cuda(c10::ArrayRef, c10::optional, c10::optional, c10::optional, c10::optional, c10::optional) + 0x31 (0x7f39326491d1 in /dorado/bin/../lib/libdorado_torch_lib.so)
frame #8: at::native::empty_cuda(c10::ArrayRef, c10::optional, c10::optional, c10::optional, c10::optional, c10::optional) + 0x1f (0x7f39326f05af in /dorado/bin/../lib/libdorado_torch_lib.so)
frame #9: + 0xa61a339 (0x7f39342e0339 in /dorado/bin/../lib/libdorado_torch_lib.so)
frame #10: + 0xa61a41b (0x7f39342e041b in /dorado/bin/../lib/libdorado_torch_lib.so)
frame #11: at::_ops::empty_memory_format::redispatch(c10::DispatchKeySet, c10::ArrayRef, c10::optional, c10::optional, c10::optional, c10::optional, c10::optional) + 0xe7 (0x7f392f05c6e7 in /dorado/bin/../lib/libdorado_torch_lib.so)
frame #12: + 0x56c718f (0x7f392f38d18f in /dorado/bin/../lib/libdorado_torch_lib.so)
frame #13: at::_ops::empty_memory_format::call(c10::ArrayRef, c10::optional, c10::optional, c10::optional, c10::optional, c10::optional) + 0x1b2 (0x7f392f09c922 in /dorado/bin/../lib/libdorado_torch_lib.so)
frame #14: dorado() [0x91e4fb]
frame #15: dorado() [0x9a9fb1]
frame #16: dorado() [0x9a0e67]
frame #17: dorado() [0x9a510b]
frame #18: + 0x1196e380 (0x7f393b634380 in /dorado/bin/../lib/libdorado_torch_lib.so)
frame #19: + 0x8609 (0x7f392985f609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #20: clone + 0x43 (0x7f3928f74133 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

If we run an identical command with "basecaller" instead of "duplex", the analysis completes on v0.5.x without issues.

The server has 4x V100 GPUs; details below:

nvidia-smi

Tue Jan 23 02:47:59 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 12.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000001:00:00.0 Off |                    0 |
| N/A   42C    P0    38W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000002:00:00.0 Off |                    0 |
| N/A   41C    P0    39W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  Off  | 00000003:00:00.0 Off |                    0 |
| N/A   41C    P0    38W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  Off  | 00000004:00:00.0 Off |                    0 |
| N/A   41C    P0    38W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
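One thing we have not tried yet is pinning the batch size by hand instead of letting dorado auto-select it (the log above shows it picking values between 512 and 1920 per GPU). A sketch of what we would run, assuming dorado duplex accepts the same -b/--batchsize option as dorado basecaller:

# Hypothetical mitigation (untested here): cap the per-GPU batch size so the
# auto-selected value cannot exhaust the 16 GiB of memory on each V100.
dorado duplex \
  --recursive \
  --min-qscore 7 \
  --batchsize 256 \
  /models/dna_r10.4.1_e8.2_400bps_sup@v4.1.0 \
  /input > doradoDuplexOut.bam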

Thanks for the help!

malton-ont commented 8 months ago

Hi @fayora. Can you try the workaround from this comment?

fayora commented 8 months ago

Hi @malton-ont, many thanks for the suggestion.

I added export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:25 before calling dorado duplex and I did not get the error this time!
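For reference, the full invocation with the workaround in place (everything else unchanged from the command in my first comment):

# Workaround suggested by the OOM warning itself: stop the PyTorch CUDA caching
# allocator from splitting blocks larger than 25 MB, which helps avoid the
# fragmentation the error message points at.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:25
dorado duplex --recursive --min-qscore 7 /models/dna_r10.4.1_e8.2_400bps_sup@v4.1.0 /input > doradoDuplexOut.bam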

Question: that PyTorch parameter should not affect the quality of the results in any way... right? :-)

malton-ont commented 8 months ago

Hi @fayora, that's great to hear. No, this should have no effect on the results: max_split_size_mb only changes how the PyTorch caching allocator manages free memory blocks, not the basecalling computation itself.