nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

CUDA error: out of memory on Dorado 0.7.0 with dna_r10.4.1_e8.2_400bps_sup@v5.0.0 #849

Closed: VBHerrenC closed this issue 1 month ago

VBHerrenC commented 1 month ago

Issue Report

Please describe the issue:

When running my DNA dataset with Dorado 0.7.0 and the dna_r10.4.1_e8.2_400bps_sup@v5.0.0 model, a CUDA out-of-memory error is raised after about 8 minutes of basecalling. This is odd because I was able to successfully basecall this dataset last week with Dorado 0.6.0 and the v4.3.0 model. Additionally, basecalling appears to proceed normally when the same command is run with Dorado 0.7.0 and the v4.3.0 model. During troubleshooting with Dorado 0.7.0 and the v5.0.0 model, basecalling also completes with both --device cuda:all and --device cpu when --max-reads is set to 10. So it seems to be an issue with the full dataset and the v5.0.0 model, perhaps with how the batch sizes are being set? Any ideas would be appreciated!
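For reference, a sketch of the reduced-read troubleshooting run described above, reusing the model and pod5 paths from the log below (the redirect to test_calls.bam is illustrative; --max-reads 10 limits the run to the first 10 reads):

dorado basecaller /home/kyle/packages/dorado-0.7.0-linux-x64/models/dna_r10.4.1_e8.2_400bps_sup@v5.0.0 /home/pod5 --kit-name SQK-RBK114-24 --min-qscore 14 --trim all --max-reads 10 --device cpu > test_calls.bam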

Steps to reproduce the issue:


Run environment:

Logs

[2024-05-28 11:51:58.818] [info] Running: "basecaller" "/home/kyle/packages/dorado-0.7.0-linux-x64/models/dna_r10.4.1_e8.2_400bps_sup@v5.0.0" "/home/pod5" "--kit-name" "SQK-RBK114-24" "--min-qscore" "14" "--trim" "all" "--device" "cuda:all" "--verbose"
[2024-05-28 11:51:58.835] [info] > Creating basecall pipeline
[2024-05-28 11:51:58.835] [debug] CRFModelConfig { qscale:1.050000 qbias:1.300000 stride:6 bias:1 clamp:0 out_features:4096 state_len:5 outsize:4096 blank_score:0.000000 scale:1.000000 num_features:1 sample_rate:5000 mean_qscore_start_pos:60 SignalNormalisationParams { strategy:pa StandardisationScalingParams { standardise:1 mean:93.692398 stdev:23.506744}} BasecallerParams { chunk_size:12288 overlap:600 batch_size:0} convs: { 0: ConvParams { insize:1 size:64 winlen:5 stride:1 activation:swish} 1: ConvParams { insize:64 size:64 winlen:5 stride:1 activation:swish} 2: ConvParams { insize:64 size:128 winlen:9 stride:3 activation:swish} 3: ConvParams { insize:128 size:128 winlen:9 stride:2 activation:swish} 4: ConvParams { insize:128 size:512 winlen:5 stride:2 activation:swish}} model_type: tx { crf_encoder: CRFEncoderParams { insize:512 n_base:4 state_len:5 scale:5.000000 blank_score:2.000000 expand_blanks:1 permute:1} transformer: TxEncoderParams { d_model:512 nhead:8 depth:18 dim_feedforward:2048 deepnorm_alpha:2.449490}}}
[2024-05-28 11:51:59.512] [debug] cuda:0 memory available: 24.19GB
[2024-05-28 11:51:59.512] [debug] cuda:0 memory limit 23.19GB
[2024-05-28 11:51:59.512] [debug] cuda:0 maximum safe estimated batch size at chunk size 12288 is 288
[2024-05-28 11:51:59.512] [debug] cuda:0 maximum safe estimated batch size at chunk size 6144 is 576
[2024-05-28 11:51:59.512] [debug] Auto batchsize cuda:0: testing up to 512 in steps of 32
[2024-05-28 11:51:59.700] [debug] Auto batchsize cuda:0: 32, time per chunk 1.367614 ms
[2024-05-28 11:51:59.851] [debug] Auto batchsize cuda:0: 64, time per chunk 1.080064 ms
[2024-05-28 11:52:00.036] [debug] Auto batchsize cuda:0: 96, time per chunk 0.876075 ms
[2024-05-28 11:52:00.272] [debug] Auto batchsize cuda:0: 128, time per chunk 0.848552 ms
[2024-05-28 11:52:00.576] [debug] Auto batchsize cuda:0: 160, time per chunk 0.837414 ms
[2024-05-28 11:52:00.916] [debug] Auto batchsize cuda:0: 192, time per chunk 0.825253 ms
[2024-05-28 11:52:01.304] [debug] Auto batchsize cuda:0: 224, time per chunk 0.811671 ms
[2024-05-28 11:52:01.744] [debug] Auto batchsize cuda:0: 256, time per chunk 0.805868 ms
[2024-05-28 11:52:02.232] [debug] Auto batchsize cuda:0: 288, time per chunk 0.798567 ms
[2024-05-28 11:52:02.778] [debug] Auto batchsize cuda:0: 320, time per chunk 0.803005 ms
[2024-05-28 11:52:03.379] [debug] Auto batchsize cuda:0: 352, time per chunk 0.800620 ms
[2024-05-28 11:52:04.039] [debug] Auto batchsize cuda:0: 384, time per chunk 0.807237 ms
[2024-05-28 11:52:04.749] [debug] Auto batchsize cuda:0: 416, time per chunk 0.798562 ms
[2024-05-28 11:52:05.510] [debug] Auto batchsize cuda:0: 448, time per chunk 0.798869 ms
[2024-05-28 11:52:06.327] [debug] Auto batchsize cuda:0: 480, time per chunk 0.795750 ms
[2024-05-28 11:52:07.191] [debug] Auto batchsize cuda:0: 512, time per chunk 0.791324 ms
[2024-05-28 11:52:07.211] [debug] Largest batch size for cuda:0: 512, time per chunk 0.791324 ms
[2024-05-28 11:52:07.211] [debug] Final batch size for cuda:0[0]: 288
[2024-05-28 11:52:07.211] [debug] Final batch size for cuda:0[1]: 512
[2024-05-28 11:52:07.211] [info] cuda:0 using chunk size 12288, batch size 288
[2024-05-28 11:52:07.211] [debug] cuda:0 Model memory 20.55GB
[2024-05-28 11:52:07.211] [debug] cuda:0 Decode memory 2.50GB
[2024-05-28 11:52:08.887] [info] cuda:0 using chunk size 6144, batch size 512
[2024-05-28 11:52:08.887] [debug] cuda:0 Model memory 18.27GB
[2024-05-28 11:52:08.887] [debug] cuda:0 Decode memory 2.22GB
[2024-05-28 11:52:25.000] [debug] BasecallerNode chunk size 12288
[2024-05-28 11:52:25.000] [debug] BasecallerNode chunk size 6144
[2024-05-28 11:52:25.011] [debug] Load reads from file /home/kyle/RNA_Barcodes_Project_Leftover_Plasmids/pVRN057-066_055_all-001/20240516_1633_MC-115154_FAX74146_36ffbf23/pod5/FAX74146_36ffbf23_ad6b6a0c_43.pod5
[2024-05-28 11:52:27.519] [debug] > Kits to evaluate: 1
[2024-05-28 12:00:05.912] [debug] Load reads from file /home/kyle/RNA_Barcodes_Project_Leftover_Plasmids/pVRN057-066_055_all-001/20240516_1633_MC-115154_FAX74146_36ffbf23/pod5/FAX74146_36ffbf23_ad6b6a0c_62.pod5
[2024-05-28 12:00:19.930] [warning] Caught Torch error 'CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
', clearing CUDA cache and retrying.
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f295d7719b7 in /home/kyle/packages/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f2956cf6115 in /home/kyle/packages/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f295d73b958 in /home/kyle/packages/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: <unknown function> + 0xa9e9def (0x7f295d722def in /home/kyle/packages/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: <unknown function> + 0xa9f3ee7 (0x7f295d72cee7 in /home/kyle/packages/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: <unknown function> + 0xa9f4387 (0x7f295d72d387 in /home/kyle/packages/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #6: /home/kyle/packages/dorado-0.7.0-linux-x64/bin/dorado() [0x465b0d]
frame #7: <unknown function> + 0x1196e380 (0x7f29646a7380 in /home/kyle/packages/dorado-0.7.0-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #8: <unknown function> + 0x94ac3 (0x7f2951a3fac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f2951ad1850 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted
HalfPhoton commented 1 month ago

Hi @VBHerrenC, the V5 models use the new transformer architecture, and there's still some work to do to tune the auto batch size calculation for a broader range of hardware.

We can see that the auto batch size calculation has chosen 288 from this line: [2024-05-28 11:52:07.211] [info] cuda:0 using chunk size 12288, batch size 288

Could you try manually setting this a little lower, to --batchsize 256 or 224, to reduce memory consumption?
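For example, reusing the command from the log above and pinning the batch size (a sketch; only --batchsize is added, which overrides the automatic selection):

dorado basecaller /home/kyle/packages/dorado-0.7.0-linux-x64/models/dna_r10.4.1_e8.2_400bps_sup@v5.0.0 /home/pod5 --kit-name SQK-RBK114-24 --min-qscore 14 --trim all --device cuda:all --batchsize 256 --verbose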

Kind regards, Rich

VBHarrisN commented 1 month ago

Even when using --batchsize 256, the GPU is barely doing anything; all GPU monitoring tools show no activity. While the model is no longer running out of VRAM, it still does not appear to be making any progress. In addition, the batch size has dropped to about half of what the previous model used.
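One common way to watch GPU utilisation and memory while dorado runs (assuming the NVIDIA driver utilities are installed) is:

watch -n 1 nvidia-smi

Utilisation near 0% alongside high memory usage would match the behaviour described above.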

VBHerrenC commented 1 month ago

It is moving extremely slowly (about 2x slower than the v4.3.0 model, predicting 12 hours instead of 6), but it is writing out to the BAM file and hasn't crashed yet, unlike the previous run without the smaller batch size. Thanks!

HalfPhoton commented 1 month ago

We're expecting it to be 2x slower at the moment; there's much more optimisation to come to bring it closer to v4.3.0 sup speed. Closing as resolved, but we'll continue to improve performance and stability.