nanoporetech/dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Basecalling for lower spec GPU, v5 model segmentation fault #912

Closed: aringeri closed this issue 2 months ago

aringeri commented 3 months ago

Issue Report

Please describe the issue:

I am currently trialling a home desktop GPU to speed up basecalling (compared to CPU hardware). The device I have access to is an NVIDIA GeForce GTX 1060 3GB, which is considerably less powerful than workstation GPUs. The hac and fast basecalling models work well on this device, but I run into issues when attempting to use the sup models.

For the v4.3.0 sup model I get an 'out of memory' error, while the v5 sup model produces a segmentation fault.

Is there any way to configure dorado to use less GPU memory so it fits on these lower-spec devices?

Steps to reproduce the issue:

v4.3.0 sup

dorado basecaller -r dorado-models/dna_r10.4.1_e8.2_400bps_sup@v4.3.0/ \
  dna_r10.4.1_e8.2_400bps_5khz-FLO_PRO114M-SQK_LSK114_XL-5000.pod5 \
  > calls-sup.bam

v5.0.0 sup

dorado basecaller -r dorado-models/dna_r10.4.1_e8.2_400bps_sup@v5.0.0/ \
  dna_r10.4.1_e8.2_400bps_5khz-FLO_PRO114M-SQK_LSK114_XL-5000.pod5 \
  > calls-sup.bam

Run environment:

Logs

v4.3.0

[2024-06-27 05:38:07.573] [info] Running: "basecaller" "-v" "-r" "dorado-models/dna_r10.4.1_e8.2_400bps_sup@v4.3.0/" "dna_r10.4.1_e8.2_400bps_5khz-FLO_PRO114M-SQK_LSK114_XL-5000.pod5"
[2024-06-27 05:38:07.579] [info] Normalised: chunksize 10000 -> 9996
[2024-06-27 05:38:07.579] [info] Normalised: overlap 500 -> 498
[2024-06-27 05:38:07.579] [info] > Creating basecall pipeline
[2024-06-27 05:38:07.579] [debug] CRFModelConfig { qscale:1.050000 qbias:0.200000 stride:6 bias:0 clamp:1 out_features:-1 state_len:5 outsize:4096 blank_score:2.000000 scale:1.000000 num_features:1 sample_rate:5000 mean_qscore_start_pos:60 SignalNormalisationParams { strategy:pa StandardisationScalingParams { standardise:1 mean:93.379997 stdev:23.420000}} BasecallerParams { chunk_size:9996 overlap:498 batch_size:0} convs: { 0: ConvParams { insize:1 size:16 winlen:5 stride:1 activation:swish} 1: ConvParams { insize:16 size:16 winlen:5 stride:1 activation:swish} 2: ConvParams { insize:16 size:1024 winlen:19 stride:6 activation:tanh}} model_type: lstm { bias:0 outsize:4096 blank_score:2.000000 scale:1.000000}}
[2024-06-27 05:38:08.556] [debug] cuda:0 memory available: 2.91GB
[2024-06-27 05:38:08.556] [debug] cuda:0 memory limit 1.91GB
[2024-06-27 05:38:08.556] [warning] cuda:0 maximum safe estimated batch size at chunk size 9996 is only 64.
[2024-06-27 05:38:08.556] [debug] cuda:0 maximum safe estimated batch size at chunk size 4998 is 128
[2024-06-27 05:38:08.556] [debug] Auto batchsize cuda:0: testing up to 128 in steps of 64
[2024-06-27 05:38:12.458] [debug] Auto batchsize cuda:0: 64, time per chunk 30.365648 ms
[2024-06-27 05:38:20.077] [debug] Auto batchsize cuda:0: 128, time per chunk 29.739161 ms
[2024-06-27 05:38:20.084] [debug] Largest batch size for cuda:0: 128, time per chunk 29.739161 ms
[2024-06-27 05:38:20.084] [debug] Final batch size for cuda:0[0]: 128
[2024-06-27 05:38:20.084] [debug] Final batch size for cuda:0[1]: 128
[2024-06-27 05:38:20.084] [info] cuda:0 using chunk size 9996, batch size 128
[2024-06-27 05:38:20.084] [debug] cuda:0 Model memory 2.18GB
[2024-06-27 05:38:20.084] [debug] cuda:0 Decode memory 0.90GB
[2024-06-27 05:38:41.264] [error] CUDA out of memory. Tried to allocate 860.00 MiB (GPU 0; 2.93 GiB total capacity; 2.27 GiB already allocated; 497.69 MiB free; 2.31 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at /pytorch/pyold/c10/cuda/CUDACachingAllocator.cpp:913 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x775b40e389b7 in /home/user/dorado-0.7.2-linux-x64/bin/../lib/libdorado_torch_lib.so)

v5.0.0

[2024-06-27 05:37:17.921] [info] Running: "basecaller" "-v" "-r" "dorado-models/dna_r10.4.1_e8.2_400bps_sup@v5.0.0/" "dna_r10.4.1_e8.2_400bps_5khz-FLO_PRO114M-SQK_LSK114_XL-5000.pod5"
[2024-06-27 05:37:17.927] [info] > Creating basecall pipeline
[2024-06-27 05:37:17.928] [debug] CRFModelConfig { qscale:1.050000 qbias:1.300000 stride:6 bias:1 clamp:0 out_features:4096 state_len:5 outsize:4096 blank_score:0.000000 scale:1.000000 num_features:1 sample_rate:5000 mean_qscore_start_pos:60 SignalNormalisationParams { strategy:pa StandardisationScalingParams { standardise:1 mean:93.692398 stdev:23.506744}} BasecallerParams { chunk_size:12288 overlap:600 batch_size:0} convs: { 0: ConvParams { insize:1 size:64 winlen:5 stride:1 activation:swish} 1: ConvParams { insize:64 size:64 winlen:5 stride:1 activation:swish} 2: ConvParams { insize:64 size:128 winlen:9 stride:3 activation:swish} 3: ConvParams { insize:128 size:128 winlen:9 stride:2 activation:swish} 4: ConvParams { insize:128 size:512 winlen:5 stride:2 activation:swish}} model_type: tx { crf_encoder: CRFEncoderParams { insize:512 n_base:4 state_len:5 scale:5.000000 blank_score:2.000000 expand_blanks:1 permute:1} transformer: TxEncoderParams { d_model:512 nhead:8 depth:18 dim_feedforward:2048 deepnorm_alpha:2.449490}}}
[2024-06-27 05:37:19.310] [debug] cuda:0 memory available: 2.89GB
[2024-06-27 05:37:19.311] [debug] cuda:0 memory limit 1.89GB
[2024-06-27 05:37:19.311] [warning] cuda:0 maximum safe estimated batch size at chunk size 12288 is only 0.
[2024-06-27 05:37:19.311] [warning] cuda:0 maximum safe estimated batch size at chunk size 6144 is only 32.
Segmentation fault (core dumped)

nvidia-smi

| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
...
malton-ont commented 3 months ago

We recommend a minimum of 8GB of GPU RAM, so I'm not very surprised this card is struggling.

[2024-06-27 05:37:19.311] [warning] cuda:0 maximum safe estimated batch size at chunk size 12288 is only 0.

That probably explains the segfault.

You can set the batch size manually with --batchsize. The v5.0 sup models have a minimum batch size of 32; other models have a minimum of 64. If you still see failures at those values, the other option is to reduce --chunksize as well (I'd suggest by a factor of 2 each time), though this may have an effect on accuracy.
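
For example, on this card a first attempt could look something like the following (a sketch only, not something I've run here; it reuses the model directory and pod5 file from your reproduction steps and halves the default v5 chunk size of 12288 shown in your log):

dorado basecaller -r --batchsize 32 --chunksize 6144 \
  dorado-models/dna_r10.4.1_e8.2_400bps_sup@v5.0.0/ \
  dna_r10.4.1_e8.2_400bps_5khz-FLO_PRO114M-SQK_LSK114_XL-5000.pod5 \
  > calls-sup.bam

If that still runs out of memory, halve --chunksize again (to 3072), bearing in mind the possible accuracy impact mentioned above.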

aringeri commented 3 months ago

Thanks for your response. I'm able to basecall when using the --batchsize 64 parameter for both the v5.0.0 and v4.3.0 sup models.
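
For reference, the command I'm running now is along these lines (same inputs as in the reproduction steps above, shown here for the v5.0.0 model):

dorado basecaller -r --batchsize 64 \
  dorado-models/dna_r10.4.1_e8.2_400bps_sup@v5.0.0/ \
  dna_r10.4.1_e8.2_400bps_5khz-FLO_PRO114M-SQK_LSK114_XL-5000.pod5 \
  > calls-sup.bam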

Are there any guides on how the --chunksize and --batchsize parameters affect the accuracy of the basecalls?

malton-ont commented 3 months ago

We don't have a formal characterisation of how chunk size affects accuracy, no. Batch size should not affect accuracy.