nanoporetech / dorado


Dorado >= 0.8.1 exhausts all VRAM, then all system memory, even with tiny -b, on Windows (large pod5+barcodes) #1103

Open GabeAl opened 4 hours ago

GabeAl commented 4 hours ago

Issue Report

Please describe the issue:

I took a large pod5 file produced by MinKNOW (after skipping catch-up basecalling from a P2 Solo run). I tried basecalling it with dorado 0.8.2 and 0.8.1 on both a Linux and a Windows system. The two systems have somewhat different GPUs, which may complicate pinning down the root cause.

Steps to reproduce the issue:

  1. Prepare libraries with SQK-RBK114-24
  2. Acquire data in MinKNOW with basecalling. Skip any basecalling that didn't finish.
  3. Grab any of the pod5 files labeled skip. They're about 64GB each. Here I'll grab _5.
  4. Run: dorado.exe basecaller -v --emit-fastq -b 96 --kit-name SQK-RBK114-24 --output-dir basecalled/ sup pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5
  5. On Linux, it uses all the available VRAM (24 GB on a 3090 Ti). On Windows, memory usage keeps growing without stopping; it never once decreased.

System memory usage keeps climbing until 64 GB of system memory is consumed on top of the 16 GB of VRAM on the RTX 5000 Ada (80 GB total, the maximum Windows allows as "total GPU memory" on my system). When memory finally runs out, dorado prints an error about CUDA gemm functions failing to allocate, says it is trying to clear the CUDA cache and retry, and then dies or locks up horribly.
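For anyone trying to reproduce the growth curve, here is a minimal sketch for logging GPU memory alongside the run (assumes nvidia-smi is on PATH; the --query-gpu fields are standard, but note that on Windows the shared-GPU-memory growth shows up in Task Manager rather than in nvidia-smi):

```bash
# Log dedicated-VRAM usage once per second while dorado runs in another shell.
# memory.used only covers dedicated VRAM; watch Task Manager's "Shared GPU
# memory" counter for the system-memory side of the leak.
while true; do
    nvidia-smi --query-gpu=timestamp,memory.used,memory.total \
               --format=csv,noheader >> gpu_mem.log
    sleep 1
done
```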


Run environment:

Memory: 128 GB

Speed: 3600 MT/s
Slots used: 4 of 4
Form factor: SODIMM
Hardware reserved: 344 MB

Available: 101 GB
Cached: 19.9 GB
Committed: 43/136 GB
Paged pool: 822 MB
Non-paged pool: 1.3 GB
In use (compressed): 26.0 GB (0.2 MB)

GPU 1

NVIDIA RTX 5000 Ada Generation Laptop GPU

Driver version: 32.0.15.6094
Driver date:    8/14/2024
DirectX version:    12 (FL 12.1)
Physical location:  PCI bus 1, device 0, function 0

Utilization: 100%
Dedicated GPU memory: 15.5/16.0 GB
Shared GPU memory: 38.2/63.8 GB
GPU memory: 53.7/79.8 GB

(Before the run, GPU memory is basically unused: 0.5/16.0 GB dedicated and 0.2/63.8 GB shared.)


- Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance):
pod5 from MinKNOW (bla_bla_skipped_5.pod5)
- Source data location (on device or networked drive - NFS, etc.): 
Local SSD
- Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): 
enzyme 8.2.1, Kit 14 (latest chemistry, kit, and pore), read-length N50 ~6 Mb, unknown total read count; total pod5 size: 64 GB. (There are multiple 64 GB pod5 files, but I'm trying one at a time.)

- Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):
Cannot reproduce on a small pod5. In fact, that seems to be the problem: it runs fine right up until it runs out of RAM. I cannot split the pod5 due to apparent bugs in the pod5 format tooling that make it impossible to split a large pod5 into smaller chunks [edit: see below; I try this anyway and it works, but a folder of split files does not]. This is the original data from MinKNOW, not something finagled by me or converted from other formats, so there is no option to "regenerate" the pod5 files using different splitting criteria. (I had instructed MinKNOW to split by number of reads, but that apparently applies only to the basecalled files, not the _skip files.)

## Logs

* Please provide output trace of dorado (run dorado with -v, or -vv on a small subset)
Log with -v provided from 0.8.2 (0.8.1 produces a very similar log).

```
dorado basecaller -v --emit-fastq -b 96 --kit-name SQK-RBK114-24 --output-dir basecalled/ sup pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5
[2024-10-26 14:40:29.669] [info] Running: "basecaller" "-v" "--emit-fastq" "-b" "96" "--kit-name" "SQK-RBK114-24" "--output-dir" "basecalled/" "sup" "pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5"
[2024-10-26 14:40:30.061] [info]  - Note: FASTQ output is not recommended as not all data can be preserved.
[2024-10-26 14:40:30.206] [info]  - downloading dna_r10.4.1_e8.2_400bps_sup@v5.0.0 with httplib
[2024-10-26 14:40:43.499] [info] > Creating basecall pipeline
[2024-10-26 14:40:43.513] [debug] CRFModelConfig { qscale:1.050000 qbias:1.300000 stride:6 bias:1 clamp:0 out_features:4096 state_len:5 outsize:4096 blank_score:0.000000 scale:1.000000 num_features:1 sample_rate:5000 sample_type:DNA mean_qscore_start_pos:60 SignalNormalisationParams { strategy:pa StandardisationScalingParams { standardise:1 mean:93.692398 stdev:23.506744}} BasecallerParams { chunk_size:12288 overlap:600 batch_size:96} convs: { 0: ConvParams { insize:1 size:64 winlen:5 stride:1 activation:swish} 1: ConvParams { insize:64 size:64 winlen:5 stride:1 activation:swish} 2: ConvParams { insize:64 size:128 winlen:9 stride:3 activation:swish} 3: ConvParams { insize:128 size:128 winlen:9 stride:2 activation:swish} 4: ConvParams { insize:128 size:512 winlen:5 stride:2 activation:swish}} model_type: tx { crf_encoder: CRFEncoderParams { insize:512 n_base:4 state_len:5 scale:5.000000 blank_score:2.000000 expand_blanks:1 permute:1} transformer: TxEncoderParams { d_model:512 nhead:8 depth:18 dim_feedforward:2048 deepnorm_alpha:2.449490}}}
[2024-10-26 14:40:44.257] [debug] TxEncoderStack: use_koi_tiled false.
[2024-10-26 14:40:46.339] [debug] cuda:0 memory available: 15.55GB
[2024-10-26 14:40:46.339] [debug] cuda:0 memory limit 14.55GB
[2024-10-26 14:40:46.339] [debug] cuda:0 maximum safe estimated batch size at chunk size 12288 is 160
[2024-10-26 14:40:46.339] [debug] cuda:0 maximum safe estimated batch size at chunk size 6144 is 352
[2024-10-26 14:40:46.339] [info] cuda:0 using chunk size 12288, batch size 96
[2024-10-26 14:40:46.339] [debug] cuda:0 Model memory 6.85GB
[2024-10-26 14:40:46.339] [debug] cuda:0 Decode memory 0.83GB
[2024-10-26 14:40:48.518] [info] cuda:0 using chunk size 6144, batch size 96
[2024-10-26 14:40:48.518] [debug] cuda:0 Model memory 3.43GB
[2024-10-26 14:40:48.518] [debug] cuda:0 Decode memory 0.42GB
[2024-10-26 14:40:48.897] [debug] BasecallerNode chunk size 12288
[2024-10-26 14:40:48.897] [debug] BasecallerNode chunk size 6144
[2024-10-26 14:40:48.943] [debug] Load reads from file pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5
[2024-10-26 14:40:49.664] [debug] > Kits to evaluate: 1
[2024-10-26 14:41:38.731] [debug] Invalid trim interval for read id 9b72d1ed-67b1-4e59-a6a1-7bf8d0fb9762: 117-117. Trimming will be skipped.
[2024-10-26 14:43:21.296] [debug] Invalid trim interval for read id 40d2e852-4b32-46e5-8de0-238af8076f28: 118-113. Trimming will be skipped.
[2024-10-26 14:44:42.855] [debug] Invalid trim interval for read id c642511c-7b9f-477c-82ba-7df1b07bc42c: 115-112. Trimming will be skipped.
...
```


At the very end, as memory is completely exhausted (80 GB of VRAM + shared GPU memory in use), it prints something like "CUDA kernel couldn't allocate for gemm_something...", then the display completely locks up (hard freeze).

Is there a way to split a 64 GB pod5 file produced by MinKNOW? ([pod5 subset is a bit broken](https://github.com/nanoporetech/pod5-file-format/issues/109)) I can probably work around this glitch if I can split the file into a few hundred parts, each just small enough to fit under the 80 GB VRAM limit without crashing. My best run so far produced 350 MB of fastq output (with -b 32), but dorado won't accept smaller values of -b.
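If the split works, a loop along these lines would let every part start from a clean memory footprint (paths and output layout are illustrative; the dorado flags match my runs above):

```bash
# Basecall each split pod5 in its own dorado invocation so any leaked
# memory is released between parts (file/dir names are hypothetical).
for f in split/*.pod5; do
    dorado basecaller --emit-fastq -b 32 --kit-name SQK-RBK114-24 \
        --output-dir "basecalled/$(basename "$f" .pod5)/" sup "$f"
done
```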
GabeAl commented 3 hours ago

Currently trying pod5 subset anyway.

Here's how:

  1. Basecall using hac (see the sketch just after this list). I know it's wasteful, but I need some way to split. The test basecalled file is "test.fq".
  2. Build the tsv mapping file that pod5 subset expects:

     ```
     printf "read_id\tbarcode\n" > map.tsv
     sed -n '1~4p' test.fq | grep -F 'barcode' | sed 's/\t.*_barcode/\tbarcode/' | sed 's/\tDS.*//' | cut -c2- >> map.tsv
     ```

  3. Subset on that column; this spits out one pod5 file per barcode:

     ```
     pod5 subset ../pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5 --columns barcode --table map.tsv --threads 1 --missing-ok
     ```
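For completeness, step 1 was along these lines (using hac is the only point; the exact fastq naming inside the output dir varies, so the merge into test.fq is an assumption on my part):

```bash
# Cheap hac pass whose only purpose is to learn each read's barcode.
dorado basecaller --emit-fastq --kit-name SQK-RBK114-24 \
    --output-dir hac_out/ hac pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5
# Merge whatever fastq files the run produced (output-layout assumption).
cat hac_out/*.fastq > test.fq
```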

This is slow but at least performs some subsetting. If any barcode is still too big, I'll add a new column trivially (awk -F'\t' 'BEGIN{OFS=FS} NR==1{print $0,"chunk";next} {print $0,int((NR-2)/1000)}' — the header row needs a real column name, not a number) to split it into 1000-record chunks and call subset on that column, as sketched below.
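Putting that together, a minimal sketch (the column name "chunk" and the output file name are my choices; the subset invocation mirrors the one above):

```bash
# Append a "chunk" column that buckets reads into groups of 1000.
# NR==1 is the header row; data rows start at NR==2, so the first
# 1000 reads land in chunk 0.
awk -F'\t' 'BEGIN{OFS=FS}
    NR==1 {print $0, "chunk"; next}
    {print $0, int((NR-2)/1000)}' map.tsv > map_chunked.tsv

# One output pod5 per chunk value.
pod5 subset ../pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5 \
    --columns chunk --table map_chunked.tsv --threads 1 --missing-ok
```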

All-in-one to make a tsv where you can split on barcode, raw batch, or batch-within-barcode ("barcodeBatch"); note the three-argument match() makes this gawk-specific:

```bash
awk -F'\t' 'BEGIN{print "read_id\tbarcode\tchunk\tbarcodeBatch"; OFS=FS}
    NR%4==1 && /barcode/ {
        sub(/^@/, "", $1)                 # strip the fastq "@" from the read id
        match($3, /barcode[0-9]+/, m)     # gawk 3-arg match captures the barcode
        print $1, m[0], int(++c/1000), m[0] "-" int(b[m[0]]/1000)
        ++b[m[0]]
    }' test.fq > map.tsv
```
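Then splitting on the combined column is the same pod5 subset call as before, just with a different column name:

```bash
pod5 subset ../pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5 \
    --columns barcodeBatch --table map.tsv --threads 1 --missing-ok
```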

But if it works, this is a workaround. The bug is that 'sup' doesn't seem to evict old context or intermediate state from memory as it goes. This is a modern Ada-generation GPU with 16 GB of VRAM, one of the most common GPUs in mobile workstations for AI/ML, and among the most performant and efficient. It would make sense to support it.

GabeAl commented 27 minutes ago

Another update:

Hopefully these observations will help you fix the bug.