nanoporetech / dorado


Dorado >= 0.8.1 exhausts all VRAM, then all system memory, even with tiny -b, on Windows (large pod5+barcodes) #1103

Open GabeAl opened 4 hours ago

GabeAl commented 4 hours ago

Issue Report

Please describe the issue:

I took a large pod5 file produced by MinKNOW (after skipping catch-up basecalling from a P2 Solo run). I tried basecalling it with dorado 0.8.2 and 0.8.1 on both a Linux and a Windows system. The two systems have somewhat different GPUs, which may complicate pinning down the root cause.

Steps to reproduce the issue:

  1. Prepare libraries with SQK-RBK114-24
  2. Acquire data in MinKNOW with basecalling. Skip any basecalling that didn't finish.
  3. Grab any of the pod5 files labeled skip. They're about 64GB each. Here I'll grab _5.
  4. Run: dorado.exe basecaller -v --emit-fastq -b 96 --kit-name SQK-RBK114-24 --output-dir basecalled/ sup pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5
  5. On Linux, it uses all the available VRAM (24 GB on a 3090 Ti). On Windows, memory usage keeps growing without stopping; it never once decreased.

System memory usage keeps climbing until 64 GB of system memory is consumed on top of the 16 GB of VRAM on the RTX 5000 Ada (80 GB total, the maximum Windows allows as "total GPU memory" on my system). When memory finally runs out, dorado prints an error about CUDA gemm functions failing to allocate, says it is trying to clear the CUDA cache and retry, and then dies or locks up horribly.
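For anyone trying to reproduce the growth curve, here is a minimal sketch for logging GPU memory alongside the run (assumes nvidia-smi is on PATH; the --query-gpu fields are standard, but note that on Windows the shared-GPU-memory growth shows up in Task Manager rather than in nvidia-smi):

```bash
# Log dedicated-VRAM usage once per second while dorado runs in another shell.
# memory.used only covers dedicated VRAM; watch Task Manager's "Shared GPU
# memory" counter for the system-memory side of the leak.
while true; do
    nvidia-smi --query-gpu=timestamp,memory.used,memory.total \
               --format=csv,noheader >> gpu_mem.log
    sleep 1
done
```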


Run environment:

Memory: 128 GB

Speed: 3600 MT/s
Slots used: 4 of 4
Form factor: SODIMM
Hardware reserved: 344 MB

Available: 101 GB
Cached: 19.9 GB
Committed: 43/136 GB
Paged pool: 822 MB
Non-paged pool: 1.3 GB
In use (compressed): 26.0 GB (0.2 MB)

GPU 1

NVIDIA RTX 5000 Ada Generation Laptop GPU

Driver version: 32.0.15.6094
Driver date:    8/14/2024
DirectX version:    12 (FL 12.1)
Physical location:  PCI bus 1, device 0, function 0

Utilization: 100%
Dedicated GPU memory: 15.5/16.0 GB
Shared GPU memory: 38.2/63.8 GB
GPU memory: 53.7/79.8 GB

(Before the run, GPU memory is basically unused: 0.5/16.0 GB dedicated and 0.2/63.8 GB shared.)


- Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance):
pod5 from MinKNOW (bla_bla_skipped_5.pod5)
- Source data location (on device or networked drive - NFS, etc.): 
Local SSD
- Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): 
enzyme 8.2.1, Kit 14 (latest chemistry, kit, and pore), read-length N50 ~6 Mb, unknown total read count; total pod5 size: 64 GB. (There are multiple 64 GB pod5 files, but I'm trying one at a time.)

- Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):
Cannot reproduce on a small pod5. In fact, that seems to be the problem: it runs fine right up until it runs out of RAM. I cannot split the pod5 due to apparent bugs in the pod5 format tooling that make it impossible to split a large pod5 into smaller chunks [edit: see below; I try this anyway and it works, but a folder of split files does not]. This is the original data from MinKNOW, not something finagled by me or converted from other formats, so there is no option to "regenerate" the pod5 files using different splitting criteria. (I had instructed MinKNOW to split by number of reads, but that apparently applies only to the basecalled files, not the _skip files.)

## Logs

* Please provide output trace of dorado (run dorado with -v, or -vv on a small subset)
Log with -v provided from 0.8.2 (0.8.1 produces a very similar log).

```
dorado basecaller -v --emit-fastq -b 96 --kit-name SQK-RBK114-24 --output-dir basecalled/ sup pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5
[2024-10-26 14:40:29.669] [info] Running: "basecaller" "-v" "--emit-fastq" "-b" "96" "--kit-name" "SQK-RBK114-24" "--output-dir" "basecalled/" "sup" "pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5"
[2024-10-26 14:40:30.061] [info]  - Note: FASTQ output is not recommended as not all data can be preserved.
[2024-10-26 14:40:30.206] [info]  - downloading dna_r10.4.1_e8.2_400bps_sup@v5.0.0 with httplib
[2024-10-26 14:40:43.499] [info] > Creating basecall pipeline
[2024-10-26 14:40:43.513] [debug] CRFModelConfig { qscale:1.050000 qbias:1.300000 stride:6 bias:1 clamp:0 out_features:4096 state_len:5 outsize:4096 blank_score:0.000000 scale:1.000000 num_features:1 sample_rate:5000 sample_type:DNA mean_qscore_start_pos:60 SignalNormalisationParams { strategy:pa StandardisationScalingParams { standardise:1 mean:93.692398 stdev:23.506744}} BasecallerParams { chunk_size:12288 overlap:600 batch_size:96} convs: { 0: ConvParams { insize:1 size:64 winlen:5 stride:1 activation:swish} 1: ConvParams { insize:64 size:64 winlen:5 stride:1 activation:swish} 2: ConvParams { insize:64 size:128 winlen:9 stride:3 activation:swish} 3: ConvParams { insize:128 size:128 winlen:9 stride:2 activation:swish} 4: ConvParams { insize:128 size:512 winlen:5 stride:2 activation:swish}} model_type: tx { crf_encoder: CRFEncoderParams { insize:512 n_base:4 state_len:5 scale:5.000000 blank_score:2.000000 expand_blanks:1 permute:1} transformer: TxEncoderParams { d_model:512 nhead:8 depth:18 dim_feedforward:2048 deepnorm_alpha:2.449490}}}
[2024-10-26 14:40:44.257] [debug] TxEncoderStack: use_koi_tiled false.
[2024-10-26 14:40:46.339] [debug] cuda:0 memory available: 15.55GB
[2024-10-26 14:40:46.339] [debug] cuda:0 memory limit 14.55GB
[2024-10-26 14:40:46.339] [debug] cuda:0 maximum safe estimated batch size at chunk size 12288 is 160
[2024-10-26 14:40:46.339] [debug] cuda:0 maximum safe estimated batch size at chunk size 6144 is 352
[2024-10-26 14:40:46.339] [info] cuda:0 using chunk size 12288, batch size 96
[2024-10-26 14:40:46.339] [debug] cuda:0 Model memory 6.85GB
[2024-10-26 14:40:46.339] [debug] cuda:0 Decode memory 0.83GB
[2024-10-26 14:40:48.518] [info] cuda:0 using chunk size 6144, batch size 96
[2024-10-26 14:40:48.518] [debug] cuda:0 Model memory 3.43GB
[2024-10-26 14:40:48.518] [debug] cuda:0 Decode memory 0.42GB
[2024-10-26 14:40:48.897] [debug] BasecallerNode chunk size 12288
[2024-10-26 14:40:48.897] [debug] BasecallerNode chunk size 6144
[2024-10-26 14:40:48.943] [debug] Load reads from file pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5
[2024-10-26 14:40:49.664] [debug] > Kits to evaluate: 1
[2024-10-26 14:41:38.731] [debug] Invalid trim interval for read id 9b72d1ed-67b1-4e59-a6a1-7bf8d0fb9762: 117-117. Trimming will be skipped.
[2024-10-26 14:43:21.296] [debug] Invalid trim interval for read id 40d2e852-4b32-46e5-8de0-238af8076f28: 118-113. Trimming will be skipped.
[2024-10-26 14:44:42.855] [debug] Invalid trim interval for read id c642511c-7b9f-477c-82ba-7df1b07bc42c: 115-112. Trimming will be skipped.
...
```


At the very end, as memory is completely exhausted (80 GB of VRAM + shared GPU memory in use), it prints something like "CUDA kernel couldn't allocate for gemm_something...", then the display completely locks up (hard freeze).

Is there a way to split a 64 GB pod5 file produced by MinKNOW? ([pod5 subset is a bit broken](https://github.com/nanoporetech/pod5-file-format/issues/109)) I can probably work around this glitch if I can split the file into a few hundred parts, each just small enough to fit under the 80 GB VRAM limit without crashing. My best run so far produced 350 MB of fastq output (with -b 32), but dorado won't accept smaller values of -b.
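If the split works, a loop along these lines would let every part start from a clean memory footprint (paths and output layout are illustrative; the dorado flags match my runs above):

```bash
# Basecall each split pod5 in its own dorado invocation so any leaked
# memory is released between parts (file/dir names are hypothetical).
for f in split/*.pod5; do
    dorado basecaller --emit-fastq -b 32 --kit-name SQK-RBK114-24 \
        --output-dir "basecalled/$(basename "$f" .pod5)/" sup "$f"
done
```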
GabeAl commented 3 hours ago

Currently trying pod5 subset anyway.

Here's how:

  1. Basecall using hac (see the sketch just after this list). I know it's wasteful, but I need some way to split. The test basecalled file is "test.fq".
  2. Build the tsv mapping file that pod5 subset expects:

     ```
     printf "read_id\tbarcode\n" > map.tsv
     sed -n '1~4p' test.fq | grep -F 'barcode' | sed 's/\t.*_barcode/\tbarcode/' | sed 's/\tDS.*//' | cut -c2- >> map.tsv
     ```

  3. Subset on that column; this spits out one pod5 file per barcode:

     ```
     pod5 subset ../pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5 --columns barcode --table map.tsv --threads 1 --missing-ok
     ```
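For completeness, step 1 was along these lines (using hac is the only point; the exact fastq naming inside the output dir varies, so the merge into test.fq is an assumption on my part):

```bash
# Cheap hac pass whose only purpose is to learn each read's barcode.
dorado basecaller --emit-fastq --kit-name SQK-RBK114-24 \
    --output-dir hac_out/ hac pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5
# Merge whatever fastq files the run produced (output-layout assumption).
cat hac_out/*.fastq > test.fq
```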

This is slow but at least performs some subsetting. If any barcode is still too big, I'll add a new column trivially (awk -F'\t' 'BEGIN{OFS=FS} NR==1{print $0,"chunk";next} {print $0,int((NR-2)/1000)}' — the header row needs a real column name, not a number) to split it into 1000-record chunks and call subset on that column, as sketched below.
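Putting that together, a minimal sketch (the column name "chunk" and the output file name are my choices; the subset invocation mirrors the one above):

```bash
# Append a "chunk" column that buckets reads into groups of 1000.
# NR==1 is the header row; data rows start at NR==2, so the first
# 1000 reads land in chunk 0.
awk -F'\t' 'BEGIN{OFS=FS}
    NR==1 {print $0, "chunk"; next}
    {print $0, int((NR-2)/1000)}' map.tsv > map_chunked.tsv

# One output pod5 per chunk value.
pod5 subset ../pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5 \
    --columns chunk --table map_chunked.tsv --threads 1 --missing-ok
```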

All-in-one to make a tsv where you can split on barcode, raw batch, or batch-within-barcode ("barcodeBatch"); note the three-argument match() makes this gawk-specific:

```bash
awk -F'\t' 'BEGIN{print "read_id\tbarcode\tchunk\tbarcodeBatch"; OFS=FS}
    NR%4==1 && /barcode/ {
        sub(/^@/, "", $1)                 # strip the fastq "@" from the read id
        match($3, /barcode[0-9]+/, m)     # gawk 3-arg match captures the barcode
        print $1, m[0], int(++c/1000), m[0] "-" int(b[m[0]]/1000)
        ++b[m[0]]
    }' test.fq > map.tsv
```
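Then splitting on the combined column is the same pod5 subset call as before, just with a different column name:

```bash
pod5 subset ../pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5 \
    --columns barcodeBatch --table map.tsv --threads 1 --missing-ok
```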

But if it works, this is a workaround. The bug is that 'sup' doesn't seem to evict old context or intermediate state from memory as it goes. This is a modern Ada-generation GPU with 16 GB of VRAM, one of the most common GPUs in mobile workstations for AI/ML, and among the most performant and efficient. It would make sense to support it.

GabeAl commented 27 minutes ago

Another update:

Hopefully these observations will help you fix the bug.