Open GabeAl opened 4 hours ago
Currently trying pod5 subset anyway.
Here's how.
printf "read_id\tbarcode\n" > map.tsv; sed -n '1~4p' test.fq | grep -F 'barcode' | sed 's/\t.*_barcode/\tbarcode/' | sed 's/\tDS.*//' | cut -c2- >> map.tsv
<-- this is the TSV mapping file that pod5 subset expects.
pod5 subset ../pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5 --columns barcode --table map.tsv --threads 1 --missing-ok
<-- this spits out a pod5 file per barcode. This is slow but at least performs some subsetting. If any barcode is still too big, I'll try to add a new column trivially ( awk -F'\t' 'BEGIN{OFS=FS} {print $0, int(NR/1000)}' ) to split it into 1000-record chunks and call subset on that column.
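A minimal sketch of that chunk-column step, written as a shell function. The function name add_chunk_column and the output file name are made up for illustration. Unlike the inline awk snippet, which also appends a number to the header row, this version labels the new column "chunk" and counts data rows from zero, so every chunk holds a full 1000 records.

```shell
# add_chunk_column MAP [SIZE]
# Append a "chunk" column to an existing tab-separated map file.
# (Hypothetical helper; SIZE defaults to 1000 records per chunk.)
add_chunk_column() {
  awk -F'\t' -v size="${2:-1000}" '
    BEGIN { OFS = FS }
    NR == 1 { print $0, "chunk"; next }   # name the new column in the header
    { print $0, int((NR - 2) / size) }    # first data row is NR == 2
  ' "$1"
}
```

Assumed usage: add_chunk_column map.tsv > map_chunked.tsv, then rerun the pod5 subset command with --columns chunk --table map_chunked.tsv.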
All-in-one to make a tsv where you can split on barcode, raw batch, or batch-within-barcode ("barcodeBatch"):
cat test.fq | awk -F'\t' 'BEGIN{print "read_id\tbarcode\tchunk\tbarcodeBatch"; OFS=FS} NR%4==1 && /barcode/ {sub(/^@/,"",$1); match($3,/barcode[0-9]+/,m); print $1,m[0],int(++c/1000),m[0] "-" int(b[m[0]]/1000); ++b[m[0]]}' > map.tsv
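The all-in-one command uses the three-argument form of match(), which is a gawk extension, so other awk implementations such as mawk may reject it. Below is a sketch of the same idea using only RSTART and RLENGTH, which should run under any POSIX awk. The function name make_map is hypothetical, and the sketch assumes (as the original command does) that the barcode appears in the third tab-separated field of the FASTQ header. It differs in two small ways: it counts from zero, so the first chunk holds 1000 reads rather than 999, and it skips reads whose third field contains no barcode.

```shell
# make_map FASTQ [SIZE]
# Print a TSV with read_id, barcode, chunk and barcodeBatch columns.
# (Hypothetical helper; SIZE defaults to 1000 reads per chunk.)
make_map() {
  awk -F'\t' -v size="${2:-1000}" '
    BEGIN { OFS = FS; print "read_id", "barcode", "chunk", "barcodeBatch" }
    # Header lines are every 4th line starting at line 1.
    NR % 4 == 1 && match($3, /barcode[0-9]+/) {
      id = substr($1, 2)                  # drop the leading "@"
      bc = substr($3, RSTART, RLENGTH)    # e.g. "barcode01"
      chunk = int(n / size); n++          # running batch over all reads
      batch = int(seen[bc] / size); seen[bc]++   # batch within this barcode
      print id, bc, chunk, bc "-" batch
    }
  ' "$1"
}
```

Assumed usage: make_map test.fq > map.tsv, then pass barcode, chunk or barcodeBatch to --columns in the pod5 subset call.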
But even if it works, this is only a workaround. The bug is that 'sup' doesn't know how to evict context or other old data from its memory. This is a modern Ada generation GPU with 16GB of VRAM. It's one of the most common modern GPUs in mobile workstations for AI/ML. It is also among the most performant and efficient. It would make sense to support it.
Another update:
Hopefully these observations will help you fix the bug.
Issue Report
Please describe the issue:
I took a large pod5 file produced by MinKNOW (after skipping catch-up basecalling from a P2 solo run). I tried basecalling using dorado 0.8.2 and 0.8.1 on both a Linux and a Windows system. The GPUs are a bit different, so this may also complicate figuring out the root cause.
Steps to reproduce the issue:
dorado.exe basecaller -v --emit-fastq -b 96 --kit-name SQK-RBK114-24 --output-dir basecalled/ sup pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5
System memory keeps increasing until 64GB of system memory is used up in addition to the 16GB of VRAM on the RTX 5000 Ada (80GB total, which is the max Windows allows as "total GPU memory" in my system). When the system finally runs out of memory, it shows an error about CUDA gemm functions not allocating, mentions it's trying to clear the CUDA cache and try again, but instead dies or locks up horribly.
Run environment:
Dorado version: 0.8.1 and 0.8.2. (0.8.0 crashes silently after producing ~70MB of output, no matter which batch size is chosen. It does not run out of RAM; it just crashes with no error, even with -vv.)
Dorado command:
dorado.exe basecaller -v --emit-fastq -b 32 --kit-name SQK-RBK114-24 --output-dir basecalled/ sup pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5
Operating system: Windows 11 23H2
Hardware (CPUs, Memory, GPUs):
Memory
GPU 1
(Before the run, GPU memory is basically unused, just 0.5/16.0 GB, and shared is 0.2/63.8GB).
dorado basecaller -v --emit-fastq -b 96 --kit-name SQK-RBK114-24 --output-dir basecalled/ sup pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5
[2024-10-26 14:40:29.669] [info] Running: "basecaller" "-v" "--emit-fastq" "-b" "96" "--kit-name" "SQK-RBK114-24" "--output-dir" "basecalled/" "sup" "pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5"
[2024-10-26 14:40:30.061] [info] - Note: FASTQ output is not recommended as not all data can be preserved.
[2024-10-26 14:40:30.206] [info] - downloading dna_r10.4.1_e8.2_400bps_sup@v5.0.0 with httplib
[2024-10-26 14:40:43.499] [info] > Creating basecall pipeline
[2024-10-26 14:40:43.513] [debug] CRFModelConfig { qscale:1.050000 qbias:1.300000 stride:6 bias:1 clamp:0 out_features:4096 state_len:5 outsize:4096 blank_score:0.000000 scale:1.000000 num_features:1 sample_rate:5000 sample_type:DNA mean_qscore_start_pos:60 SignalNormalisationParams { strategy:pa StandardisationScalingParams { standardise:1 mean:93.692398 stdev:23.506744}} BasecallerParams { chunk_size:12288 overlap:600 batch_size:96} convs: { 0: ConvParams { insize:1 size:64 winlen:5 stride:1 activation:swish} 1: ConvParams { insize:64 size:64 winlen:5 stride:1 activation:swish} 2: ConvParams { insize:64 size:128 winlen:9 stride:3 activation:swish} 3: ConvParams { insize:128 size:128 winlen:9 stride:2 activation:swish} 4: ConvParams { insize:128 size:512 winlen:5 stride:2 activation:swish}} model_type: tx { crf_encoder: CRFEncoderParams { insize:512 n_base:4 state_len:5 scale:5.000000 blank_score:2.000000 expand_blanks:1 permute:1} transformer: TxEncoderParams { d_model:512 nhead:8 depth:18 dim_feedforward:2048 deepnorm_alpha:2.449490}}}
[2024-10-26 14:40:44.257] [debug] TxEncoderStack: use_koi_tiled false.
[2024-10-26 14:40:46.339] [debug] cuda:0 memory available: 15.55GB
[2024-10-26 14:40:46.339] [debug] cuda:0 memory limit 14.55GB
[2024-10-26 14:40:46.339] [debug] cuda:0 maximum safe estimated batch size at chunk size 12288 is 160
[2024-10-26 14:40:46.339] [debug] cuda:0 maximum safe estimated batch size at chunk size 6144 is 352
[2024-10-26 14:40:46.339] [info] cuda:0 using chunk size 12288, batch size 96
[2024-10-26 14:40:46.339] [debug] cuda:0 Model memory 6.85GB
[2024-10-26 14:40:46.339] [debug] cuda:0 Decode memory 0.83GB
[2024-10-26 14:40:48.518] [info] cuda:0 using chunk size 6144, batch size 96
[2024-10-26 14:40:48.518] [debug] cuda:0 Model memory 3.43GB
[2024-10-26 14:40:48.518] [debug] cuda:0 Decode memory 0.42GB
[2024-10-26 14:40:48.897] [debug] BasecallerNode chunk size 12288
[2024-10-26 14:40:48.897] [debug] BasecallerNode chunk size 6144
[2024-10-26 14:40:48.943] [debug] Load reads from file pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5
[2024-10-26 14:40:49.664] [debug] > Kits to evaluate: 1
[2024-10-26 14:41:38.731] [debug] Invalid trim interval for read id 9b72d1ed-67b1-4e59-a6a1-7bf8d0fb9762: 117-117. Trimming will be skipped.
[2024-10-26 14:43:21.296] [debug] Invalid trim interval for read id 40d2e852-4b32-46e5-8de0-238af8076f28: 118-113. Trimming will be skipped.
[2024-10-26 14:44:42.855] [debug] Invalid trim interval for read id c642511c-7b9f-477c-82ba-7df1b07bc42c: 115-112. Trimming will be skipped.
...