Yang990-sys commented 1 month ago

Hello,

I am using Dorado v0.5.3 for RNA004 basecalling , but I frequently encounter issues with progress bars reaching 100% but memory not being released for a long time, lasting from 20 minutes to 5 hours. Is this a normal situation or a bug?

Memory is still occupied: The progress bar reached 100% 5 hours ago :

malton-ont commented 1 month ago

Hi @Yang990-sys,

Can you please fill out the issue template? In particular, the command you are using to run dorado would be very helpful here.

Yang990-sys commented 1 month ago

Issue Report

Please describe the issue:

As above

Steps to reproduce the issue:

copy pod5 and basecall

Run environment:

Dorado version: v0.5.3
Dorado command: soft/dorado-0.5.3-linux-x64/bin/dorado basecaller -r --resume-from ./pre.basecalled.bam --batchsize 3072 --estimate-poly-a -x "cuda:${cuda}" --modified-bases-models /data/user/jiangmian/soft/model/rna004_130bps_sup@v3.0.1_m6A_DRACH@v1 /data/user/jiangmian/soft/model/rna004_130bps_sup@v3.0.1 ./pod5 > basecalled.bam
Operating system:
Hardware (CPUs, Memory, GPUs): 4 NVIDIA A800 80GB PCIe
Source data type : pod5
Source data location : NFS
Details about data: RNA004, about 30G fastq every time;
Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue): I don't know if small-scale data will encounter such problems, but I killed the program that ran to 100% pause and ran it again (theoretically, there is very little data that needs to continue running), and it will still pause at 100% for a long time.
Logs

———————————————— I guess it's because I set the batch size too large? The default batch size of the program can only use 85% of GPU memory. In order to speed up progress and not waste resources, the batch size is intentionally set to be too large, and the program will prompt to be too large, adjusting it to a value that can fill up memory without crashing. I don't know if this operation is meaningful and if it will cause the above bug?

malton-ont commented 1 month ago

Hi @Yang990-sys,

The default batch size is determined by performing a timing sweep of different batch sizes to determine the most efficient value - this may not be the same as the value that uses the maximum GPU memory (or we'd just have picked the max value as the default!). Having said that, I don't see how that would cause a long delay in shutdown, except that dorado needs to deallocate the memory at the end.

For a 13 hour run, that's a lot of data - 100% may not be exactly 100% and more processing may still be occurring. Are you able to run a process monitor such as htop? Additional processing such as the poly-a estimation occurs on the CPU, and so will not show as GPU activity.

Yang990-sys commented 1 month ago

You are right, even if the GPU usage is 0, there are still some threads running on the CPU; Thank you for your clarification!

nanoporetech / dorado

Basecall process 100% but memory is still occupied for a long time #795

Issue Report

Please describe the issue:

Steps to reproduce the issue:

Run environment:

Logs