nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Dorado basecaller hangs at 99% completion #575

Open nglaszik opened 10 months ago

nglaszik commented 10 months ago

As the title says, running dorado basecaller (0.5.1) on pod5 files (multiple or single) hangs at 99% completion. The time estimate on the progress bar stops updating. No errors are output, and running in verbose mode offers no additional information. The process can be terminated, and examining the resulting bam file with dorado summary & nanopolish produces normal-looking results. The bam file doesn't have an EOF block though, so it's unclear whether more data still needs to be written. The input pod5s were converted with "pod5 convert fast5" from fast5s created by MinKNOW.
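For reference, the conversion was done with something along these lines (paths are placeholders, and exact flags may differ between pod5 versions):

pod5 convert fast5 /home/nanopore/data/231208_BrdU500_PGvM_2/fast5/*.fast5 --output /home/nanopore/data/231208_BrdU500_PGvM_2/output.pod5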

It seems to only happen for particular pod5 files. Some files get to 100% completion. Hanging is perhaps more likely for larger files.

System: Ubuntu 20.04, NVIDIA RTX3090, Driver 470.223.02, CUDA 11.4

Edit: dorado 0.4.1 basecaller doesn't have this issue.

tijyojwad commented 10 months ago

Hi @nglaszik - can you post the command you're running? Are you able to check your GPU utilization (using nvidia-smi or nvtop) while the run is stuck at 99%? Is it showing anything >0%?
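For example, polling once a second should be enough to tell whether the GPU is still doing any work (just a suggestion, any monitoring tool is fine):

nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1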

HalfPhoton commented 9 months ago

@nglaszik - are there any updates on this issue?

nglaszik commented 8 months ago

Hi there, sorry for just now getting back to this. I was able to recreate this error with an updated NVIDIA driver v550.54.14. The command I run is the following:

/home/dorado-0.5.1-linux-x64/bin/dorado basecaller --device cuda:0 /home/dorado_models/dna_r10.4.1_e8.2_400bps_hac@v4.2.0 /home/nanopore/data/231208_BrdU500_PGvM_2/output.pod5 > /home/nanopore/data/231208_BrdU500_PGvM_2/calls_0.5.1.bam

I did notice that different models produce different failures, e.g. the v4.2 5mCG_5hmCG model hangs after only a few seconds. Maybe it's a model incompatibility I'm overlooking? I also forgot to mention that the data comes from an R10.4 flow cell, if that's relevant. I'm now getting back to this project, so I can test some other combinations of models and dorado versions, since I think the v4.3 models and dorado 0.5.3 weren't available when I was trying earlier.

I've just continued using the 0.4.1 basecaller for now.

Edit: I'll look at the GPU utilization as well.

nglaszik commented 8 months ago

Hi there,

Confirmed that the 0.5.3 basecaller still hangs at 99% completion with the v4.3 hac model, producing a .bam file without an EOF block.
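The missing EOF was confirmed with samtools, roughly (the file name is a placeholder for the run's output):

samtools quickcheck -v calls_0.5.3.bam   # prints the file name if the EOF block is missing or the header is unreadable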

During the hang, nvidia-smi shows that the process is still running on the GPU and using the same amount of memory as before the hang. However, volatile GPU-util is at 0%, so it doesn't appear to be processing anything.

nglaszik commented 8 months ago

Another update: dorado basecaller 0.5.3 can run on smaller pod5 datasets...

If I split the original 46 GB pod5 file into multiple pod5s, the basecaller still hangs at 99% completion.

However, if I choose a subset (3 pod5s of 4000 reads each) to run the basecaller on, it runs to completion.

Interestingly, whether modified-base calling succeeds also seems to depend on the input file sizes: 5mCG_5hmCG runs successfully on a single pod5 of 4000 reads, whereas 5mC_5hmC freezes somewhere in the middle. However, 5mC_5hmC runs to completion on a pod5 of 100 reads.
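For anyone wanting to reproduce the subsetting, something along these lines works (file names are placeholders, and I'm assuming the pod5 view/filter subcommands; the split can be done however you like):

pod5 view output.pod5 | awk 'NR > 1 {print $1}' | head -n 4000 > subset_ids.txt   # read_id is the first column of the read table
pod5 filter output.pod5 --ids subset_ids.txt --output subset_4000.pod5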

Sounds like a memory issue, perhaps related to optimization for other GPUs? It's running on an NVIDIA RTX 3090 with 24 GB of memory, far below the 40 or 80 GB on an A100.

nglaszik commented 8 months ago

@tijyojwad - sorry to bug you, but any insight into this? Especially the last post, where dorado runs with smaller input pod5s but not with large ones?

tijyojwad commented 8 months ago

Hi @nglaszik - this is an odd situation and I don't have any obvious solution yet. It feels like it could be related to one or more specific offending reads...

One suggestion is to fetch the read ids from the BAM of the hung run (remember to collect the read ids in the pi:Z tag as well, for split reads) and compare them against the read ids in the pod5. Whichever pod5 read id (or ids) is missing is likely causing the issue.
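A rough sketch of that comparison (file names are placeholders; adjust to your paths):

# read ids already written to the truncated BAM, including parent ids stored in the pi:Z tag for split reads
samtools view calls.bam | cut -f1 > bam_ids.txt
samtools view calls.bam | grep -o 'pi:Z:[^[:space:]]*' | cut -d: -f3 >> bam_ids.txt
sort -u bam_ids.txt -o bam_ids.txt

# read ids present in the input pod5 (read_id is the first column of the pod5 view table)
pod5 view output.pod5 | awk 'NR > 1 {print $1}' | sort -u > pod5_ids.txt

# ids in the pod5 that never made it into the BAM - likely the offending reads
comm -23 pod5_ids.txt bam_ids.txt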

nglaszik commented 8 months ago

Sounds good, thank you @tijyojwad I'll try that!

pre-mRNA commented 7 months ago

I'm having the same issue with Dorado 0.5.3 while performing RNA basecalling, using the command:

dorado basecaller sup,m6A_DRACH ./pod5/ --estimate-poly-a > ./unmapped.bam

My basecalling hangs at 100%. Similarly, the command works fine for individual pod5s, and the volatile GPU usage is at 0% during the hang.

In my case, killing the dorado process produces a truncated BAM file, but it seems that all the reads were basecalled.

tijyojwad commented 7 months ago

Thanks for reporting @pre-mRNA - what is the size of your combined dataset?

pre-mRNA commented 7 months ago

2.98M RNA004 reads, across ~700 POD5 files

HalfPhoton commented 1 month ago

Does this issue persist in dorado-0.8.0?

Kind regards, Rich

pre-mRNA commented 1 month ago

Hi,

I still get similar errors with default multi-modification usage, but now it's stable once I specify the chunk/batch size, e.g.:

dorado basecaller sup,pseU,inosine_m6A,m5C "$pod5_dir/" --estimate-poly-a -r -b 416 -c 9216 --models-directory ./bin

For some reason, I also need to specify the model directory manually with the v0.8 update.

Cheers,



HalfPhoton commented 1 month ago

@pre-mRNA - Ok, I'll keep this issue open as the underlying problem isn't resolved.

For some reason, I also need to specify the model directory manually with the v0.8 update.

Models should be downloaded into the cwd and cleaned up if the models-directory isn't set. Are you having trouble?

pre-mRNA commented 1 month ago

Models should be downloaded into the cwd and cleaned up if the models-directory isn't set. Are you having trouble?

My GPU nodes aren't internet-connected, so I need to specify the model directory manually, even when the models are already present in the cwd.
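For anyone in the same situation, staging the models on an internet-connected machine and copying them over works; something along these lines (the model name is just an example, and the download flag may differ between dorado versions - check dorado download --help):

dorado download --model rna004_130bps_sup@v5.0.0 --directory ./bin   # then pass --models-directory ./bin to dorado basecaller on the offline node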

I think Dorado 0.7 automatically searched cwd for models, whereas now the path needs to be set.

Not a big deal.