nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

After --resume-from, the speed of basecalling dropped dramatically!! #665

Closed VahidJavaran closed 6 months ago

VahidJavaran commented 6 months ago

Dear Dorado Development Team,

I hope this message finds you well. I am reaching out for help with an issue I've encountered with the Dorado basecalling software.

I have been using Dorado to basecall Oxford Nanopore sequencing data with the SUP model. My POD5 data is approximately 130 GB in size. Initially, basecalling proceeded at the expected speed on my NVIDIA RTX A4000 GPU. Unfortunately, a power cut abruptly stopped the run. After resolving an NVIDIA driver error by updating the drivers, I resumed basecalling with --resume-from. However, throughput after resuming is notably lower than before the interruption, even though nvidia-smi confirms that Dorado is using the GPU heavily (100% GPU-Util and a temperature of 89°C).

Here are the specifics of my setup and the observed behavior:

NVIDIA Driver Version: 535.161.07
CUDA Version: 12.2
GPU Model: NVIDIA RTX A4000
Observed behavior: initial basecalling speed was as expected, but it slowed down significantly after resuming.

Are there known issues or considerations when resuming a basecalling run after an interruption that could reduce processing speed? I would also appreciate any suggestions for troubleshooting this issue or for optimizing the resume step so that basecalling can proceed at full speed.

tijyojwad commented 6 months ago

Hi @VahidJavaran - thanks for raising the issue!

How large is the BAM file that you're resuming from? What do you see in the stderr after basecalling starts? Do you see the progress bar that reports time elapsed? Or do you see the progress bar that shows the reads are still being resumed from?

Can you also comment on what kind of slowdown you're seeing? e.g. how much slower this is running compared to your earlier run?

VahidJavaran commented 6 months ago

Hi @tijyojwad Thank you for your response. Based on the initial basecalling process, it was expected to complete within 25 hours. However, after resuming the basecalling—having already completed 17 out of the 25 hours—I noticed an issue. The process seemed to stall for almost an hour while attempting to resume from the BAM file, which was 11.7 GB in size at the time. Subsequently, it jumped to 73% completion and remained stuck there for another hour. Fortunately, after this delay, the operation proceeded normally, and the total basecalling time aligned with the initial estimate of 25 hours, despite the need to resume the process. It's worth noting that the final BAM file size was 15.2 GB.

tijyojwad commented 6 months ago

Hi @VahidJavaran - good to hear that you were able to successfully resume your basecall!

> process seemed to stall for almost an hour while attempting to resume from the BAM file

During this step dorado is going through the resume file and determining which reads have been processed already. We typically output a progress bar for this to stderr but if your stderr is piped to a file we skip that. I'll put in a change to also log a simple message indicating that.
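For readers who want a concrete picture of the step described above, here is a minimal sketch of the general idea: scan the partial output once, collect the IDs of reads already written, and skip those reads in the input. This is not Dorado's actual implementation. The function names are hypothetical, and the partial output is modeled as SAM-format text (read ID in the first tab-separated column) so the example runs with the standard library alone; a real resume file is a BAM and would need a BAM reader.

```python
import io


def load_processed_read_ids(sam_lines):
    """Collect read IDs (QNAME, the first tab-separated field) from SAM text lines."""
    processed = set()
    for line in sam_lines:
        # Skip blank lines and header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        processed.add(line.split("\t", 1)[0])
    return processed


def reads_still_to_call(input_read_ids, processed):
    """Return the input read IDs that do not appear in the partial output."""
    return [read_id for read_id in input_read_ids if read_id not in processed]


# Hypothetical stand-in for a partial output file left by an interrupted run.
partial_output = io.StringIO(
    "@HD\tVN:1.6\tSO:unknown\n"
    "read-a\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!\n"
    "read-b\t4\t*\t0\t0\t*\t*\t0\t0\tGGCA\t!!!!\n"
)

processed = load_processed_read_ids(partial_output)
remaining = reads_still_to_call(["read-a", "read-b", "read-c"], processed)
print(remaining)  # ['read-c']
```

The whole partial output has to be read before any new read can be skipped or basecalled, which is why a large resume file shows up as a pause at the start of the run.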

The speed of that loading is mainly determined by IO. If the file is on local storage with high read speed, that time can go down substantially.
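As a rough back-of-the-envelope illustration of the IO point, the sketch below converts a file size and a sustained read speed into a scan time, and the reverse. The 11.7 GB size and the roughly one-hour stall come from this thread; the candidate read speeds are hypothetical examples, not measurements. The estimate ignores decompression and parsing of the BAM, so it should be read as a lower bound on the real time.

```python
def scan_time_minutes(file_size_gb, read_speed_mb_per_s):
    """Minutes needed to read a file once (decimal units: 1 GB = 1000 MB)."""
    if read_speed_mb_per_s <= 0:
        raise ValueError("read speed must be positive")
    return file_size_gb * 1000 / read_speed_mb_per_s / 60


def implied_read_speed_mb_per_s(file_size_gb, minutes):
    """Effective read speed implied by scanning a file in the given time."""
    if minutes <= 0:
        raise ValueError("duration must be positive")
    return file_size_gb * 1000 / (minutes * 60)


resume_file_gb = 11.7  # resume file size reported in this thread
stall_minutes = 60     # "almost an hour", rounded for the estimate

speed = implied_read_speed_mb_per_s(resume_file_gb, stall_minutes)
print(f"Implied effective rate: {speed:.2f} MB/s")

# Hypothetical sustained read speeds, for comparison only.
for candidate in (100, 500, 2000):
    minutes = scan_time_minutes(resume_file_gb, candidate)
    print(f"{candidate} MB/s -> {minutes:.2f} min")
```

An effective rate of only a few MB/s is far below what local disks typically sustain, which fits the suggestion that slow storage, rather than the GPU, dominated the resume step here.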

Is there anything else outstanding or can the issue be closed?

VahidJavaran commented 6 months ago

Thanks for your explanation! There are no further concerns. Dorado worked well for basecalling and methylation calling for me.