nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Load reads from file - takes longer than expected #901

Open phpeters opened 1 week ago

phpeters commented 1 week ago

Issue Report

A subset of the pod5 files from one specific run takes 6x longer to basecall than the other files. This was a PromethION run, and I have not encountered the problem on any other run so far.

Description:

To speed up basecalling even more, I split my pod5 files into chunks and process those chunks in parallel. Usually the "Load reads from file" step takes about 1 hour per pod5 file, but for a subset each file needs ~6 hours to load.
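The chunked setup described above can be sketched roughly as follows. This is a minimal illustration, not the reporter's actual workflow: the round-robin split, the helper names `chunk_round_robin` and `build_commands`, and the `chunk_N` directory layout are all assumptions. Only the `dorado basecaller <model> <data>` invocation and the model name come from the report. The sketch only builds the command lists; it does not launch anything.

```python
MODEL = "dna_r10.4.1_e8.2_400bps_sup@v5.0.0"  # model name taken from the report


def chunk_round_robin(items, n_chunks):
    """Distribute items over at most n_chunks lists, round-robin."""
    chunks = [[] for _ in range(n_chunks)]
    for index, item in enumerate(items):
        chunks[index % n_chunks].append(item)
    # Drop empty chunks when there are fewer items than chunks.
    return [chunk for chunk in chunks if chunk]


def build_commands(chunk_dirs, model=MODEL):
    """Build one 'dorado basecaller' argument list per chunk directory.

    Assumption: each chunk lives in its own directory of pod5 files.
    """
    return [["dorado", "basecaller", model, str(d)] for d in chunk_dirs]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    files = [f"reads_{i}.pod5" for i in range(5)]
    for number, chunk in enumerate(chunk_round_robin(files, 2)):
        print(f"chunk_{number}:", chunk)
    for command in build_commands(["chunk_0", "chunk_1"]):
        print(" ".join(command))
```

Each command list could then be handed to a job scheduler or `subprocess.run`, one per parallel job.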

The pod5 files in question are all ~25-30 Gb in size and, from what I have seen in the intermediate summary files (basecalling is still ongoing), they all have a similar read-length distribution (read-length N50 ~15 kb). It appears, though, that mostly the first 15 files are affected. The dorado logs don't show warnings or errors; it simply takes a lot more time.

Steps to reproduce the issue:

Simply running the command: dorado basecaller dna_r10.4.1_e8.2_400bps_sup@v5.0.0 pod5/

Run environment:

Thanks and all the best! Philipp

HalfPhoton commented 6 days ago

Hi @phpeters,

Loading reads should be very quick, and 1 hr per file is very unusual, let alone 6 hrs.

Seeing as you have poor performance loading 1 file in 1 process and get much worse performance loading many files in many processes on your HPC, this leads me to think that IO on your HPC is the bottleneck.

I suggest that you look into how your HPC / workflow is accessing the data and whether the IO throughput is sufficient to work efficiently.

Kind regards, Rich
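One rough way to check the IO suggestion above is to time a plain sequential read of a fast-loading and a slow-loading pod5 file from the same storage. The sketch below is an assumption-laden illustration, not a dorado or pod5 tool: the function name `read_throughput_mb_s` and the 8 MiB block size are arbitrary choices, and it only measures raw file reads, not pod5 decoding. A file that was read recently may be served from the operating system's page cache, which would inflate the number.

```python
import sys
import time


def read_throughput_mb_s(path, block_size=8 * 1024 * 1024):
    """Read a file sequentially; return (bytes_read, throughput in MiB/s)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 avoids Python-level buffering; the OS page cache still applies.
    with open(path, "rb", buffering=0) as handle:
        while True:
            block = handle.read(block_size)
            if not block:
                break
            total += len(block)
    elapsed = time.perf_counter() - start
    return total, total / (1024 * 1024) / max(elapsed, 1e-9)


if __name__ == "__main__":
    # Usage: python io_check.py fast.pod5 slow.pod5
    for path in sys.argv[1:]:
        size, rate = read_throughput_mb_s(path)
        print(f"{path}: {size} bytes at {rate:.1f} MiB/s")
```

If the two files read at similar rates, raw storage throughput is less likely to explain the difference in loading time.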

phpeters commented 5 days ago

Hej @HalfPhoton ,

Thanks for looking into this! I agree that it should be much quicker, so I checked previous runs, and there loading times were approx. 0.5 - 3 minutes per pod5 file. I also tested the new dorado version and model v5 against one such previous run, and the files loaded quickly. Thus, the basecaller seems to be fine, as does the HPC (all runs are stored on the same partition and are running on the same node right now, yet I get both long and short loading times).

Both datasets were generated with a PromethION: the previous one with MinKNOW 23.11.7 (Bream 7.8.2, Core 5.8.6), the current slow one with MinKNOW 24.02.10 (Bream 7.9.4, Core 5.9.7). Maybe it's the MinKNOW update?

If you have a box to drop, I could send you an example pod5 file. Thanks and all the best! Philipp

iiSeymour commented 5 days ago

@phpeters I'm pretty sure this is not down to the pod5 files and/or the software versions used to generate them. Can you report the GPU used for both runs and the speed in samples/s reported by dorado?

If, for example, the slower run was called on a V100 and its read lengths are longer, versus a run on an A100 with shorter read lengths, this could explain a 6x difference in relative performance. The reported samples/s on a given GPU will account for this (please also tell us if the number of GPUs used by dorado changes, e.g. 2x V100 vs 4x A100).
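The comparison being asked for boils down to simple arithmetic on throughput. The sketch below is only an illustration: the helper names and all the numbers are made up, and dorado reports its own samples/s figure, which should be preferred over anything recomputed by hand.

```python
def samples_per_second(total_samples, wall_seconds):
    """Throughput as signal samples processed per second of wall time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return total_samples / wall_seconds


def relative_slowdown(fast_rate, slow_rate):
    """How many times slower the slow run is than the fast one."""
    return fast_rate / slow_rate


if __name__ == "__main__":
    # Made-up numbers: the same amount of signal processed in 1 h vs 6 h.
    fast = samples_per_second(7.2e10, 3600)
    slow = samples_per_second(7.2e10, 6 * 3600)
    print(f"fast run: {fast:.3g} samples/s")
    print(f"slow run: {slow:.3g} samples/s")
    print(f"slowdown: {relative_slowdown(fast, slow):.1f}x")
```

Because the rate is normalised by the amount of signal, two runs with different read lengths can be compared directly, as long as the GPU model and GPU count are reported alongside it.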