nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Adding computational flexibility to `dorado correct` #846

Open sivico26 opened 1 month ago

sivico26 commented 1 month ago

Hello,

I am very excited to see herro integrated as dorado correct. A huge congrats on the usage simplification too.

I was in the middle of running herro with my data when dorado correct was released. I am done with the AvA mapping, and now I am struggling to allocate the GPU resources for the inference (but that is another story).

As you well know, the computational demands of `dorado correct` are very high. HPC clusters may have a few nodes with all the required specs (many CPUs, high memory, capable GPUs with lots of GPU memory), but those nodes are in high demand, creating long queues for the many users who want to use them.

I was thinking that some computational flexibility could be gained if we could decouple the correction into the two stages herro used: AvA mapping and inference. This helps because the two have very different spec requirements: for AvA I only need a lot of threads for a long time (but almost no RAM and no GPU), while for inference I still need the big machine, though with a more moderate number of CPUs and, importantly, for a shorter time.
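For reference, the decoupled herro-style workflow looks roughly like this. This is only a sketch: the minimap2 `ava-ont` preset is real, but the `herro inference` flag names below are illustrative and may not match the actual herro CLI (which expects its own preprocessed alignment batches rather than a raw PAF).

```shell
# Stage 1: all-vs-all overlap on a CPU-only node (many threads, modest RAM, no GPU).
# 'ava-ont' is minimap2's all-vs-all preset for ONT reads.
minimap2 -x ava-ont -t 64 reads.fastq reads.fastq > overlaps.paf

# Stage 2: GPU inference on the big node, consuming the precomputed overlaps.
# Flag names here are hypothetical placeholders, not herro's documented CLI.
herro inference --model model.pt --reads reads.fastq --alignments overlaps.paf > corrected.fasta
```

The point of the split is that stage 1 can queue on cheap, plentiful CPU nodes, while the scarce GPU node is only occupied for stage 2.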

So do you think this is possible? Maybe the easiest way to achieve it is to let dorado correct accept the AvA output as input, although then you would have to validate that the supplied AvA input is appropriate, which can be tricky. This also assumes the steps can still be decoupled, which should be doable unless you streamlined both stages together for efficiency (I suspect this is what you did).

So I was wondering if you could walk us through what it took to make herro production-ready, what your plans for dorado correct are, and what you think of this proposal.

Thanks in advance, and keep up the good work.

tijyojwad commented 1 month ago

Hi @sivico26 - thanks for using dorado correct and for providing feedback! Since the AvA stage is the bottleneck, I agree that splitting up the stages allows better resource management (even if it's not the best user experience). While we plan to keep addressing the alignment bottleneck and finding ways to improve its speed, I also plan to push a feature to run AvA and inference separately. For now I'm thinking of going the same route as herro: provide options for dorado correct to save the alignments, and then run inference from those intermediate alignments. What do you think?
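From the user side, such a split might look something like the sketch below. The flag names are purely hypothetical, since at this point the feature is only planned and no interface has been announced:

```shell
# Stage 1 (CPU node): run only the all-vs-all alignment and save the overlaps.
# '--to-paf' is a hypothetical flag name used for illustration.
dorado correct reads.fastq --to-paf > overlaps.paf

# Stage 2 (GPU node): skip alignment and run inference from the saved overlaps.
# '--from-paf' is likewise hypothetical.
dorado correct reads.fastq --from-paf overlaps.paf > corrected.fasta
```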

sivico26 commented 1 month ago

That sounds fantastic @tijyojwad! Indeed, I think the feature you have in mind is the way to go, and I am happy to hear it is in the plans.

In my particular case, I do not think I will be able to wait for the feature release. But I am sure others (including my future self) will enjoy it and put it to good use.