rotary-genomics / rotary

Assembly/annotation workflow for Nanopore-based microbial genome data containing circular DNA elements
BSD 3-Clause "New" or "Revised" License

Better parallelization for medaka #131

Closed jmtsuji closed 3 months ago

jmtsuji commented 5 months ago

A possible way to speed up rotary: according to the medaka GitHub repo, the consensus step of medaka is effectively capped at 2 threads. It might make sense for us to implement the more advanced usage of medaka described there to get better parallelization. Basically, we can use the --regions flag of medaka consensus to run different regions of the input BAM file (e.g., different contigs, or 1 Mb chunks of contigs) through the consensus step in parallel.

If we wanted to add this speed-up to rotary, we could either implement it directly in snakemake or write a Python script that runs medaka (including the consensus step) in parallel.
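For the Python script option, a minimal sketch might look like the following. It is not rotary's implementation. The helper names (`chunk_regions`, `consensus_command`, `run_parallel`), the output file naming, and the `contig:start-end` coordinate convention are all assumptions for illustration, and the convention in particular should be checked against the medaka documentation. Only `medaka consensus`, `--threads`, and `--regions` come from the discussion above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def chunk_regions(contig_lengths, chunk_size=1_000_000):
    """Split each contig into region strings of at most chunk_size bases.

    The 'name:start-end' format is an assumption; verify the coordinate
    convention medaka expects before using this for real data.
    """
    regions = []
    for name, length in contig_lengths.items():
        for start in range(0, length, chunk_size):
            end = min(start + chunk_size, length)
            regions.append(f"{name}:{start}-{end}")
    return regions


def consensus_command(bam, out_hdf, region, threads=2):
    """Build one 'medaka consensus' call restricted to a single region."""
    return ["medaka", "consensus", bam, out_hdf,
            "--threads", str(threads), "--regions", region]


def run_parallel(bam, regions, out_dir, jobs=4):
    """Run one medaka consensus process per region, 'jobs' at a time."""
    commands = [consensus_command(bam, f"{out_dir}/chunk_{i}.hdf", region)
                for i, region in enumerate(regions)]
    # Threads are enough here because each worker only waits on a subprocess.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))
    return [cmd[3] for cmd in commands]  # paths of the per-region HDF outputs
```

The per-region outputs would still need to be combined in a later step, and chunks that split a contig may need extra care at their boundaries, which is one reason whole-contig regions are the simpler starting point.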

I don't think this is a high priority, but I wanted to post it here while it is on my mind.

LeeBergstrand commented 5 months ago

@jmtsuji We should profile the pipeline first to figure out the slowest steps and target those first. Medaka is one of the slower steps, but I don't think it's the slowest.
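One way to do that profiling, sketched here under assumptions: if each rule wrote a Snakemake benchmark file (a TSV whose `s` column holds wall-clock seconds), the rules could be ranked by total runtime. Whether rotary's rules declare benchmarks is not stated in this thread, and the function name and the rule names in the usage example are hypothetical.

```python
import csv
import io


def slowest_rules(benchmarks, top=3):
    """Rank rules by total wall-clock seconds.

    benchmarks maps a rule name to the text of its benchmark TSV.
    Each row of a benchmark file is one run, so rows are summed.
    """
    totals = {}
    for rule, text in benchmarks.items():
        rows = csv.DictReader(io.StringIO(text), delimiter="\t")
        totals[rule] = sum(float(row["s"]) for row in rows)
    # Largest total runtime first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top]


# Hypothetical benchmark contents for two rules.
example = {
    "polish_medaka": "s\th:m:s\n120.5\t0:02:00\n",
    "assembly_flye": "s\th:m:s\n900.0\t0:15:00\n",
}
ranking = slowest_rules(example)
```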

jmtsuji commented 5 months ago

Yes, profiling the pipeline to figure out which steps are the slowest sounds good. Let's table this for now.

LeeBergstrand commented 3 months ago

@jmtsuji We now have polishing by contig in https://github.com/rotary-genomics/rotary/pull/147

LeeBergstrand commented 3 months ago

Addressed in https://github.com/rotary-genomics/rotary/pull/147

If we run into speed issues, we can look into sub-contig polishing, but that might be an over-optimization.