jmtsuji closed this issue 3 months ago
@jmtsuji We should profile the pipeline first to figure out the slowest steps and target those first. Medaka is one of the slower steps, but I don't think it's the slowest.
Yes, sounds good to profile the pipeline and figure out what steps are the slowest. Let's table this for now.
@jmtsuji We now have polishing by contig in https://github.com/rotary-genomics/rotary/pull/147
Addressed in https://github.com/rotary-genomics/rotary/pull/147
If we run into speed issues, we can look into sub-contig polishing, but that might be an over-optimization.
A possible way to speed up rotary: according to the medaka GitHub repo, the consensus step of medaka is basically capped at 2 threads. It might make sense for us to implement the more advanced usage of medaka, as shown in the link above, to achieve maximum parallelization. Basically, we can use the `--regions` flag of `medaka consensus` to run different regions of the input BAM file (e.g., different contigs, or 1 Mb chunks of contigs) through the consensus step in parallel.

If we wanted to add this speed-up to rotary, we could either implement it directly in snakemake or write a Python script that runs medaka (including the consensus step) in parallel.
I don't think this is a high priority, but I wanted to post it here while it is on my mind.
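
For reference, here is a rough sketch of what the Python-script option could look like. It is not rotary code: the helper names (`chunk_regions`, `build_consensus_cmd`, `run_parallel`), the file names, and the `MODEL_NAME` placeholder are all hypothetical. It assumes `medaka consensus` takes the BAM and output HDF as positional arguments plus `--model`, `--threads`, and `--regions`, and that region strings use a `contig:start-end` form. The exact coordinate convention and how medaka treats chunk boundaries should be checked against the installed medaka version before relying on this.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def chunk_regions(contig_lengths, chunk_size=1_000_000):
    """Split each contig into region strings covering at most chunk_size bases.

    Assumption: regions are written as 'contig:start-end' with a zero-based
    start and an exclusive end. Verify this against your medaka version.
    """
    regions = []
    for name, length in contig_lengths.items():
        for start in range(0, length, chunk_size):
            end = min(start + chunk_size, length)
            regions.append(f"{name}:{start}-{end}")
    return regions


def build_consensus_cmd(bam, out_hdf, regions, model, threads=2):
    """Build one 'medaka consensus' command restricted to the given regions."""
    return [
        "medaka", "consensus", bam, out_hdf,
        "--model", model,
        "--threads", str(threads),  # consensus reportedly gains little beyond 2
        "--regions", *regions,
    ]


def run_parallel(commands, max_workers=4):
    """Run the commands concurrently; raise if any of them exits non-zero."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(subprocess.run, cmd, check=True) for cmd in commands]
        for future in futures:
            future.result()


if __name__ == "__main__":
    # Hypothetical contig lengths; in practice these would come from the
    # draft assembly's FASTA index or the BAM header.
    lengths = {"contig1": 2_500_000, "contig2": 800_000}
    commands = [
        build_consensus_cmd("calls_to_draft.bam", f"chunk_{i}.hdf", [region], "MODEL_NAME")
        for i, region in enumerate(chunk_regions(lengths))
    ]
    # Print instead of running, so the sketch is safe without medaka installed.
    for cmd in commands:
        print(" ".join(cmd))
```

Swapping the `print` loop for `run_parallel(commands)` would launch the jobs, after which the per-chunk HDF files would still need to be combined (medaka provides a `stitch` subcommand for that). With `threads=2` per job, `max_workers` should be roughly the available cores divided by 2. The same splitting could instead be expressed as one snakemake rule per region, which would let snakemake handle the scheduling.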