rotary-genomics / rotary

Assembly/annotation workflow for Nanopore-based microbial genome data containing circular DNA elements
BSD 3-Clause "New" or "Revised" License
2 stars 1 forks source link

Optimize medaka parallelization #147

Closed LeeBergstrand closed 3 months ago

LeeBergstrand commented 3 months ago

@jmtsuji This is a bodge that partially addresses https://github.com/rotary-genomics/rotary/issues/131. After some testing it only makes the code 1% faster (code goes to 2 threads for the longest contig in the end). If you had a genome with lots of big contigs (multiple chromosomes) or multiple genomes then there would be greater speed ups.

@jmtsuji This code works. Can you review it to make sure I'm not making any mistakes in the overall logic?

Right now I am work on another branch https://github.com/rotary-genomics/rotary/tree/optimize_medaka_parallelization_by_contig to make it so it polishes each contig in a separate rule instanced spawned by a checkpoint reevaluation of the DAG after it figures out how what contigs were produced by the assembly. This is cleaner and more scalable (the schedular can interweave polishing across contigs and organisms if each polishing rule is set up as taking two threads) but a little harder to right code for.


optimize_medaka_parallelization_by_contig has been merged in.

The code now uses a checkpoint and aggregating function to dynamically run medaka polishing by contig.

This should provide significant speed-ups when running multiple genomes or large chromosomes simultaneously.

LeeBergstrand commented 3 months ago

@jmtsuji Added checkpoint-based code for running medaka polish per contig.

LeeBergstrand commented 3 months ago

@jmtsuji Update code to use a text file containing the names of the contigs.

LeeBergstrand commented 3 months ago

@jmtsuji, https://github.com/rotary-genomics/rotary/issues/131 should be fully addressed. I'm looking forward to your review. I haven't benchmarked it yet, but it feels like Medaka now goes much faster with input sample multiple genomes.

LeeBergstrand commented 3 months ago

@jmtsuji I also wrote better documentation and cleaned up the function aggregate_contigs in circularize.smk. You might need to document it to a similar level to aggregate_medaka_polished_contigs in polish.smk. Thoughts?

LeeBergstrand commented 3 months ago

@LeeBergstrand Thanks very much for adding this enhancement to parallelization! The overall logic looks good to me, although I did not test the code to see if it works in practice (it sounds like you've tested it successfully?). I wonder if some parts of the code are leftover from the intermediate version when GNU Parallel was used, so I flagged a couple things that seemed like they might be extraneous. Exciting that it sounds like you've seen good speedups with multiple genomes. This should also be a great addition for doing metagenome assembly, like we chatted about. Let me know your feedback about my comments, and then I think we should be good to merge!

I have tested this successfully on both single and multiple samples.