gtonkinhill / panaroo

An updated pipeline for pangenome investigation
MIT License

Problems with multithreading #235

Closed: thorellk closed this issue 1 year ago

thorellk commented 1 year ago

Hi!

I am running panaroo on a dataset of ~10k rather small bacterial genomes. I am using our HPC, more specifically a 256 GB node with 20 cores. I have set panaroo to use all 20 cores, but it still runs a large part of the analysis using only a few of them (it is now at the "Processing paralogs" step and has been for many days). Is this expected? If so, is there any way of making this part more efficient, or do I just have to wait? Attached you can find the jobstats plot.

Thank you for a really nice piece of software,

Kaisa

[Attached image: jobstats plot, rackham-naiss2023-22-479-kaisa-38361856]

nzmacalasdair commented 1 year ago

Hi Kaisa,

This seems likely to be normal behaviour - panaroo generally tries to make good use of resources provided to it, but some stages of the method (particularly those involving the draft pangenome network) are iterative, and therefore run single-threaded at the moment.

Total runtime depends primarily on the complexity of the draft pangenome graph, which is affected by genome size and number of isolates, as well as other factors. Datasets of > 10K isolates can take a considerable amount of time to run in a single panaroo run.

There are a number of ways to speed this up/make sure the total runtime falls below HPC limits:

  1. The easiest thing to do for large datasets (if you are interested in a core or pan genome alignment) is to separate the pangenome inference step (panaroo) from the gene alignment step: run panaroo without any gene alignment (i.e., omit the -a flag), then run panaroo-msa separately afterwards.
  2. Splitting a large dataset into smaller sets, running panaroo on each of them, and then running panaroo-merge to combine their output should be quicker than running the entire dataset through panaroo at once. The smaller sets can be chosen by descent from a common ancestor (i.e., clustering), which may make them more interesting to analyse or compare on their own, but they can also be random. If you are interested in core or pan genome alignments, you can always run panaroo-msa on the output folder from panaroo-merge.
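
As a rough illustration of option 2 with random subsets, the sketch below partitions a list of annotation files and builds (but does not run) one command per step. The output directory names `subset_N` and `merged` are hypothetical, and the flags are written as I recall them from the panaroo documentation, so please check each tool's `--help` output before relying on them.

```python
import random
import shlex
from typing import List, Sequence


def split_into_subsets(gff_files: Sequence[str], n_subsets: int, seed: int = 0) -> List[List[str]]:
    """Randomly partition the input files into up to n_subsets groups of near-equal size."""
    if n_subsets < 1:
        raise ValueError("n_subsets must be at least 1")
    shuffled = list(gff_files)
    random.Random(seed).shuffle(shuffled)
    # Striding deals files out round-robin, so group sizes differ by at most one.
    groups = [shuffled[i::n_subsets] for i in range(n_subsets)]
    # Drop empty groups (only happens when there are fewer files than subsets).
    return [g for g in groups if g]


def build_commands(subsets: Sequence[Sequence[str]], threads: int = 20,
                   clean_mode: str = "strict") -> List[List[str]]:
    """Return argument lists: one panaroo run per subset, then merge, then alignment."""
    commands = []
    subset_dirs = []
    for idx, files in enumerate(subsets):
        out_dir = f"subset_{idx}"  # hypothetical directory name
        subset_dirs.append(out_dir)
        # No -a flag here: gene alignment is deferred to panaroo-msa.
        commands.append(["panaroo", "-i", *files, "-o", out_dir,
                         "--clean-mode", clean_mode, "-t", str(threads)])
    # Assumed flags: -d takes the per-subset output directories.
    commands.append(["panaroo-merge", "-d", *subset_dirs, "-o", "merged", "-t", str(threads)])
    # Assumed flags: panaroo-msa reads the merged output folder given with -o.
    commands.append(["panaroo-msa", "-o", "merged", "-a", "core", "-t", str(threads)])
    return commands


if __name__ == "__main__":
    example_files = [f"genome_{i}.gff" for i in range(6)]  # placeholder file names
    for cmd in build_commands(split_into_subsets(example_files, 2), threads=4):
        print(shlex.join(cmd))
```

The per-subset commands are independent of each other, so on a cluster each could be submitted as its own job, with the merge and alignment steps run once they have all finished.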

Hopefully this helps! Let us know if anything is unclear or if you run into any problems.

thorellk commented 1 year ago

Hi!

Thank you for your swift reply! I suspected this was the case and I understand that some steps are inherently hard/impossible to parallelize. Then I guess I should just stay patient and kindly ask the HPC support to extend the duration of my SLURM job :)

Thank you for the tips on how to combine different features of panaroo. I think the latter especially would make sense for me, since I am currently "paying" for quite a lot of unused core hours on the cluster. Are there any drawbacks to this approach? What if, for example, two homologous genes get clustered together in one of the subsets but split in the other? How will panaroo-merge deal with that?

nzmacalasdair commented 1 year ago

Depending on cost/computational limits, you may want to consider cancelling the SLURM job and running subsets instead: datasets with high draft pangenome graph complexity can take weeks to finish in a single run. It may be worth examining the complexity of the pre_filt_graph.gml file if you'd like a rough idea of how long it might take.
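
One quick way to get a feel for that file is to count its nodes and edges. The sketch below is a minimal, standard-library-only example that assumes the GML is laid out with each `node [` and `edge [` opener and each `source`/`target` entry on its own line. Node count, edge count and degree are only a crude proxy chosen for illustration; they are not a measure that panaroo itself reports, and they do not translate directly into a runtime estimate.

```python
import re
from collections import Counter
from typing import Dict, Iterable

NODE_START = re.compile(r"^\s*node\s*\[")
EDGE_START = re.compile(r"^\s*edge\s*\[")
ENDPOINT = re.compile(r"^\s*(source|target)\s+(-?\d+)\s*$")


def summarise_gml(lines: Iterable[str]) -> Dict[str, float]:
    """Count nodes and edges in a line-oriented GML file and derive simple degree stats."""
    n_nodes = 0
    n_edges = 0
    degree: Counter = Counter()
    in_edge = False  # only read source/target lines while inside an edge block
    for line in lines:
        if NODE_START.match(line):
            n_nodes += 1
            in_edge = False
        elif EDGE_START.match(line):
            n_edges += 1
            in_edge = True
        elif line.strip() == "]":
            in_edge = False
        elif in_edge:
            match = ENDPOINT.match(line)
            if match:
                degree[match.group(2)] += 1
    return {
        "nodes": n_nodes,
        "edges": n_edges,
        # Each edge contributes to the degree of two endpoints.
        "mean_degree": 2 * n_edges / n_nodes if n_nodes else 0.0,
        "max_degree": max(degree.values(), default=0),
    }
```

It can be applied to the file directly, for example `with open("pre_filt_graph.gml") as fh: print(summarise_gml(fh))`. Comparing these numbers between a subset run and the full dataset gives a rough sense of how much smaller each subset graph is.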

Running panaroo-merge should produce a graph very similar to the output of a single panaroo run. The speed increase comes primarily from 'parallelising' the initial error-correcting steps on the draft pangenome graph: they run on multiple smaller graphs instead of a single large, complex one. The merge process then uses the final graphs from each of the initial runs as starting input and applies clustering methods similar to those of a normal panaroo run to infer the combined pangenome, rather than simply merging the networks themselves. It's not easy to comment on specific examples, but the process has similar user options to control clustering as panaroo.

As for drawbacks, the most significant is probably the additional user input required to create the data subsets and launch multiple panaroo runs. The merge command itself can take some time as well, though typically much less than running the entire dataset through panaroo. Finally, you should see bigger benefits from using panaroo-merge if you are running panaroo with --clean-mode strict, as that will lead to the biggest differences between the draft and final pangenome networks for each subset.