matsengrp / gctree

GCtree: phylogenetic inference of genotype-collapsed trees
https://matsengrp.github.io/gctree
GNU General Public License v3.0

Extreme Runtime #124

Open lejecrs opened 6 months ago

lejecrs commented 6 months ago

Hi! I am using GCTree for phylogenetic inference on about 1000 sequences, and it is very slow. Do you have any suggestions for speeding it up?

willdumm commented 6 months ago

Which part of the pipeline is slow? With that many sequences, I would expect the dnapars command to take quite a while, especially if there are many parsimony trees. How many trees are reported in the output file (named outfile) that dnapars produces? If there are many, then the gctree infer command will likely also take a while, but it's difficult to be sure.
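One rough way to check the tree count is to scan the dnapars outfile for its summary line. The Python sketch below assumes the PHYLIP wording "N trees in all found" (or "One most parsimonious tree found" when there is a single tree); that wording may differ between PHYLIP versions, so treat the patterns as assumptions. The function names are hypothetical and not part of gctree.

```python
import re
from pathlib import Path
from typing import Optional


def count_dnapars_trees(outfile_text: str) -> Optional[int]:
    """Return the number of trees reported in dnapars output text.

    Assumption: dnapars prints either "<N> trees in all found" or
    "One most parsimonious tree found". Returns None if neither
    summary line is present (for example, if the run has not finished).
    """
    match = re.search(r"(\d+)\s+trees in all found", outfile_text)
    if match:
        return int(match.group(1))
    if "One most parsimonious tree found" in outfile_text:
        return 1
    return None


def count_trees_in_file(path: str = "outfile") -> Optional[int]:
    """Read a dnapars outfile from disk and count the reported trees."""
    return count_dnapars_trees(Path(path).read_text())
```

Calling `count_trees_in_file("outfile")` in the directory where dnapars ran would give the count, or `None` if no summary line is found.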

I don't really have any suggestions to make inference faster, although there's a small chance that using a gctree version before v4.0.0 might work better.

lejecrs commented 6 months ago

Thanks for the reply. Yes, dnapars is the slow step. For the 1000-sequence data set, it has been running on the server for 2 days and still hasn't finished. For 300 sequences on my laptop (MacBook Pro 2020), I timed the whole pipeline, and it takes 15000-24000 seconds (roughly 4-7 hrs) depending on the data set. For the 300-sequence runs, GCTree infers at most 2 trees.

lejecrs commented 6 months ago

I think dnapars is the bottleneck: I timed gctree infer on its own and it finished in a few seconds.
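For anyone who wants to reproduce this kind of per-stage timing, here is a minimal Python sketch. The helper `time_stages` is hypothetical, and the command lines in `EXAMPLE_STAGES` are assumptions (file names such as seqs.fasta and abundances.csv are placeholders); check them against the gctree documentation for your version before running.

```python
import shlex
import subprocess
import sys
import time
from typing import Dict, List, Tuple


def time_stages(stages: List[Tuple[str, str]]) -> Dict[str, float]:
    """Run each (name, shell command) pair in order and return seconds per stage.

    shell=True is used so that redirections like "<" and ">" work.
    check=True stops at the first failing stage.
    """
    timings: Dict[str, float] = {}
    for name, command in stages:
        start = time.perf_counter()
        subprocess.run(command, shell=True, check=True)
        timings[name] = time.perf_counter() - start
    return timings


# Assumed command lines for a gctree run; adjust paths and options as needed.
EXAMPLE_STAGES = [
    ("deduplicate", "deduplicate seqs.fasta --root root "
                    "--abundance_file abundances.csv --idmapfile idmap.txt "
                    "> deduplicated.phylip"),
    ("mkconfig", "mkconfig deduplicated.phylip dnapars > dnapars.cfg"),
    ("dnapars", "dnapars < dnapars.cfg > dnapars.log"),
    ("gctree infer", "gctree infer outfile abundances.csv"),
]

if __name__ == "__main__":
    # Harmless demo: time a no-op Python process instead of the real pipeline.
    demo = [("noop", f"{shlex.quote(sys.executable)} -c pass")]
    for stage, seconds in time_stages(demo).items():
        print(f"{stage}: {seconds:.2f} s")
```

Passing `EXAMPLE_STAGES` to `time_stages` would report how long each step takes, which makes it easy to confirm whether dnapars dominates the total.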

willdumm commented 6 months ago

The only suggestion I have is that you could do a less thorough tree search with dnapars by providing the --quick argument to the mkconfig command. Of course, the quality of the final inferred trees may decrease. Beyond this, I have no recommendation; phylogenetic inference on thousands of sequences tends to be quite slow. It's possible that iqtree will give you a tree in a more reasonable amount of time than dnapars does, in which case you could consider using that tool instead of gctree.

lejecrs commented 6 months ago

Thank you! Is there any way to distribute the computation? For example, could we manually adjust the number of threads used by GCTree?