nickjcroucher / gubbins

Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins
http://nickjcroucher.github.io/gubbins/
GNU General Public License v2.0

Multiple outgroups may inflate execution time #357

Closed stitam closed 1 year ago

stitam commented 1 year ago

Hi, at this point this is more a question than an issue.

I would like to build a tree with multiple outgroups. The outgroups come from a diverse set of taxa, some not even in the same genus as the lineage of interest (a single sequence type within a species). I use one of the non-outgroup strains as the reference to construct pseudo-whole genomes with snippy, and then use the longest contigs to build the tree with Gubbins. I noticed that the execution time was heavily inflated compared to a run with no outgroups specified: it took multiple hours when it should have completed in about 30 minutes. I am trying to figure out what is causing this behaviour. So far I have these guesses:

Any ideas what may be causing this problem? Many thanks.

nickjcroucher commented 1 year ago

Using very divergent outgroups will greatly increase run time because of the computation needed to reconstruct the patterns of base substitutions across many polymorphic sites. I would strongly recommend using a single, closely-related outgroup.
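
One possible way to choose a single, closely-related outgroup is to rank the candidates by their mean pairwise distance to the ingroup in the alignment and keep the nearest one. The sketch below is illustrative only and is not part of Gubbins: the helper names (`read_fasta`, `p_distance`, `closest_outgroup`) and the sample names are hypothetical, and it assumes an aligned multi-FASTA in which all sequences have the same length.

```python
from io import StringIO


def read_fasta(handle):
    """Parse an aligned multi-FASTA handle into {name: sequence}."""
    seqs, name = {}, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line.upper())
    return {n: "".join(parts) for n, parts in seqs.items()}


def p_distance(a, b):
    """Proportion of differing sites, skipping gaps and ambiguous bases."""
    valid = set("ACGT")
    compared = diffs = 0
    for x, y in zip(a, b):
        if x in valid and y in valid:
            compared += 1
            diffs += x != y
    # Treat pairs with no comparable sites as infinitely distant.
    return diffs / compared if compared else float("inf")


def closest_outgroup(seqs, candidates):
    """Return the candidate with the smallest mean distance to the ingroup."""
    ingroup = [n for n in seqs if n not in candidates]

    def mean_dist(candidate):
        total = sum(p_distance(seqs[candidate], seqs[i]) for i in ingroup)
        return total / len(ingroup)

    return min(candidates, key=mean_dist)


# Toy alignment with hypothetical sample names: two ingroup strains and
# two candidate outgroups, one close and one divergent.
example = StringIO(
    ">s1\nACGTACGT\n>s2\nACGTACGA\n>near\nACGTACCA\n>far\nTTGTCCCA\n"
)
seqs = read_fasta(example)
print(closest_outgroup(seqs, ["near", "far"]))  # prints "near"
```

The p-distance used here is a crude measure that ignores multiple substitutions at the same site, so it is only meant to separate close candidates from very divergent ones before the analysis is run.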

stitam commented 1 year ago

Many thanks @nickjcroucher for your help, this is a super important detail. I'll go ahead as you suggested.