roblanf / sarscov2phylo

Global phylogenies of SARS-CoV-2 sequences
GNU General Public License v3.0

compare using ML tree as starting tree for bootstraps #8

Open roblanf opened 4 years ago

roblanf commented 4 years ago

This is recommended in the FastTree documentation: http://www.microbesonline.org/fasttree/#Gamma20

> If you want to use the traditional bootstrap instead, you can use phylip's SEQBOOT to generate resampled alignments, the -n option to FastTree to analyze all of the resampled alignments with one command, and CompareToBootstrap.pl to compare the original tree to the resampled trees. For alignments with thousands or tens of thousands of sequences, we recommend using the tree for the full alignment as the starting tree for each resampled replicate (the -intree1 option). This "fast global" bootstrap is quite fast and accurate -- for an alignment of 40,000 ABC transporters, 100 fast-global bootstraps took just 20 hours, and the resulting support values were strongly correlated with the traditional bootstrap (r=0.975). (Note -- this analysis was performed with FastTree 1.1, before we implemented maximum-likelihood NNIs.)

A couple of considerations to check before going down this route.

  1. What is the time saving? It may be considerable if we combine this with the OMP version of FastTree to estimate the ML tree, and much smaller without it. E.g. with 101 processors and no OMP version it would save no time at all, since the bootstraps and the ML analysis would all finish at roughly the same time. As it is, it takes me ~3 rounds to complete all the bootstraps (due to memory limitations), and this will get worse as the datasets grow.

  2. Do the bootstrap supports correlate? The FastTree text above suggests they should correlate well, but it's really an empirical question, and should be quantified for these data.
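The wall-clock argument in point 1 can be sketched as a small calculation. This is a rough model, not a measurement: it assumes every replicate costs about the same as the ML tree, that jobs run in rounds limited by how many fit in memory at once, and that the starting-tree bootstraps can only begin once the ML tree is done. The function names and all the numbers in the usage note are made up for illustration.

```shell
#!/usr/bin/env bash
# Rough wall-clock model for the two bootstrap strategies (all names hypothetical).

# Number of rounds needed to run n jobs when only `slots` fit at once (ceiling division).
rounds() {
    local n="$1" slots="$2"
    echo $(( (n + slots - 1) / slots ))
}

# No starting tree: the ML tree and the n bootstraps are n+1 independent jobs of ~t minutes each.
independent_time() {
    local n="$1" slots="$2" t="$3"
    echo $(( $(rounds $(( n + 1 )) "$slots") * t ))
}

# Starting tree: wait t_ml minutes for the ML tree, then run n cheaper bootstraps.
# saving_pct is the per-replicate time saving in percent (integer arithmetic, rounds down).
starttree_time() {
    local n="$1" slots="$2" t="$3" t_ml="$4" saving_pct="$5"
    echo $(( t_ml + $(rounds "$n" "$slots") * t * (100 - saving_pct) / 100 ))
}
```

With hypothetical values of 100 bootstraps and 60 minutes per job, `independent_time 100 101 60` gives 60 and `starttree_time 100 101 60 60 67` gives 79, so with 101 slots the starting tree does not help. With 40 slots the same calls give 180 and 119. A faster ML tree (smaller `t_ml`, e.g. from the OMP build) widens the gap.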

roblanf commented 4 years ago

Using the ML tree as a starting tree, and then running each bootstrap with:

      fasttree -nosupport -nome -nt -fastest -intree "$START_TREE" "${bootpre}0.fa" > "${bootpre}unrooted.tree"

saves 2/3 of the time.

BUT, it also massively increases the TBE values.

Here's before:

[image]

and here's after:

[image]

So this renders the TBE values very difficult to interpret, which is not OK.
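To go beyond eyeballing the two plots, the shift and the correlation asked about earlier could be quantified per branch. A minimal sketch, assuming the two sets of supports for the same branches of the reference tree have already been written to a two-column file. That file format and the name `compare_supports` are assumptions, and extracting the values from the trees is not shown.

```shell
#!/usr/bin/env bash
# Compare paired per-branch supports read from files or stdin.
# Column 1 = support without starting tree, column 2 = with starting tree
# (a hypothetical input format, one branch per line).
# Prints Pearson's r and the mean difference (column 2 minus column 1).
compare_supports() {
    awk '
        NF >= 2 {
            n++
            sx += $1; sy += $2
            sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2
        }
        END {
            if (n < 2) { print "r=NA mean_diff=NA"; exit }
            vx = n * sxx - sx * sx
            vy = n * syy - sy * sy
            # r is undefined when either column has zero variance
            r = (vx > 0 && vy > 0) ? sprintf("%.3f", (n * sxy - sx * sy) / sqrt(vx * vy)) : "NA"
            printf "r=%s mean_diff=%.3f\n", r, (sy - sx) / n
        }
    ' "$@"
}
```

A high r with a large positive mean difference would mean the ranking of branches is preserved but the values are inflated. A low r would mean the starting tree changes which branches look well supported.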

One thing left to try is to keep the minimum evolution steps in the process (i.e. drop -nome). This may work, but it reduces the time saving to 1/3 on these datasets.
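The two variants differ only in whether `-nome` is passed. A minimal sketch of a wrapper that makes that switch explicit, reusing the flags from the command above. The function name `run_bootstrap`, its argument order, and the `FASTTREE` variable are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the per-replicate command.
# FASTTREE can be overridden to point at a different binary.
FASTTREE="${FASTTREE:-fasttree}"

# usage: run_bootstrap START_TREE ALIGNMENT OUT_TREE [keep_me]
# keep_me=yes keeps the minimum-evolution steps by omitting -nome.
run_bootstrap() {
    local start_tree="$1" aln="$2" out="$3" keep_me="${4:-no}"
    if [ "$keep_me" = "yes" ]; then
        "$FASTTREE" -nosupport -nt -fastest -intree "$start_tree" "$aln" > "$out"
    else
        "$FASTTREE" -nosupport -nome -nt -fastest -intree "$start_tree" "$aln" > "$out"
    fi
}
```

For example, `run_bootstrap "$START_TREE" "${bootpre}0.fa" "${bootpre}unrooted.tree" yes` would rerun the replicate with the minimum-evolution steps kept.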

Another thing to try is FastTree's built-in bootstrapping, which I only just discovered.

bqminh commented 4 years ago

Or could it be due to FastTree rather than TBE? Zhou et al. (https://doi.org/10.1093/molbev/msx302) showed that the branch supports increase with FastTree.

roblanf commented 4 years ago

Good point @bqminh. In this case though, both are with FastTree. I should have made that clear. The first one is 100 bootstraps where each bootstrap alignment comes from goalign, with no starting tree. The second one is exactly the same, except that the FastTree analysis is also given the starting tree.