nextstrain / ncov

Nextstrain build for novel coronavirus SARS-CoV-2
https://nextstrain.org/ncov
MIT License
1.35k stars 403 forks source link

Set recommended minimum branch length for IQ-TREE #762

Open huddlej opened 2 years ago

huddlej commented 2 years ago

Context

As part of a separate issue related to IQ-TREE and this workflow, we learned that the minimum branch length for IQ-TREE with SARS-CoV-2 data should be much lower—-blmin 0.0000000000001 is suggested—than its default value.

Rob Lanfear says:

There are a lot of true zero length branches in most analyses. The default in IQ-TREE is to set -blmin to 0.1/alignment_length, which is about 0.00003, and that can have a pretty large effect on the likelihoods if the truth is zero.

Description

We should modify the default tree args in defaults/parameters.yaml to read:

tree:
  tree-builder-args: "'-ninit 10 -n 4 -blmin 0.0000000000001'"

We might also want to test the run time and topology of the corresponding output compared to a tree built from the same data with a different min branch length parameter.

jameshadfield commented 2 years ago

This needs testing - I ran this on a dataset today using tree-builder-args: "'-ninit 100 -n 40 -blmin 0.0000000000001'" and it failed (leaving out -blmin works, as expected).

Command output was:
  OMP: Info #270: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
  ERROR: Numerical underflow (lh-derivative). Run again with the safe likelihood kernel via `-safe` option

ERROR: TREE BUILDING FAILED
huddlej commented 2 years ago

Thanks for trying this out, @jameshadfield! I ran into the same issue with a Nebraska-focused tree (N=~1000 strains). Adding -safe initially worked as expected, although IQ-TREE printed a bunch of warnings like:

WARNING: Numerical underflow for lh-derivative

But when I ran IQ-TREE again on the same data with the same parameters, it errored out like this:

ERROR: Shell exited from fatal signal SIGABRT when running: iqtree2 -ninit 2 -n 2 -me 0.05 -nt 4 -s results/Nebraska/aligned-delim.fasta -m GTR -ninit 10 -n 4 -blmin 0.0000000000001 -safe > results/Nebraska/aligned-delim.iqtree.log
Command output was:
  OMP: Info #271: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
  OMP: Info #271: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
  OMP: Info #271: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
  ERROR: phylokernelnew.h:3268: double PhyloTree::computeLikelihoodFromBufferSIMD() [VectorClass = Vec4d, nstates = 4, FMA = true, SITE_MODEL = false]: Assertion `std::isfinite(tree_lh) && "Numerical underflow for lh-from-buffer"' failed.
  ERROR: STACK TRACE FOR DEBUGGING:
  ERROR: 2   _sigtramp()
  ERROR: 5   double PhyloTree::computeLikelihoodFromBufferSIMD<Vec4d, 4, true, false>()
  ERROR: 6   PhyloTree::getBestNNIForBran(PhyloNode*, PhyloNode*, NNIMove*)
  ERROR: 7   IQTree::evaluateNNIs(std::__1::map<int, std::__1::pair<Node*, Node*>, std::__1::less<int>, std::__1::allocator<std::__1::pair<int const, std::__1::pair<Node*, Node*> > > >&, std::__1::vector<NNIMove, std::__1::allocator<NNIMove> >&)
  ERROR: 8   IQTree::optimizeNNI(bool)
  ERROR: 9   IQTree::doNNISearch(bool)
  ERROR: 10   IQTree::doTreeSearch()
  ERROR: 11   runTreeReconstruction(Params&, IQTree*&)
  ERROR: 12   runPhyloAnalysis(Params&, Checkpoint*)
  ERROR:
  ERROR: *** IQ-TREE CRASHES WITH SIGNAL ABORTED
  ERROR: *** For bug report please send to developers:
  ERROR: ***    Log file: results/Nebraska/aligned-delim.fasta.log
  ERROR: ***    Alignment files (if possible)
  /bin/bash: line 1:  1825 Abort trap: 6           iqtree2 -ninit 2 -n 2 -me 0.05 -nt 4 -s results/Nebraska/aligned-delim.fasta -m GTR -ninit 10 -n 4 -blmin 0.0000000000001 -safe > results/Nebraska/aligned-delim.iqtree.log

There's clearly some stochastic behavior with that recommended value. I tried reducing the precision to 0.000000001 (from 13 to 9 decimal places) and that worked much more consistently. Interestingly, though, the run-time is widely variable with the same data; some runs take 35 seconds and others take 5 minutes to get to the same tree!

Here is a tangle-tree of the current params (left) vs. the min-length param above (right):

BA0FC561-9752-4B16-878F-759154D9BC39

The two trees are qualitatively the same. There are a couple of additional mutations in the original tree that don't appear in the min-length tree. Here is a zoomed in view of a specific clade:

FAF94F16-4CCA-4C79-808E-B51D21A69228

I plotted the number of mutations per branch for all nodes in the two trees.

unknown

There isn't much difference between the two trees. Maybe the difference would be more pronounced for much larger trees or if we were looking at the overall tree likelihoods, but I don't know that this change is worth it...

rneher commented 2 years ago

My suggestion here would be to set blmin to 0.003/alignment_length, which is about 0.000001. That should have less effect on underflows etc, but should fix the problem because the largest number of consecutive 0-length branches is like less than 100.

huddlej commented 2 years ago

FWIW, I stumbled on Rob's analysis that led him to suggest this very small minimum branch length. His context (analyzing 40,000+ sequences) and Nextstrain's (analyzing ~4,000 sequences) are pretty different. We might have higher priority issues to address right now than marginally optimizing the tree likelihoods...