iqtree / iqtree2

NEW location of IQ-TREE software for efficient phylogenomic software by maximum likelihood http://www.iqtree.org
GNU General Public License v2.0

[Feature Request] Add support for the `--link-exchange-rates` option in the MAST model #285

Open StefanFlaumberg opened 1 month ago

StefanFlaumberg commented 1 month ago

Dear IQ-Tree team,

In a recent paper you showed that re-estimating the substitution matrix under a profile mixture model on a database of relevant sequences (resulting in a GTRpmix matrix) may improve phylogenetic reconstruction accuracy. However, this re-estimation itself needs a guide tree, which creates a circularity problem: one wants to re-estimate the matrix in order to improve reconstruction of the very tree that serves as the guide tree. In short, the true topology of the guide tree is usually unknown. Fortunately, in practice we usually know the general topology of the species tree and are unsure about only a few bipartitions in it. This suggests an elegant solution: use the tree-mixture model (MAST) with equal tree weights during GTRpmix matrix estimation to express our partial knowledge of the guide tree topology.

Currently MAST works well with frequency profile mixtures, but it cannot link the GTR20 matrix parameters across the frequency profiles. Including the `--link-exchange-rates` option causes a segmentation fault, like this:

Estimate model parameters (epsilon = 0.99000)
1. Initial log-likelihood: -10878.02429
ERROR: STACK TRACE FOR DEBUGGING:
ERROR:
ERROR: *** IQ-TREE CRASHES WITH SIGNAL SEGMENTATION FAULT
ERROR: *** For bug report please send to developers:
ERROR: ***    Log file: ./aln_iqtree_gtrpmix.log
ERROR: ***    Alignment files (if possible)
803984 Segmentation fault      iqtree2 -seed 123 -nt 3 -mem 3G -s ./aln.fasta -m "GTR20+C10+T[x,x]" --gtr20-model "LG" --link-exchange-rates -mwopt -te ./trees.nwk -me 0.99 -pre ./aln_iqtree_gtrpmix

Could you please implement the `--link-exchange-rates` option in the MAST model so that this approach works? Thank you!

Best, Stefan

thomaskf commented 1 month ago

@StefanFlaumberg Thanks for the suggestion! This is a good idea. We are currently busy with various projects, but I will consider doing so, perhaps in the coming few weeks/months.

roblanf commented 1 month ago

Hi Stefan,

Related to this, we are working on a different solution to this problem. I'm not totally convinced that MAST is the right way to go here - I like the idea in principle (as I like all ideas for making all the different avenues of IQ-TREE work together), but the problem is that orthogonal mixture classes are multiplicative. So, if you have e.g. 5 MAST trees (i.e. tree classes), 60 profiles (i.e. frequency classes), and e.g. a +R4 model (i.e. 4 rate classes), then every site has 5 × 60 × 4 = 1200 likelihoods to calculate, and any estimation will need 1200 times the RAM of estimating a single likelihood per site.

Because of this, anything we can do to rein in the number of classes is useful. One way is to assume a tree.
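To make the multiplication concrete, here is a minimal Python sketch of the class count for the example in this comment. The function name `mixture_class_count` is made up for illustration and is not part of IQ-TREE; the sketch only assumes that the mixture dimensions are fully crossed, as described.

```python
from math import prod


def mixture_class_count(*class_counts: int) -> int:
    """Per-site likelihood terms when independent mixture dimensions are crossed.

    Hypothetical helper for illustration only; not an IQ-TREE function.
    """
    return prod(class_counts)


# Numbers taken from the example: 5 tree classes, 60 frequency profiles, 4 rate classes.
n_trees, n_profiles, n_rates = 5, 60, 4
total = mixture_class_count(n_trees, n_profiles, n_rates)
print(total)  # 1200

# Fixing the tree (a single tree class) divides the count by the number of trees.
single_tree_total = mixture_class_count(1, n_profiles, n_rates)
print(single_tree_total)  # 240
```

Dropping any one dimension to a single class divides the total by that dimension's size, which is why assuming a tree helps.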

So, another solution to the circular problem is to do what is internal to phylogenetics programs anyway, and:

  1. Infer a tree
  2. Infer a new model
  3. Go to 1, until convergence
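
The three steps can be sketched as a generic loop. This is only an outline under stated assumptions: `infer_tree`, `infer_model`, and `has_converged` are hypothetical callables that the user would supply (for instance, wrappers around their own IQ-TREE runs); none of them is an IQ-TREE API, and the toy stand-ins at the bottom exist only so the sketch runs.

```python
from __future__ import annotations

from typing import Callable, TypeVar

Tree = TypeVar("Tree")
Model = TypeVar("Model")


def alternate_until_converged(
    infer_tree: Callable[[Model], Tree],          # step 1 (hypothetical, user-supplied)
    infer_model: Callable[[Tree], Model],         # step 2 (hypothetical, user-supplied)
    has_converged: Callable[[Model, Model], bool],
    initial_model: Model,
    max_iterations: int = 10,
) -> tuple[Tree, Model, int]:
    """Alternate tree and model inference until the model stops changing."""
    model = initial_model
    for iteration in range(1, max_iterations + 1):
        tree = infer_tree(model)        # 1. Infer a tree under the current model
        new_model = infer_model(tree)   # 2. Infer a new model on that tree
        if has_converged(model, new_model):
            return tree, new_model, iteration
        model = new_model               # 3. Go to 1
    raise RuntimeError(f"no convergence after {max_iterations} iterations")


# Toy stand-ins: the "tree" and "model" are plain numbers, chosen only so the
# loop has a fixed point (model == 2.0). They carry no phylogenetic meaning.
tree, model, n_iter = alternate_until_converged(
    infer_tree=lambda m: m,
    infer_model=lambda t: 0.5 * t + 1.0,
    has_converged=lambda old, new: abs(old - new) < 1e-3,
    initial_model=0.0,
    max_iterations=50,
)
print(n_iter, round(model, 3))
```

Passing the steps in as callables keeps the loop independent of how each step is actually run.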

W.r.t. convergence, you could look at the correlation of the Q matrix from one iteration to the next. Le and Gascuel did that for the LG model, and we copied them for the QMaker paper (I think we required the correlation to be >0.999). We have been using the same approach for lots of estimates of Q matrices, and in my experience the process almost never goes beyond 2 iterations (even if the tree changes a decent amount after the first iteration), suggesting that in most cases the tree is not too important for estimating the Q matrix.
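
A minimal sketch of such a convergence check in plain Python follows. It assumes that the Pearson correlation is computed over the off-diagonal entries of two successive Q matrices; which entries are compared is an assumption here, and the function names are made up for illustration. The 0.999 threshold is the value recalled above. The 3×3 matrices are toy values, not real rate matrices.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def off_diagonal(q):
    """Flatten the off-diagonal entries of a square matrix (list of rows)."""
    size = len(q)
    return [q[i][j] for i in range(size) for j in range(size) if i != j]


def q_converged(q_old, q_new, threshold=0.999):
    """True if successive Q matrices correlate above the threshold (assumed criterion)."""
    return pearson(off_diagonal(q_old), off_diagonal(q_new)) > threshold


# Toy matrices for illustration only.
q_a = [[0, 1, 2], [3, 0, 4], [5, 6, 0]]
q_b = [[0, 6, 5], [4, 0, 3], [2, 1, 0]]
print(q_converged(q_a, q_a))  # identical matrices
print(q_converged(q_a, q_b))  # reversed entries
```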

I hope some of that helps.

Rob