iqtree / iqtree2

NEW location of IQ-TREE software for efficient phylogenomic software by maximum likelihood http://www.iqtree.org
GNU General Public License v2.0

[Feature Request] Add site-specific rate profile support #244

Open StefanFlaumberg opened 2 months ago

StefanFlaumberg commented 2 months ago

Dear IQ-Tree team,

Currently, IQ-Tree2 can infer posterior mean site rate (PMSR) profiles from an alignment and the corresponding tree with branch lengths. So I wonder: why can't these inferred site-specific rates (after normalization by their mean) be used for further tree refinement, somewhat as was done for branch length estimation in the Mayrose et al. 2004 paper? It would be great to have an -rs option for passing a precomputed rate profile to tree inference (analogous to the -fs option for PMSF profiles). Could you please implement such an option?
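To make the "normalization by their mean" step concrete, here is a minimal sketch. It assumes the rates come from a tab-separated table with a header row containing a `Rate` column and optional `#` comment lines, which is how I understand the .rate output to look; the column layout, the function names and the example values are illustrative assumptions, not a specification of IQ-TREE's format.

```python
import io


def read_site_rates(handle):
    """Read per-site rates from a tab-separated table.

    Assumed layout (an assumption, not a format spec): optional lines
    starting with '#', then a header row containing a 'Rate' column,
    then one row per alignment site.
    """
    rate_col = None
    rates = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if rate_col is None:
            rate_col = fields.index("Rate")  # first data line is the header
            continue
        rates.append(float(fields[rate_col]))
    return rates


def normalize_by_mean(rates):
    """Rescale rates so that their arithmetic mean is exactly 1."""
    mean = sum(rates) / len(rates)
    if mean <= 0:
        raise ValueError("mean rate must be positive")
    return [r / mean for r in rates]


# Made-up example values, only to show the call sequence.
example = (
    "# hypothetical rate table\n"
    "Site\tRate\tCat\tC_Rate\n"
    "1\t0.5\t1\t0.4\n"
    "2\t1.5\t3\t1.6\n"
    "3\t2.5\t4\t2.7\n"
)
rates = read_site_rates(io.StringIO(example))
profile = normalize_by_mean(rates)
print(profile)
```

After this rescaling the profile has mean 1, so branch lengths keep their usual interpretation as expected substitutions per site averaged over the alignment.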

I understand that tree inference under a PMSR profile model hasn't been extensively tested yet, but given the success of the PMSF profile approach, a PMSR model can reasonably be expected to be similarly useful.

Best regards, Stefan

bqminh commented 2 months ago

Hi Stefan, This is a very good suggestion, thanks for bringing it up. As you noted, the "PMSR" is already reported in the .rate file if you run with the -wsr option. In principle, one can apply this PMSR to do tree inference in the same way as the PMSF, which is indeed something I thought about and discussed with Ed Susko a few years ago. Moreover, it'd be nice for users to test other ways of obtaining site-specific rates for tree inference. However, there are some caveats:

The last point is actually the main thing that holds me back from pursuing this idea: we would need to allocate a developer to implement it, which in turn needs some funding or a way of publishing it.

That said, there is already a way of doing this PMSR approach (not that efficiently, though). I had some emails about this which I can dig out and reply with later, if you want to try it out.

roblanf commented 2 months ago

I would just throw in here also that PMSF is really just a shortcut way of doing very complex models like C60. Rate models tend to be quite a bit simpler than that for the most part (i.e. not 60 mixture classes). Still, the +R10 and greater models are tricky to optimise, so perhaps a PMSR model might be worth the time saving there.

I think the primary worry for me though is that you have to estimate the rate profile on a tree. I worry as Minh does that this could bias inference on exactly the nodes that matter (i.e. short branches which are hard to resolve). In fact this comment reminds me that we should get to work on testing exactly this potential bias in the PMSF models.

StefanFlaumberg commented 2 months ago

Dear Minh and Rob, Thank you for your comprehensive replies!

You have written a lot about the possible impact that using the PMSR model may have on the running time. The model surely is not going to accelerate tree inference in most use cases (unless compared against a run with +R10, as you mentioned).

However, from my standpoint the main focus here should be on tree reconstruction accuracy. Judging from the original article on the PMSF model, the model, given enough alignment data, doesn't produce biased results when estimated on a reasonably optimal guide tree (like the +C20 tree). In fact, it was shown to produce more accurate results than the +C20 model. By analogy, one could expect the same from the PMSR model, though a paper thoroughly testing PMSR and PMSF for biases, especially as applied to single-protein tree reconstruction, would be highly relevant.

It doesn't seem right to consider site-specific models to be just shortcuts for mixture models. In the logic of the mixture approach, in which one sums, at each site, the weighted likelihoods calculated under different site-homogeneous models, every alignment site gets modeled by poorly fitting models in some of the categories, but the impact of such non-optimal modelling is mitigated by the weighting scheme. In contrast, the site-specific approach, while possibly facing some risk of overfitting, models each alignment site with a model tailored to closely replicate the process governing the evolution of that site. The site-specific approach thus seems much more natural.
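To spell out the contrast, here is a sketch of the two likelihoods in the usual notation. The symbols are assumptions introduced for illustration: $D_i$ is the data at site $i$, $T$ the tree, $M_k$ the $k$-th mixture class with weight $w_k$, stationary frequencies $\pi_k$ and (for a rate mixture) rate $r_k$.

```latex
% Mixture model: every site is evaluated under all K classes.
\ell_{\text{mix}}(T) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} w_k \, L(D_i \mid T, M_k)

% Posterior class weights for site i, computed on a guide tree T_0.
p_{ik} = \frac{w_k \, L(D_i \mid T_0, M_k)}{\sum_{j=1}^{K} w_j \, L(D_i \mid T_0, M_j)}

% Posterior mean site frequency (PMSF) and, by analogy, site rate (PMSR).
\hat{\pi}_i = \sum_{k=1}^{K} p_{ik} \, \pi_k ,
\qquad
\hat{r}_i = \sum_{k=1}^{K} p_{ik} \, r_k

% Site-specific model: every site is evaluated once, under its own profile.
\ell_{\text{site}}(T) = \sum_{i=1}^{n} \log L(D_i \mid T, \hat{\pi}_i, \hat{r}_i)
```

The last line needs one likelihood evaluation per site instead of $K$, which is where the speed-up over the full mixture comes from; the price is that $\hat{\pi}_i$ and $\hat{r}_i$ depend on the guide tree $T_0$.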

In line with the above, I suppose that using a joint site-specific frequency and rate profile model is the best thing one could do to fully model site heterogeneity without extensive overfitting or time consumption. Such a PMSFR profile could be roughly obtained by inferring a PMSF profile on a +C20+R tree in the first step and then inferring a PMSR profile on the same tree under the +PMSF+R model in the second step. I managed to use such a PMSFR profile for tree inference in the RAxML-NG tool by passing the profile as a per-site partition with AA frequencies and branch-length multipliers. However, due to a computational problem with per-site partitions, the approach turned out to be not very efficient (a 20-50-fold slowdown compared to a site-homogeneous model, not counting the profile estimation step in IQ-Tree).
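The two profile-estimation steps could be scripted roughly as follows. This is only a sketch under stated assumptions: the file names, the `LG+C20+R4` model string and the helper names are hypothetical, and although -ft, -fs, -te and -wsr are existing IQ-Tree options, whether this exact combination behaves as intended should be checked against the IQ-Tree documentation. The code only builds and prints the command lines; it does not run anything.

```python
import shlex

# Hypothetical defaults; adjust to the actual data set.
MODEL = "LG+C20+R4"  # assumed instance of the "+C20+R" model mentioned above


def pmsf_step(alignment, guide_tree, prefix, model=MODEL):
    """Step 1: estimate a site frequency (PMSF) profile on the guide tree."""
    return ["iqtree2", "-s", alignment, "-m", model,
            "-ft", guide_tree, "--prefix", prefix]


def pmsr_step(alignment, guide_tree, sitefreq, prefix, model=MODEL):
    """Step 2: with the PMSF profile fixed, write per-site rates on the same tree."""
    return ["iqtree2", "-s", alignment, "-m", model,
            "-fs", sitefreq, "-te", guide_tree,
            "-wsr", "--prefix", prefix]


step1 = pmsf_step("aln.fasta", "guide.treefile", "step1_pmsf")
# Assumes step 1 writes its profile to <prefix>.sitefreq.
step2 = pmsr_step("aln.fasta", "guide.treefile",
                  "step1_pmsf.sitefreq", "step2_pmsr")

for cmd in (step1, step2):
    print(shlex.join(cmd))
```

The resulting frequency and rate files would then still have to be converted into whatever per-site format the downstream tool expects, which is the inefficient part described above.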

Dear Minh, if, as you mentioned, there already exists a way to try the PMSR approach in IQ-Tree2 (or even to jointly use it with a PMSF profile), I am very interested to know about it. So I'm looking forward to seeing your reply. Thank you!

Best regards, Stefan