stephaneguindon / phyml

PhyML -- Phylogenetic estimation using (Maximum) Likelihood
GNU General Public License v3.0

interpreting the log-likelihood in output #162

Closed ChangxuFan closed 2 years ago

ChangxuFan commented 2 years ago

Dear authors,

I ran phyml on a gene family to build a tree. Looking at the results, I'm a bit worried about the log-likelihood value: it's -754, which means the likelihood is almost zero! Does this mean that the program has little confidence in the estimated parameters or the tree topology? I was wondering if I'm understanding this incorrectly. I would highly appreciate it if you could point me to some resources on this.

Thank you so much!!


 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
                                  ---  PhyML 3.3.20190909  ---                                             
                              http://www.atgc-montpellier.fr/phyml                                          
                             Copyright CNRS - Universite Montpellier                                 
 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo

. Sequence filename:            exon3_wb_aligned_phy
. Data set:                 #1
. Initial tree:             BioNJ
. Model of nucleotides substitution:    GTR
. Number of taxa:           52
. Log-likelihood:           -754.13849
. Unconstrained log-likelihood:     -348.41766
. Composite log-likelihood:         -6438.01266
. Parsimony:                119
. Tree size:                1.35199
. Discrete gamma model:         Yes
  - Number of classes:          4
  - Gamma shape parameter:      1.901
  - Relative rate in class 1:       0.28116 [freq=0.250000]         
  - Relative rate in class 2:       0.64406 [freq=0.250000]         
  - Relative rate in class 3:       1.06730 [freq=0.250000]         
  - Relative rate in class 4:       2.00748 [freq=0.250000]         
. Nucleotides frequencies:
  - f(A)=  0.37232
  - f(C)=  0.24092
  - f(G)=  0.17327
  - f(T)=  0.21350
. GTR relative rate parameters :
  A <-> C    0.82212
  A <-> G    1.82689
  A <-> T    0.53724
  C <-> G    0.17829
  C <-> T    2.00016
  G <-> T    1.00000
. Instantaneous rate matrix : 
  [A---------C---------G---------T------]
  -0.82453   0.25951   0.41474   0.15028  
   0.40104  -1.00102   0.04048   0.55950  
   0.89119   0.05628  -1.22720   0.27973  
   0.26208   0.63136   0.22702  -1.12046  

. Run ID:               none
. Random seed:              1625516914
. Subtree patterns aliasing:        no
. Version:              3.3.20190909
. Time used:                0h0m4s (4 seconds)

 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
 Suggested citations:
 S. Guindon, JF. Dufayard, V. Lefort, M. Anisimova, W. Hordijk, O. Gascuel
 "New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0."
 Systematic Biology. 2010. 59(3):307-321.

 S. Guindon & O. Gascuel
 "A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood"
 Systematic Biology. 2003. 52(5):696-704.
 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
stephaneguindon commented 2 years ago

The log-likelihood is the logarithm of the probability of your sequence alignment, given the phylogenetic model with its parameters set to their optimal values. The number of possible alignments is huge: 4^(sequence length x number of sequences) for nucleotide data, so it should not be surprising that the probability of any single one of them is a very small number. A model in which every nucleotide in the alignment is chosen uniformly at random would give a log-likelihood of -(sequence length x number of sequences) x log(4). You could compare the log-likelihood of the phylogenetic model (about -754 in this case) to that value to assess how small it is relative to what is expected under such a dull model. Hope that makes sense.
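
The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the alignment length is not shown in the output pasted in this thread, so `N_SITES = 100` is a hypothetical placeholder to replace with the real number of columns, and the sketch assumes the reported log-likelihood is a natural logarithm. The function name `uniform_log_likelihood` is made up for this example and is not part of PhyML.

```python
import math


def uniform_log_likelihood(n_sites: int, n_taxa: int, n_states: int = 4) -> float:
    """Log-likelihood of an alignment whose characters are all drawn
    independently and uniformly at random from n_states states."""
    return -n_sites * n_taxa * math.log(n_states)


PHYML_LNL = -754.13849  # "Log-likelihood" line from the output above
N_TAXA = 52             # "Number of taxa" line from the output above
N_SITES = 100           # HYPOTHETICAL alignment length; use your own value

baseline = uniform_log_likelihood(N_SITES, N_TAXA)

# The same likelihood expressed as a power of ten (assumes natural logs).
log10_likelihood = PHYML_LNL / math.log(10)

# Exponentiating directly underflows double precision, which is one
# practical reason likelihoods are reported on the log scale.
raw_likelihood = math.exp(PHYML_LNL)

print(f"uniform-random baseline: {baseline:.2f}")
print(f"phylogenetic model:      {PHYML_LNL:.2f}")
print(f"log10(likelihood):       {log10_likelihood:.2f}")
print(f"exp(lnL) as a float:     {raw_likelihood}")
```

With any realistic alignment length the uniform baseline should come out far more negative than -754, which is the sense in which the fitted model explains the data much better than the dull model, even though the likelihood itself is numerically tiny.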