liamrevell / phytools

GNU General Public License v3.0
198 stars 56 forks source link

Disagreement between phylosig(, method="lambda") and plot.phylosig() - possible bug? #128

Closed cameron-weadick closed 12 months ago

cameron-weadick commented 1 year ago

Hello,

I'm having trouble reconciling the results of phylosig(, method="lambda") and the plot.phylosig() likelihood surface for that phylosig() fit. When I apply it to a small test data set, the results don't match. I can't explain the behaviour and I suspect a bug. Can you check on this?

On my test data set, geiger::fitContinuous(,model="lambda") returns a MLE for lambda of 0.000 whereas phylosig(, method="lambda") returns a MLE for lambda of 0.626. Many of the fitContinuous() optimization iterations landed on 0.626. suggesting that phylosig() was simply stuck on a suboptimal lambda value. To explore that, I plotted the likelihood surface using plot.phylosig(). Surprisingly, this showed that the MLE is being placed at a point that is very clearly not the peak of the likelihood surface. What's going on with this? Have I misinterpreted/miscoded something?

I've attached a figure and some code to demonstrate. Many thanks for your help with this question, and with phytools in general!

# load packages
require(geiger)
require(phytools)

    # > sessionInfo()
    # R version 4.2.1 (2022-06-23)
    # Platform: x86_64-apple-darwin17.0 (64-bit)
    # Running under: macOS Monterey 12.6.3
    # ...
    # other attached packages:
    # [1] phytools_1.5-1 maps_3.4.1     geiger_2.0.10  ape_5.7  

# load data
data(geospiza)
gtr = geospiza$geospiza.tree
gda = geospiza$geospiza.data[,"wingL"]

# drop mismatched data
name.check(gtr, gda)
gtr = drop.tip(gtr, name.check(gtr, gda)$tree_not_data)
name.check(gtr, gda)

# estimate lambda on data via two different methods, phytools::phylosig and geiger::fitContinuous
lambda_sig1 = phylosig(gtr, gda, method="lambda", test=T)
lambda_sig2 = fitContinuous(gtr, gda, model="lambda")

# check results - hmmm, different results depending on method
# phytools results = ~0.626 whereas geiger results = 0.000 (lower boundary)
lambda_sig1
    # # Phylogenetic signal lambda : 0.626225 
    # # logL(lambda) : 9.72957 
    # # LR(lambda=0) : -0.151315 
    # # P-value (based on LR test) : 1 
lambda_sig2
    # # GEIGER-fitted comparative model of continuous data
     # # fitted ‘lambda’ model parameters:
        # # lambda = 0.000000
        # # sigsq = 0.022206
        # # z0 = 4.235734

     # # model summary:
        # # log-likelihood = 9.805223
        # # AIC = -13.610445
        # # AICc = -10.943778
        # # free parameters = 3

# it seems that fitContinuous finds lambda=0.626 to be a suboptimal peak
# most runs get lambda=0 with lnL = 9.8052
fcFit = data.frame( lnL    = round(lambda_sig2$res[,"lnL"]   , 3), 
                    lambda = round(lambda_sig2$res[,"lambda"], 3))
table(fcFit)
    # # > table(fcFit)
             # # lambda
    # # lnL        0 0.626
      # # -1e+200  9     0
      # # -88.096  1     0
      # # -62.832  1     0
      # # 9.73     0    28
      # # 9.805   61     0

# visually check likelihood surfaces by calculating likelihood for the phylosig results
# phylosig seems to be ignoring the ML peak at lambda=0.  Why?
plot(lambda_sig1)
Screenshot 2023-03-07 at 15 46 49
liamrevell commented 1 year ago

Thanks for bringing this to my attention @cameron-weadick with such a nice reproducible example. It looks like stats::optimize in base R is just getting trapped on a local optimum -- which surprises me, as it's really not a very difficult problem to optimize!! phylosig actually exports the likelihood function as one of the object components (which is how the corresponding plot generic method can graph the likelihood surface that you showed). To see that this surface has a maximum at the MLE, try running the following code snippet:

lambda<-seq(0,1,length.out=10000)
logL<-sapply(lambda,lambda_sig1$lik)
plot(lambda,logL,type="l")
abline(v=lambda[which(logL==max(logL))],
    lty="dotted")
cat(paste("lambda: ",
    lambda[which(logL==max(logL))],"\n",
    "log(L): ",max(logL),"\n",sep=""))

I'm going to look into adding an alternative optimizer and see if that helps!

cameron-weadick commented 1 year ago

Thank you @liamrevell for the quick reply. I'm guessing stats::optimize had problems because the real MLE wasn't at an inflection point, but this isn't really something I'm very knowledgable about. Hopefully a suitable optimizer can be found, but if one can't then maybe a few checks can be added after optimization to flag suspicious runs, e.g. have phylosig evaluate the likelihood function at the parameter boundaries, or at intervals within that range, and do a quick test for any with higher likelihood than the putative MLE.

liamrevell commented 12 months ago

This should now be fixed. Please re-open if you're still seeing this error after updating phytools from GitHub. Thx!