improve the score and "validate" results

oscarIM commented 3 years ago

Hello, I hope you are feeling well. I am writing to ask you a question. I ran an analysis (with the code below), and at the end , the program delivers the following message: logLik R df score canImprove 1: -3181.902 2 10 6383.804 TRUE 2: -3177.004 3 15 6384.008 TRUE. Does this mean that by some codes fix (e.g. setting minCladeSizes or maxNumRoundRobins) you can get a more robust result? And the other thing, how can I be "sure" that the best delivered model really represents the evolution of my variable? I ask because this is a relatively large tree (1005 tips) and it is "expected" that there will be a larger number of regime/model changes (in my case there is just two OU regimen). If you need the data, I can send it to you. Regards Here the code used:

library(PCMBaseCpp)
library(PCMFit)
library(data.table)
library(ggplot2)
library(ggtree)
library(ape) 
# A character string used in filenames for a model inference on a given data:
set.seed(12345)
prefixFiles = paste0("MGPM_A_F_Mammals")
cluster <- parallel::makeCluster(24, outfile = paste0("log_", prefixFiles, ".txt"))
doParallel::registerDoParallel(cluster)
# This function is going to be executed on each worker node.
generatePCMModelsFunction <- function() {
  # make results reproducible
  set.seed(12345, kind = "Mersenne-Twister", normal.kind = "Inversion")
  PCMGenerateModelTypes()
  fileName <- "DefineParameterLimits.R"
  codeDefineLimits <- readChar(fileName, file.info(fileName)$size)
  eval(parse(text = codeDefineLimits), .GlobalEnv)
}
while(TRUE) {
points <- PCMTreeLocateMidpointsOnBranches(tree, 16)
if(length(points$nodes) == 0) {
  break
} else {
  tree <- PCMTreeInsertSingletons(tree, points$nodes, points$positions)
}
}
data <- read.csv("data_cleaned.csv")
X <- data$mean
names(X) <- data$species
X <- t(X)
currentResultFile <- paste0("Analyses_mammals_", prefixFiles, ".RData")
if (file.exists(currentResultFile)) {
  load(currentResultFile)
  tableFitsPrev <- listResults$tableFits
} else {
  tableFitsPrev <- NULL
}
#run
#global BM
modelBM <- PCM(model = "BM", k = 1)
fitBM <- PCMFit(model = modelBM, tree = tree, X = X, metaI = PCMBaseCpp::PCMInfoCpp)
#global OU
modelOU <- PCM(model = "OU", k = 1)
fitOU <- PCMFit(model = modelOU, tree = tree, X = X, metaI = PCMBaseCpp::PCMInfoCpp)
#run full model
fitMGPM <- PCMFitMixed(
  X = X, tree = tree, metaIFun = PCMInfoCpp,
  maxNumRoundRobins = 2, maxNumPartitionsInRoundRobins = 2,
  minCladeSizes = 20L, doParallel = TRUE)

#comparations
bestFit <- RetrieveBestFitScore(fitMGPM)
listModels <- list(
  RetrieveBestModel(fitBM),
  RetrieveBestModel(fitOU),
  bestFit$inferredModel)

dtSummary <- data.table(
  model = c(
    "Global BM",
    "Global OU",
    "Full model"),
  p = sapply(listModels, PCMParamCount),
  logLik = sapply(listModels, logLik),
  AIC = sapply(listModels, AIC))
knitr::kable(dtSummary)
|model      |  p|    logLik|      AIC|
|:----------|--:|---------:|--------:|
|Global BM  |  3| -3192.683| 6391.367|
|Global OU  |  5| -3188.553| 6387.106|
|Full model |  7| -3181.902| 6383.804|

venelin commented 3 years ago

Hi,

canImprove = TRUE means that the last RR iteration has improved the model score. If this is the last status printed, this means that the maximum number of RR iterations was reached before stopping to improve the score. Hence, increasing the maxNumRoundRobins parameter could potentially result in a better score of the final fit at the cost of longer computation.

To your second qustion. PCMFit is searching for the optimum of a score function over a vast space of candidate models (MGPMs). The score function itself is derived from the likelihood, that is, the probability of observing the trait values at the tips of a fixed tree under a candidate mixed Gaussian phylogenetic model (MGPM), and the complexity of this MGPM, expressed in terms of number of parameters and shift points. In other words, the final result is a trade-off between model complexity and likelihood value. Since this search uses heuristics to reduce the search space, there's no guarantee that PCMFitMixed would find the global optimum of the score function. Moreover, there is no guarantee that the score function is the perfect one. This is why, you can never be "sure", and the best interpretation of the final model should be as a possible evolutionary hypothesis. With a tree of this size, I would consider different strategies, e.g. (i) Calling PCMFitMixed on a subtree; (ii) explore the data visually and see if i can identify possible shift points (clades with different trait distributions); (iii) Simulate data on the same tree using some shift-point configuration and model parameters and see if PCMFitMixed can find a solution close to the true one.

More about the tests validating PCMFit against other implementations and score functions can be found in the appendix to the MGPM paper: https://www.pnas.org/highwire/filestream/881961/field_highwire_adjunct_files/0/pnas.1813823116.sapp.pdf .

Hope this helps!

oscarIM commented 3 years ago

Thank you very much for your quick response! I will try to implement your suggestions

venelin / PCMFit

improve the score and "validate" results #9