Closed halabikeren closed 6 years ago
// example for log-likelihood computations given fixed parameters (no optimization):
// two trees, same topology, one bl is different:
simpleTree1 = "(A:0.2,B:0.1,(C:0.2,D:0.2):0.2);";
simpleTree2 = "(A:0.2,B:0.2,(C:0.2,D:0.2):0.2);";
// same data
DataSet data = ReadFromString (
"
>A
A
>B
C
>C
A
>D
A
"
);
// SLKP if you don't set this flag, the branch lengths won't be
// scaled as you expect them to be
AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1;
DataSetFilter dataFilter = CreateFilter (data,1);
RateMatrix = {{,mu,mu,mu}
{mu,,mu,mu}
{mu,mu,,mu}
{mu,mu,mu,}};
Freqs = {{0.25},{0.25},{0.25},{0.25}};
Model M = (RateMatrix, Freqs);
Tree T1 = simpleTree1;
Tree T2 = simpleTree2;
// define likelihood computation objects on each tree and the data:
LikelihoodFunction fit1 = (dataFilter, T1);
LikelihoodFunction fit2 = (dataFilter, T2);
fprintf(stdout, "fit1 model: ", M, "\n");
fprintf(stdout, "fit2 model: ", M, "\n");
// computed log-likelihoods:
LIKELIHOOD_FUNCTION_OUTPUT = 0;
fprintf(stdout, "fit1: ", fit1, "\n");
fprintf(stdout, "fit2: ", fit2, "\n");
Results in
fit1: -5.46845944689568
fit2: -4.91190852366769
Dear Sergei,
Thank you for the quick reply! We are now facing another challenge. When we set mu=2, we get the same log-likelihood scores as with mu=1.
Just to be specific, using your example, we added the line
mu = 2.0;
just before defining the rate matrix.
This resulted in: fit1: -5.46845944689568 fit2: -4.91190852366769
as before...
This doesn't happen in our previous R example so we are obviously lacking some knowledge about how HyPhy treats rates and branch lengths. Reverse engineering doesn't seem to help us too much...
It matters to us because eventually, we would like to optimize global parameters over a set of trees whose topologies and branch lengths are fixed.
Thanks once again, Keren and Eli.
Dear @halabikeren,
In your particular example, mu
is essentially the same as the time parameter. This value will be overwritten by HyPhy when it sets branch lengths. This amounts to solving
BranchLength (mu) = C, where
C is the user value
For JC69, BranchLength (mu) = 3/4 * mu
If you wish to scale the tree, while keeping relative branch lengths constant, do something like this
// example for log-likelihood computations given fixed parameters (no optimization):
// two trees, same topology, one bl is different:
simpleTree1 = "(A:0.2,B:0.1,(C:0.2,D:0.2):0.2);";
simpleTree2 = "(A:0.2,B:0.2,(C:0.2,D:0.2):0.2);";
// same data
DataSet data = ReadFromString (
"
>A
A
>B
C
>C
A
>D
A
"
);
// SLKP if you don't set this flag, the branch lengths won't be
// scaled as you expect them to be
AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1;
DataSetFilter dataFilter = CreateFilter (data,1);
global branch_scaler = 1;
RateMatrix = {{,branch_scaler*mu,branch_scaler*mu,branch_scaler*mu}
{branch_scaler*mu,,branch_scaler*mu,branch_scaler*mu}
{branch_scaler*mu,branch_scaler*mu,,branch_scaler*mu}
{branch_scaler*mu,branch_scaler*mu,branch_scaler*mu,}};
Freqs = {{0.25},{0.25},{0.25},{0.25}};
Model M = (RateMatrix, Freqs);
Tree T1 = simpleTree1;
Tree T2 = simpleTree2;
// define likelihood computation objects on each tree and the data:
LikelihoodFunction fit1 = (dataFilter, T1);
LikelihoodFunction fit2 = (dataFilter, T2);
// computed log-likelihoods:
LIKELIHOOD_FUNCTION_OUTPUT = 0;
for (branch_scaler = 0.1; branch_scaler <= 2.1; branch_scaler += 0.1) {
fprintf(stdout, "logL (data | tree 1, (scaler = `Format (branch_scaler,3,1)`)) = ", fit1, "\n\tscaled tree", Format (T1, 0, 1), "\n");
}
fprintf (stdout, "\n");
for (branch_scaler = 0.1; branch_scaler <= 2.1; branch_scaler += 0.1) {
fprintf(stdout, "logL (data | tree 2, (scaler = `Format (branch_scaler,3,1)`)) = ", fit2, "\n\tscaled tree", Format (T2, 0, 1), "\n");
}
// fix branch lengths
ReplicateConstraint ("this1.?.mu:=this2.?.mu__", T1, T1);
ReplicateConstraint ("this1.?.mu:=this2.?.mu__", T2, T2);
Optimize (mle1, fit1);
fprintf (stdout, "Optimal scaler for tree 1 is ", branch_scaler, "\n");
Optimize (mle2, fit2);
fprintf (stdout, "Optimal scaler for tree 2 is ", branch_scaler, "\n");
Best, Sergei
Dear team,
We are a bit confused about the way HyPhy computes likelihood scores. In order to examine it, we wanted to construct a very minimal example. In this example we have two tree with the same topology but one branch length differs between the trees. We thought of analyzing these trees using the JC model with a fixed rate parameter (no optimization). In theory, the two trees should yield different likelihood results for the two trees. However, we do not get this from HyPhy. Below is our example code. We would be happy to understand why it is so and how HyPhy represents branch lengths (time? time*expected_changes? Ignores them?).
For example, below is an R code to compute log-likelihoods of two trees under fixed parameters: `
example for log-likelihood computations given fixed parameters (no optimization):
library(phangorn)
two trees, same topology, one bl is different:
simpleTree1 = read.tree(text = "(A:0.2,B:0.1,(C:0.2,D:0.2):0.2);") simpleTree2 = read.tree(text = "(A:0.2,B:0.2,(C:0.2,D:0.2):0.2);")
same data:
msa = matrix(c("A","C","A","A")) row.names(msa) = c("A","B","C","D") myPhyDat = as.phyDat(x = msa, type = "DNA")
define likelihood computation objects on each tree and the data:
fit1 = pml(simpleTree1, data=myPhyDat) fit2 = pml(simpleTree2, data=myPhyDat)
mu = 1
fit1$Q fit2$Q
computed log-likelihoods:
fit1$logLik fit2$logLik `
This yields:
If we now try something similar with HBL, we do not get the same results for the two trees. We either get different log-likelihoods for each of the trees (but different from the R example). This happens when mu is local. Or, we get the same log-likelihoods for both trees, suggesting that the branch lengths are ignored (mu is global).
HBL code (global mu): `
// example for log-likelihood computations given fixed parameters (no optimization):
// two trees, same topology, one bl is different:
simpleTree1 = "(A:0.2,B:0.1,(C:0.2,D:0.2):0.2);"; simpleTree2 = "(A:0.2,B:0.2,(C:0.2,D:0.2):0.2);";
// same data
DataSet data = ReadDataFile ("./data.fas"); DataSetFilter dataFilter = CreateFilter (data,1);
global mu = 1; RateMatrix = {{,mu,mu,mu} {mu,,mu,mu} {mu,mu,,mu} {mu,mu,mu,}};
Freqs = {{0.25},{0.25},{0.25},{0.25}}; Model M = (RateMatrix, Freqs);
Tree T1 = simpleTree1; Tree T2 = simpleTree2;
// define likelihood computation objects on each tree and the data:
LikelihoodFunction fit1 = (dataFilter, T1); LikelihoodFunction fit2 = (dataFilter, T2);
fprintf(stdout, "fit1 model: ", M, "\n"); fprintf(stdout, "fit2 model: ", M, "\n");
// computed log-likelihoods:
LIKELIHOOD_FUNCTION_OUTPUT = 0; fprintf(stdout, "fit1: ", fit1, "\n"); fprintf(stdout, "fit2: ", fit2, "\n"); `
This yields: fit1: -5.2173711234725 fit2: -5.2173711234725
We are a bit lost here...
Thanks, Keren and Eli