veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org
Other
209 stars 69 forks source link

likelihood with fixed parameters and branch lengths #592

Closed halabikeren closed 6 years ago

halabikeren commented 7 years ago

Dear team,

We are a bit confused about the way HyPhy computes likelihood scores. In order to examine it, we wanted to construct a very minimal example. In this example we have two tree with the same topology but one branch length differs between the trees. We thought of analyzing these trees using the JC model with a fixed rate parameter (no optimization). In theory, the two trees should yield different likelihood results for the two trees. However, we do not get this from HyPhy. Below is our example code. We would be happy to understand why it is so and how HyPhy represents branch lengths (time? time*expected_changes? Ignores them?).

For example, below is an R code to compute log-likelihoods of two trees under fixed parameters: `

example for log-likelihood computations given fixed parameters (no optimization):

library(phangorn)

two trees, same topology, one bl is different:

simpleTree1 = read.tree(text = "(A:0.2,B:0.1,(C:0.2,D:0.2):0.2);") simpleTree2 = read.tree(text = "(A:0.2,B:0.2,(C:0.2,D:0.2):0.2);")

same data:

msa = matrix(c("A","C","A","A")) row.names(msa) = c("A","B","C","D") myPhyDat = as.phyDat(x = msa, type = "DNA")

define likelihood computation objects on each tree and the data:

fit1 = pml(simpleTree1, data=myPhyDat) fit2 = pml(simpleTree2, data=myPhyDat)

mu = 1

fit1$Q fit2$Q

computed log-likelihoods:

fit1$logLik fit2$logLik `

This yields:

fit1$logLik [1] -5.468459 fit2$logLik [1] -4.911909

If we now try something similar with HBL, we do not get the same results for the two trees. We either get different log-likelihoods for each of the trees (but different from the R example). This happens when mu is local. Or, we get the same log-likelihoods for both trees, suggesting that the branch lengths are ignored (mu is global).

HBL code (global mu): `

// example for log-likelihood computations given fixed parameters (no optimization):

// two trees, same topology, one bl is different:

simpleTree1 = "(A:0.2,B:0.1,(C:0.2,D:0.2):0.2);"; simpleTree2 = "(A:0.2,B:0.2,(C:0.2,D:0.2):0.2);";

// same data

DataSet data = ReadDataFile ("./data.fas"); DataSetFilter dataFilter = CreateFilter (data,1);

global mu = 1; RateMatrix = {{,mu,mu,mu} {mu,,mu,mu} {mu,mu,,mu} {mu,mu,mu,}};
Freqs = {{0.25},{0.25},{0.25},{0.25}}; Model M = (RateMatrix, Freqs);

Tree T1 = simpleTree1; Tree T2 = simpleTree2;

// define likelihood computation objects on each tree and the data:

LikelihoodFunction fit1 = (dataFilter, T1); LikelihoodFunction fit2 = (dataFilter, T2);

fprintf(stdout, "fit1 model: ", M, "\n"); fprintf(stdout, "fit2 model: ", M, "\n");

// computed log-likelihoods:

LIKELIHOOD_FUNCTION_OUTPUT = 0; fprintf(stdout, "fit1: ", fit1, "\n"); fprintf(stdout, "fit2: ", fit2, "\n"); `

This yields: fit1: -5.2173711234725 fit2: -5.2173711234725

We are a bit lost here...

Thanks, Keren and Eli

spond commented 7 years ago
// example for log-likelihood computations given fixed parameters (no optimization):

// two trees, same topology, one bl is different:

simpleTree1 = "(A:0.2,B:0.1,(C:0.2,D:0.2):0.2);";
simpleTree2 = "(A:0.2,B:0.2,(C:0.2,D:0.2):0.2);";

// same data

DataSet data = ReadFromString (
"
>A
A
>B
C
>C
A
>D
A
"
);

// SLKP if you don't set this flag, the branch lengths won't be 
// scaled as you expect them to be

AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1;

DataSetFilter dataFilter = CreateFilter (data,1);

RateMatrix = {{,mu,mu,mu}
{mu,,mu,mu}
{mu,mu,,mu}
{mu,mu,mu,}};
Freqs = {{0.25},{0.25},{0.25},{0.25}};
Model M = (RateMatrix, Freqs);

Tree T1 = simpleTree1;
Tree T2 = simpleTree2;

// define likelihood computation objects on each tree and the data:

LikelihoodFunction fit1 = (dataFilter, T1);
LikelihoodFunction fit2 = (dataFilter, T2);

fprintf(stdout, "fit1 model: ", M, "\n");
fprintf(stdout, "fit2 model: ", M, "\n");

// computed log-likelihoods:

LIKELIHOOD_FUNCTION_OUTPUT = 0;
fprintf(stdout, "fit1: ", fit1, "\n");
fprintf(stdout, "fit2: ", fit2, "\n");

Results in

fit1: -5.46845944689568
fit2: -4.91190852366769
halabikeren commented 7 years ago

Dear Sergei,

Thank you for the quick reply! We are now facing another challenge. When we set mu=2, we get the same log-likelihood scores as with mu=1.

Just to be specific, using your example, we added the line mu = 2.0; just before defining the rate matrix.

This resulted in: fit1: -5.46845944689568 fit2: -4.91190852366769

as before...

This doesn't happen in our previous R example so we are obviously lacking some knowledge about how HyPhy treats rates and branch lengths. Reverse engineering doesn't seem to help us too much...

It matters to us because eventually, we would like to optimize global parameters over a set of trees whose topologies and branch lengths are fixed.

Thanks once again, Keren and Eli.

spond commented 7 years ago

Dear @halabikeren,

In your particular example, mu is essentially the same as the time parameter. This value will be overwritten by HyPhy when it sets branch lengths. This amounts to solving

BranchLength (mu) = C, where
C is the user value

For JC69, BranchLength (mu) = 3/4 * mu

If you wish to scale the tree, while keeping relative branch lengths constant, do something like this

// example for log-likelihood computations given fixed parameters (no optimization):

// two trees, same topology, one bl is different:

simpleTree1 = "(A:0.2,B:0.1,(C:0.2,D:0.2):0.2);";
simpleTree2 = "(A:0.2,B:0.2,(C:0.2,D:0.2):0.2);";

// same data

DataSet data = ReadFromString (
"
>A
A
>B
C
>C
A
>D
A
"
);

// SLKP if you don't set this flag, the branch lengths won't be
// scaled as you expect them to be

AUTOMATICALLY_CONVERT_BRANCH_LENGTHS = 1;

DataSetFilter dataFilter = CreateFilter (data,1);

global branch_scaler = 1;

RateMatrix = {{,branch_scaler*mu,branch_scaler*mu,branch_scaler*mu}
{branch_scaler*mu,,branch_scaler*mu,branch_scaler*mu}
{branch_scaler*mu,branch_scaler*mu,,branch_scaler*mu}
{branch_scaler*mu,branch_scaler*mu,branch_scaler*mu,}};

Freqs = {{0.25},{0.25},{0.25},{0.25}};
Model M = (RateMatrix, Freqs);

Tree T1 = simpleTree1;
Tree T2 = simpleTree2;

// define likelihood computation objects on each tree and the data:

LikelihoodFunction fit1 = (dataFilter, T1);
LikelihoodFunction fit2 = (dataFilter, T2);

// computed log-likelihoods:

LIKELIHOOD_FUNCTION_OUTPUT = 0;

for (branch_scaler = 0.1; branch_scaler <= 2.1; branch_scaler += 0.1) {
    fprintf(stdout, "logL (data | tree 1, (scaler = `Format (branch_scaler,3,1)`)) = ", fit1, "\n\tscaled tree", Format (T1, 0, 1), "\n");
}

fprintf (stdout, "\n");

for (branch_scaler = 0.1; branch_scaler <= 2.1; branch_scaler += 0.1) {
    fprintf(stdout, "logL (data | tree 2, (scaler = `Format (branch_scaler,3,1)`)) = ", fit2,  "\n\tscaled tree", Format (T2, 0, 1), "\n");
}

// fix branch lengths

ReplicateConstraint ("this1.?.mu:=this2.?.mu__", T1, T1);
ReplicateConstraint ("this1.?.mu:=this2.?.mu__", T2, T2);

Optimize (mle1, fit1);
fprintf (stdout, "Optimal scaler for tree 1 is ", branch_scaler, "\n");

Optimize (mle2, fit2);
fprintf (stdout, "Optimal scaler for tree 2 is ", branch_scaler, "\n");

Best, Sergei