lamho86 / phylolm

GNU General Public License v2.0
30 stars 12 forks source link

Alpha parameter >50 #28

Closed matthieu-haudiquet closed 3 years ago

matthieu-haudiquet commented 4 years ago

Hello,

Thank you for this great package ! I have read in several places ( e.g. : https://cran.r-project.org/web/packages/rr2/rr2.pdf ) that an alpha > 50 generally means there is no phylogenetic signal and thus you can (should?) use glm instead of phyloglm to perform the logistic regression.

Is this always correct and could you explain a little bit why ?

Best regards

cecileane commented 4 years ago

There is no such hard threshold, because it all depends on branch lengths. If you rescaled your tree to a total height of 1, then yes, you could use α>50 to mean "independent" residuals across species. But if your tree has a total height of 0.01, say, then α=50 would mean high phylogenetic correlation. Cooper et al. 2016 have a nice section "Recommendations for interpreting α". They recommend to compare the phylogenetic half-life ln(2)/α to the total height of your phylogeny.

matthieu-haudiquet commented 4 years ago

Thank you, that makes sense. It is much clearer now

arives commented 4 years ago

Matthieu,

This is a good point that you bring up, and a good point in Cecile's answer. The basic problem is that all optimizers have a difficult time finding maxima along boundaries. The form of the OU process in phylolm is parameterized so that alpha -> infinite as phylogenetic signal -> 0, so it is necessary to impose a boundary (alpha = 54, if I remember correctly). (Other formulations of the OU, like the one we used in Blomberg et al. 2003, use d = 1/alpha, but this formulation has exactly the same problem because it has a boundary at zero.) When hitting a boundary of no phylogenetic signal, it is best to refit the data with a model that does not include phylogenetic signal: glm() is the appropriate downgrade for phyloglm().

But Cecile is right to point out that alpha depends on branch lengths. The correct solution is always to transform branch lengths before doing a phylogenetic analysis. I typically standardize to make the determinant of the Brownian motion (starter tree) covariance matrix = 1, but it is also reasonable (and easier for phylo objects) to make the base to tip distance one. Also, it is best to make the phylogeny contemporaneous (ultrametric) for glms.

I'm not sure what the appropriate R etiquette is in a situation like this. Since rr2 is a "downstream" package from phylolm and other packages, I think Daijiang and I have to assume that users of our package are fitting models properly using the "upstream" functions. Maybe by that logic we shouldn't refit models for people when alpha > 50. On the other hand, that might stop people from making mistaken fitting, since I don't think you want to report the results from phylolm when alpha is at its maximum boundary.

Thanks again for point this out, and let me know what you think of my reasoning.

Cheers, Tony

cecileane commented 4 years ago

The form of the OU process in phylolm is parameterized so that alpha -> infinite as phylogenetic signal -> 0, so it is necessary to impose a boundary (alpha = 54, if I remember correctly)

The default upper bound on α, in phylolm, does depend on the tree height T: it's 50/T (from here). So if the tree was rescaled to a tree height of 1, the default upper bound would be 50 in phylolm. For phyloglm, the upper bound on is exp(4)/T = 54.6/T, like Tony said. The user does not need to rescale the tree: the software adjusts both the lower & upper bound on α accordingly. If either bound is hit, I believe that the user is given a warning, for the sake for full disclosure (but these warnings can reasonably be ignored since the defaults are already very wide).

The correct solution is always to transform branch lengths before doing a phylogenetic analysis.

I would not say "always". For example, one might be interested to know the effect of taxon sampling, or to repeat an analysis on various subclades. To be able to compare the analyses, it's best not to rescale the tree between resampling / subsetting strategies.

arives commented 4 years ago

Cecile,

Thanks for correcting me! I've always standardized branch lengths and then culled for alpha > 50; when doing lots of simulations, this is how I'd automate the identification of "lack of convergence" cases. I'll fix rr2.

Matthieu,

In case you are interested, here is code that illustrates why you should heed the warning in phyloglm:

n <- 500 phy <- compute.brlen(rtree(n=n), method = "Grafen", power = 1) phy2 <- phy phy2$edge.length <- 1000*phy2$edge.length

par(mfrow=c(1,2)) plot(phy, show.tip.label=F) add.scale.bar() plot(phy2, show.tip.label=F) add.scale.bar()

X1 <- rTraitCont(phy, model = "BM", sigma = 1) X1 <- (X1 - mean(X1))/var(X1) X = cbind(rep(1,n),X1)

sim.dat <- data.frame(Y=array(0, dim=n), X1=X1, row.names=phy$tip.label) sim.dat$Y <- rbinTrait(n=1, phy=phy, beta=c(0,0.5), alpha=100, X=X)

Compare with different branch lengths

summary(phyloglm(Y ~ X1, phy=phy, data=sim.dat)) summary(phyloglm(Y ~ X1, phy=phy2, data=sim.dat))

Cheers, Tony

arives commented 4 years ago

Cecile,

I just looked at this a little more, and I get cases when phyloglm gives a warning but the converge code is 0. Specifically, from the code in my last comment, you can get warnings and pretty different fits, but still have convergence codes of 0.

Would it be worth changing adding a "warning" code to phyloglm objects, or giving a non-zero convergence code under the same conditions that triggers the warning?

Cheers, Tony

lamho86 commented 4 years ago

Hello Tony,

I think this is a good idea! Thank you for your suggestion.

On the other hand, it is not strange that you can get convergence code of 0 and the warning at the same time. In this situation, the optimizer believes that the estimated alpha is the maximum of the likelihood function in the interval (0,54). On the other hand, the warning appears when the estimated alpha is close to 54. So, they are not mutually exclusive. When this happens, the estimated alpha will be something different from 54 but very close. You can think of this scenario as the likelihood function has a little bump near 54. However, the bump will be tiny since we know that the likelihood surface is quite flat near the boundary.

Best, Lam

Lam Ho | Assistant Professor FACULTY OF SCIENCE Department of Mathematics and Statistics 902.494.1069

Dalhousie University dal.ca

On Dec 4, 2019, at 9:20 AM, Anthony R. Ives notifications@github.com wrote:

Cecile,

I just looked at this a little more, and I get cases when phyloglm gives a warning but the converge code is 0. Specifically, from the code in my last comment, you can get warnings and pretty different fits, but still have convergence codes of 0.

Would it be worth changing adding a "warning" code to phyloglm objects, or giving a non-zero convergence code under the same conditions that triggers the warning?

Cheers, Tony

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/lamho86/phylolm/issues/28?email_source=notifications&email_token=ABW3HP6KDCI7B3JMDC5J4ZLQW6VCBA5CNFSM4JTARZZKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEF472QQ#issuecomment-561642818, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABW3HP3BXSE73PUG7SQEA5DQW6VCBANCNFSM4JTARZZA.

arives commented 4 years ago

Lam,

I totally agree that it is not surprising. I was just surprised that the optimizer would give a convergence code of 0 when the estimates were still quite poor as flagged by the warning. In the example code I used, changing the scaling of the branch lengths could give pretty different estimates of the fixed effects (coefficients) while still giving convergence = 0 under both phylogenetic trees (original and rescaled).

It would be nice to have a simple way to flag these cases that I could integrate into rr2. I didn't want to condition refitting phyloglm as an glm on a warning that is not an attribute of the phyloglm object. Of course, there are two other solutions: (i) ignore the issue and leave it to the user to make sure the fit is good, or (ii) automatically refit phyloglm after rescaling the phylogenetic tree outside of phylglm.

I'm slowly (too slowly) learning how to make code/packages that work with different styles of analyses that people use. I do automatically rescale trees before analyses, even if the functions I use also rescale trees internally (like phylolm and functions that I've written). But obviously the packages have to be written for people that don't do this, and I need to be more aware of this.

Cheers, Tony

From: Lam Ho notifications@github.com Reply-To: lamho86/phylolm reply@reply.github.com Date: Wednesday, December 4, 2019 at 10:15 PM To: lamho86/phylolm phylolm@noreply.github.com Cc: "Anthony R. Ives" arives@wisc.edu, Comment comment@noreply.github.com Subject: Re: [lamho86/phylolm] Alpha parameter >50 (#28)

Hello Tony,

I think this is a good idea! Thank you for your suggestion.

On the other hand, it is not strange that you can get convergence code of 0 and the warning at the same time. In this situation, the optimizer believes that the estimated alpha is the maximum of the likelihood function in the interval (0,54). On the other hand, the warning appears when the estimated alpha is close to 54. So, they are not mutually exclusive. When this happens, the estimated alpha will be something different from 54 but very close. You can think of this scenario as the likelihood function has a little bump near 54. However, the bump will be tiny since we know that the likelihood surface is quite flat near the boundary.

Best, Lam

Lam Ho | Assistant Professor FACULTY OF SCIENCE Department of Mathematics and Statistics 902.494.1069

Dalhousie University dal.ca

On Dec 4, 2019, at 9:20 AM, Anthony R. Ives notifications@github.com wrote:

Cecile,

I just looked at this a little more, and I get cases when phyloglm gives a warning but the converge code is 0. Specifically, from the code in my last comment, you can get warnings and pretty different fits, but still have convergence codes of 0.

Would it be worth changing adding a "warning" code to phyloglm objects, or giving a non-zero convergence code under the same conditions that triggers the warning?

Cheers, Tony

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/lamho86/phylolm/issues/28?email_source=notifications&email_token=ABW3HP6KDCI7B3JMDC5J4ZLQW6VCBA5CNFSM4JTARZZKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEF472QQ#issuecomment-561642818, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABW3HP3BXSE73PUG7SQEA5DQW6VCBANCNFSM4JTARZZA.

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/lamho86/phylolm/issues/28?email_source=notifications&email_token=ACYX6LDFSHVMB2DRJ7PGP53QXB56JA5CNFSM4JTARZZKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEF7NPPY#issuecomment-561960895, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACYX6LHJWD3HMFNSUIXPTITQXB56JANCNFSM4JTARZZA.

matthieu-haudiquet commented 4 years ago

Everyone,

Thank you Cecile, Anthony and Lam for the fruitful discussion. Especially I wasn't aware about the need to rescale the trees before such analysis.

Anthony thank you for the exemple, you are right

I agree that it would be very practical to have a flag when there is a warning about alpha being too close to its boundaries. Since I run a lot of phyloglm(), for now I just captured the warnings in R to flag it afterwards, but it would be nice if there was a flag so that I could just include a simple test in my functions. Would it be a good approach to increase the log.alpha.bound parameter until this warning (eventually) disappears ?

Best, Matthieu

arives commented 4 years ago

Matthieu,

I don't think you want to change the log.alpha.bound parameter boundary, since the way the function is coded (I know this now!) a value close to 54 on the rescaled tree is going to give essentially no signal, and hence optimization problems. When I get the warning, I always just revert to glm.

Cheers, Tony

matthieu-haudiquet commented 4 years ago

Ok, thanks !

lamho86 commented 4 years ago

Hello everyone,

I have added a flag (alphaWarn) in phyloglm for this warning. ‘0’ means good, ‘1’ means the estimate of alpha is near the lower bound, and ‘2’ means the estimate of alpha is near the upper bound.

Best, Lam

Lam Ho | Assistant Professor FACULTY OF SCIENCE Department of Mathematics and Statistics 902.494.1069

Dalhousie University dal.ca

On Dec 5, 2019, at 10:45 AM, Matthieu Haudiquet notifications@github.com wrote:

Ok, thanks !

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/lamho86/phylolm/issues/28?email_source=notifications&email_token=ABW3HP2NP54CQIHPKTJUSZTQXEHYDA5CNFSM4JTARZZKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEGA6AFI#issuecomment-562159637, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABW3HP6DK7YD6BBHD57BIR3QXEHYDANCNFSM4JTARZZA.

arives commented 4 years ago

Lam,

That is really helpful. Thanks!

Cheers, Tony

From: Lam Ho notifications@github.com Reply-To: lamho86/phylolm reply@reply.github.com Date: Sunday, December 8, 2019 at 1:41 PM To: lamho86/phylolm phylolm@noreply.github.com Cc: "Anthony R. Ives" arives@wisc.edu, Comment comment@noreply.github.com Subject: Re: [lamho86/phylolm] Alpha parameter >50 (#28)

Hello everyone,

I have added a flag (alphaWarn) in phyloglm for this warning. ‘0’ means good, ‘1’ means the estimate of alpha is near the lower bound, and ‘2’ means the estimate of alpha is near the upper bound.

Best, Lam

Lam Ho | Assistant Professor FACULTY OF SCIENCE Department of Mathematics and Statistics 902.494.1069

Dalhousie University dal.ca

On Dec 5, 2019, at 10:45 AM, Matthieu Haudiquet notifications@github.com wrote:

Ok, thanks !

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/lamho86/phylolm/issues/28?email_source=notifications&email_token=ABW3HP2NP54CQIHPKTJUSZTQXEHYDA5CNFSM4JTARZZKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEGA6AFI#issuecomment-562159637, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABW3HP6DK7YD6BBHD57BIR3QXEHYDANCNFSM4JTARZZA.

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/lamho86/phylolm/issues/28?email_source=notifications&email_token=ACYX6LFK6ZIVS6QEMZT6PEDQXVEXVA5CNFSM4JTARZZKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEGHHR5I#issuecomment-562985205, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACYX6LHFC54SCVI6WQWTLSDQXVEXVANCNFSM4JTARZZA.

matthieu-haudiquet commented 4 years ago

thanks a lot ! Matt

arives commented 4 years ago

Lam,

Just a quick question. I'm about to update rr2 on CRAN, and it would be nice to use your updates with alphaWarn handling. Do you plan to update your CRAN version in the near future?

Thanks, Tony

lamho86 commented 4 years ago

Hello Tony,

It has been a while since we last updated the CRAN version. I will push the current version to CRAN today.

arives commented 4 years ago

Thanks so much!

From: Lam Ho notifications@github.com Reply-To: lamho86/phylolm reply@reply.github.com Date: Wednesday, June 17, 2020 at 11:50 AM To: lamho86/phylolm phylolm@noreply.github.com Cc: "Anthony R. Ives" arives@wisc.edu, Comment comment@noreply.github.com Subject: Re: [lamho86/phylolm] Alpha parameter >50 (#28)

Hello Tony,

It has been a while since we last updated the CRAN version. I will push the current version to CRAN today.

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/lamho86/phylolm/issues/28#issuecomment-645491753, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACYX6LASETE7MSZJYMYWDNDRXDX4LANCNFSM4JTARZZA.