lamho86 / phylolm

GNU General Public License v2.0

phyloglm troubleshooting #46

Closed jakeberv closed 2 years ago

jakeberv commented 2 years ago

Dear phylolm developers,

I am trying to optimize a phylogenetic logistic regression model using phyloglm and I'm getting an optimization error when using default search options:

Error in optim(par = c(startB, startlL), fn = npllh, method = "L-BFGS-B", : L-BFGS-B needs finite values of 'fn'

The model will run when I set the method to logistic_IG10, but I get "NaN" for the AIC and likelihood values.

Can you provide any advice as to how to troubleshoot this issue?

Thanks, Jacob Berv

cecileane commented 2 years ago

The search might go to the boundary for estimating one (or more) parameter, like ∞, or 0 (for α). Is it the case? Any more information you could provide? Can you try decreasing the tolerance btol for the coefficients, and/or decreasing the bound log.alpha.bound for α?

jakeberv commented 2 years ago

Ok -- I tried changing the defaults as follows (I have to admit I do not really understand what these optimization parameters are doing):

phyloglm(formula, btol=100,log.alpha.bound=0)

And now I'm getting some new errors:

Warning messages:

1: In phyloglm(uncex.merged ~ (log(ages.age_since)) + log(discordance), : the estimate of 'alpha' (0.0200638066490299) reached the lower bound (0.0200638066490299). This may reflect a flat likelihood at low alpha values near the lower bound, meaning that the phylogenetic correlation is estimated to be maximal under the model in Ives and Garland (2010).

2: In phyloglm(uncex.merged ~ (log(ages.age_since)) + log(discordance), : the estimate of 'alpha' (0.0200638066490299) reached the upper bound (0.0200638066490299). This may simply reflect a flat likelihood at large alpha values, meaning that the phylogenetic correlation is estimated to be negligible.

Any ideas?

cecileane commented 2 years ago

Great, so it was indeed something about coefficients being driven to the boundary.

I suggest that you keep the default for btol (10): it constrains the probabilities of "1" predicted by the model to lie within 1/(1 + e^10) ≈ 0.000045 and 1/(1 + e^−10) ≈ 0.999955. So keep btol at its default, or set it to some lower value (like 4) if the problem comes back. Increasing it to 100 may increase the risk that convergence fails.

For α: the appropriate scale depends on the branch lengths. The algorithm calculates the mean distance from the root to tips, T, and then αT is constrained to lie within exp(±log.alpha.bound). By default, αT is constrained within exp(±4): [0.0183, 54.6]. Apparently, some of these values lead to an infinite likelihood (or precision issues) for your data. In your example with log.alpha.bound=0, you constrained αT to stay within [1,1], so you actually constrained α=1/T. It worked (the likelihood was calculated without issues), but it fixed α to a specific value that corresponds to a moderate level of phylogenetic correlation. The warnings are expected: they say that the estimated value of α is found to be equal to both the lower and the upper constraint.
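To make the scaling concrete, here is a small sketch of how log.alpha.bound translates into absolute bounds on α for a given tree (using the ape package; the tree `tr` is a toy example, not your data):

```r
# Sketch: how log.alpha.bound translates into absolute bounds on alpha.
# Assumes an ape "phylo" tree with branch lengths; "tr" is a toy example.
library(ape)

tr <- rcoal(20)  # toy ultrametric tree for illustration
# T: mean root-to-tip distance (the tree "height" used by phyloglm)
Tbar <- mean(node.depth.edgelength(tr)[1:Ntip(tr)])

log.alpha.bound <- 4  # the phyloglm default
alpha.range <- exp(c(-log.alpha.bound, log.alpha.bound)) / Tbar
alpha.range  # alpha is constrained to lie within this interval
```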

So here I suggest that you set log.alpha.bound=1 for example, in the hope that the function will converge. And then increase log.alpha.bound progressively to get it as high as you can, while maintaining convergence. I hope that as you do so, the estimated coefficients will be stable. You will see if the estimated α hits the lower bound or the upper bound. If α wants to be really large, the correlation model corresponds to independent residuals. If α wants to be close to 0, the correlation model corresponds to a Brownian motion.
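A minimal sketch of that strategy, assuming a data frame `dat`, a tree `tr`, and variables `y` and `x` (all placeholder names):

```r
# Sketch: widen log.alpha.bound step by step, keeping the last fit that
# converged. "dat", "tr", "y" and "x" are placeholder names.
library(phylolm)

fit <- NULL
for (b in c(0.5, 1, 2, 3, 4)) {
  candidate <- tryCatch(
    phyloglm(y ~ x, data = dat, phy = tr, btol = 10, log.alpha.bound = b),
    error = function(e) NULL)
  if (is.null(candidate)) break  # stop at the first failed fit
  fit <- candidate
}
if (!is.null(fit)) summary(fit)  # check whether alpha sits at a bound
```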

jakeberv commented 2 years ago

Hi Cécile, many thanks for your clarifications and suggestions--

I tried your idea and it does not seem to work for any value of log.alpha.bound other than 0 (with btol = 10, as you suggest). I tried log.alpha.bound values from 0.001 to 10 as well, and none of them work. I get the original error back:

"Error in optim(par = c(startB, startlL), fn = npllh, method = "L-BFGS-B", : L-BFGS-B needs finite values of 'fn'"

If the likelihood was calculated without issue when I set log.alpha.bound to zero, why does the model output NaN as the likelihood? My situation may be workable so long as I can extract the AIC (I am more interested in model ranks than parameter estimates, though of course it would be preferable to get some reliable estimate...).

It’s probably important to mention that I’m trying to do something a bit non-standard here. I am trying to model the probability of a binary event that occurs on internal nodes with respect to a continuous property of nodes (phylogenomic discordance). To do this, my thought was to graft zero-length branches to all internal nodes, to which I can then assign values for PCMs. This is a bit of a hacky solution, but it seems to work with regular PGLS for other models (and continuous traits).
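For context, the grafting trick could be sketched with phytools::bind.tip; the node number and labels below are purely illustrative:

```r
# Sketch: graft a zero-length tip onto one internal node so that a
# node-level trait can enter the regression. Node 12 is illustrative.
library(ape)
library(phytools)

tr <- rcoal(10)  # toy tree
tr2 <- bind.tip(tr, tip.label = "node12_state",
                edge.length = 0, where = 12)
# caution: the tree is renumbered after each graft, so grafting onto
# every internal node requires re-locating each node between calls
```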

Best, Jake


cecileane commented 2 years ago

Lam (@lamho86): would you have any idea?

Your approach to include data at internal nodes is a nice trick. It will affect the estimated height of the tree (T above), which is calculated in the algorithm as the average length from the root to each tip. With your trick, the algorithm will get a value of T that is "too small" (based on how we would like to interpret it). So the default starting value for α (which is 1/T) might be too large.

Can you try setting start.alpha to something larger than the default? Based on the output you showed above, this default value seems to be 0.0200638066490299. Could you try with start.alpha=0.1 or even 1?
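In code, that would simply be (with placeholder names for the data and tree):

```r
# Sketch: override the default starting value for alpha.
# "dat", "tr", "y" and "x" are placeholder names.
library(phylolm)
fit <- phyloglm(y ~ x, data = dat, phy = tr, start.alpha = 0.1)
```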

jakeberv commented 2 years ago

Something interesting: I tried re-generating my modified "trick" tree, but instead of grafting zero-length branches to internal nodes, I set the branch length > 0. For grafted branches set to 0.01 (time units), all of a sudden things start to work. So this must have something to do with the internal zero length branches. I'm now getting an optimized alpha value of ~0.5 and other parameters that seem sensible (with all others options set to default). This may be an ok workaround. The timescale I'm dealing with here is 100 Ma, so 0.01 time units is negligible. However -- it is not really exactly what I'm trying to do...

cecileane commented 2 years ago

Great! So it may be related to the general problem caused by sister species when both external edges leading to these sisters have length 0: such sister species are at distance 0 from each other in the tree, and that causes numerical issues.

You might have this situation if some internal edge has length 0 (for example, from a polytomy resolved with an internal edge of length 0). The internal nodes adjacent to an internal edge of length 0 would end up looking like sister species at distance 0 from each other once you graft new tips with external edges of length 0. Not so with external edges of length > 0. But perhaps you don't have this situation.
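One way to check for this situation, using base ape operations (`tr` is a placeholder for your tree):

```r
# Sketch: list internal edges of length 0, which can leave grafted tips
# at distance 0 from each other. "tr" is a placeholder tree object.
library(ape)
is.internal <- tr$edge[, 2] > Ntip(tr)  # edges whose child node is internal
zero.internal <- which(is.internal & tr$edge.length == 0)
tr$edge[zero.internal, ]  # parent/child node numbers of each such edge
```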

0.01 seems tiny compared to 100, but you could try even smaller values if that works.

lamho86 commented 2 years ago

My guess is the same as @cecileane's. You might have some internal edges of length zero. If this is the case, setting a small value for these edge lengths may solve the problem. You can still keep the grafted edges at 0.
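A sketch of that fix (`tr` is a placeholder tree; the value 1e-5 is arbitrary and should be negligible on your timescale):

```r
# Sketch: replace zero-length internal edges by a tiny positive length,
# while leaving the grafted external edges at 0.
library(ape)
is.internal <- tr$edge[, 2] > Ntip(tr)
tr$edge.length[is.internal & tr$edge.length == 0] <- 1e-5
```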

On the other hand, it could be that the 0-length external edges are the culprit. In that case, the only solution is to set a small value for the grafted edges. Have you tried method = logistic_MPLE?

jakeberv commented 2 years ago

Hi Lam -- yes, the workaround with branch lengths set to 0.01 used all defaults otherwise, including logistic_MPLE. I just tried with branch lengths set to 0.00001 and that also works, so that is definitely OK. Is it OK with both of you if I eventually reference our exchange here as a personal communication?

cecileane commented 2 years ago

Fine with me. You can actually cite this discussion's URL, or the URL of a particular comment.

jakeberv commented 2 years ago

Cool! Thanks for your help! J