Open runehaubo opened 5 years ago
Quick answer: I think we're evaluating the scaled gradients.
with(m0@optinfo$derivs, solve(chol(Hessian), gradient))
[1] -0.0005101626 -0.0000507276 0.0010575099
It is true that with default settings this doesn't get as close as the minqa bobyqa:
m1 <- update(m0, control=lmerControl(optimizer="bobyqa"))
with(m1@optinfo$derivs, solve(chol(Hessian), gradient))
[1] -1.159426e-06 1.100789e-08 -4.972763e-07
Lowering the tolerances does help:
m2 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy,
control=lmerControl(optimizer="nloptwrap", optCtrl=list(xtol_abs=1e-8, ftol_abs=1e-8)))
> with(m2@optinfo$derivs, solve(chol(Hessian), gradient))
[1] 4.566692e-05 1.265605e-05 5.696546e-06
@runehaubo , did you have any further comments on this? We could change the default tolerances (coded here) for nloptwrap
to make them a little tighter ... ? (I'm not crazy about the way these are coded -- they can be overridden in optCtrl()
, but I might prefer global options ...)
I find it interesting, that nloptwrap only uses 36 feval
whereas bobyqa uses 98. "no wonder" it gets closer. I'm almost sure that the 'tol'erances have different meanings in nloptwrap and bobyqa ... and for that reason, yes, we should probably change the gradient tol for nloptwrap .... or we change the "wrap" part of nloptwrap such that the gradient tolerances do correspond.
@bbolker is of course right with the scale gradients we use ... but that is not documented properly, and scaled itself of course also needs to be specified careful, indeed in a way such that @runehaubo and notably less smart users can recompute these scaled gradients, using things such as numDeriv
plus parts of the lmer()
object.
Really, we should have updated the content of help(lmerControl)
much better with the change to "nloptwrap"
. Average Joe User has no chance to understand what's going on currently, notably the examples (and their comments) still assume "bobyqa"
default.
Then we also have ?nloptwrap
which is nice but also in its examples assume "bobyqa" default.
I'm happy we have the good ole' sleepstudy example. Should really see what's going on with convergence there.
Current feeling: We stay with "nloptwrap"
but must change some of its tuning parameters (or just the meaning of the gradient tolerance?) and really really should explain the tolerances we have available from nloptr
. Currently it's not even easy to find out how to let "nloptwrap" iterate longer. As there are simultaneously several convergence criteria, I'd also want the <obj>@optinfo
list to return information about which kind of convergence was achieved [well, for that we may have to talk to the nloptr
package authors, haven't tried finding out yet.
Thanks, Martin for your follow up one this one.
I think it would be great to have the documentation updated in light of the new optimizer. On the scaled versus absolute gradient, I think the reason I didn't think scaled was in part due to the warning saying max|grad|
in:
m <- lmer(Reaction ~ Days + -1 + (Days | Subject), sleepstudy)
Warning message:
In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.00416642 (tol = 0.002, component 1)
thinking that grad
means the gradient of the log-likelihood or deviance function wrt. the parameters and not the scaled ditto. Could this perhaps be changed to max|scaled_gradient|
or similar to indicate that this is not just the gradient?
Another thing is that ?lmerControl
contains
check.conv.grad = .makeCC("warning", tol = 1e-3, relTol = NULL)
describing tol
and relTol
only as numerical tolerances which is (1) inconsistent with tol = 0.002
and (2) at a first sight seems to indicate that the tolerance on the (unscaled/absolute) gradient is 1e-3
.
In relation to "what to check" I can only support checking the relative/scaled gradient as is currently being done, but I also think the unscaled/original/true gradient is a relevant variabel to check. I suppose you will mostly encounter the combination of a small scaled gradient but a non-small unscaled gradient when the precision is also non-small. You can say that it is not relevant to estimate the parameters accurately (in absolute terms) in such cases because the statistical uncertainty will be large anyway. On the other hand there may sometimes be reasons to fit such "over-parameterized" models. Some people like to "keep it maximal" and fit rather complex var-cor structures for which there is often not much support in the data. In other cases it is relevant to include model terms (fixed or random) due to the experimental design (e.g. split-plot or "classical" repeated measurements designs) to evaluate the relevant F-tests. In cases like these, I think it is relevant to require that even parameters with small statistical precision are still estimated accurately, ie., with a small unscaled gradient.
Cheers Rune
for the record, note (via e-mail) from Phillip Alday (in support of reducing default xtol_abs
?)
For psycholinguistics datasets (especially EEG and eye movements), I have noticed that nloptwrap sometimes reports convergence failure when "bobyqa" does not. I think the "problem" is with xtol_abs -- changing maxeval has no impact, but allowing for smaller x-step changes does. This is in itself perhaps indicative of other problems with the model and data, but could once again lead to previously "converged" models not converging in a new lme4 version. The non-converged nloptwrapr models are definitely faster to compute though.
I haven't tested this systematically (neither in terms of comparing fits of bobyqa vs non-converged vs converged nloptwrap nor in terms of datasets) and there is no free lunch when it comes to optimizers, but it might nonetheless be convenient to list some of these optimizer-specific parameters in ?convergence or ?lmerControl beyond their use in the examples. If not, then this message on a public mailing list might still be useful hint for posterity. :)
documentation improved (hopefully) in e762742 . @runehaubo , I see tol=2e-3
in the documentation, not tol=1e-3
?
This is another example where the new optimizer "fails" on a classical example:
## This is the triglyceride data from Kuehl (2000, Ex. 7.1)
y <- c(142.3, 144.0, 148.6, 146.9, 142.9, 147.4, 133.8, 133.2,
134.9, 146.3, 145.2, 146.3, 125.9, 127.6, 108.9, 107.5,
148.6, 156.5, 148.6, 153.1, 135.5, 138.9, 132.1, 149.7,
152.0, 151.4, 149.7, 152.0, 142.9, 142.3, 141.7, 141.2)
trigly <- data.frame(y = y,
day = factor(rep(1:4, each = 8)),
machine = factor(rep(rep(1:4, each = 2), 2)))
library(lme4)
fit.lme0 <- lmer(y ~ (1 | day) + (1 | machine) + (1 | machine:day),
data = trigly)
## Warning message:
## In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.002216 (tol = 0.002, component 1)
summary(fit.lme0)
## Random effects:
## Groups Name Variance Std.Dev.
## machine:day (Intercept) 34.69 5.890
## machine (Intercept) 57.77 7.601
## day (Intercept) 44.76 6.690
## Residual 17.90 4.230
fit.lme <- lmer(y ~ (1 | day) + (1 | machine) + (1 | machine:day),
control = lmerControl(optimizer = "bobyqa"),
data = trigly)
summary(fit.lme)
## Random effects:
## Groups Name Variance Std.Dev.
## machine:day (Intercept) 34.72 5.892
## machine (Intercept) 57.72 7.597
## day (Intercept) 44.69 6.685
## Residual 17.90 4.230
## Classical appraoch using methods of moments like estimators:
fit <- aov(y ~ day * machine, data = trigly)
MS <- summary(fit)[[1]][,"Mean Sq"]
## Residual:
MS[4]
## 17.89531
## machine:day
(MS[3] - MS[4]) / 2
## 34.72097
## machine
(MS[2] - MS[3]) / 8
## 57.71944
## day
(MS[1] - MS[3]) / 8
## 44.68549
"Fails" meaning a false-positive convergence warning? I agree that this is a case where bobyqa does better. More support also for changing the default xtol_abs
to 1e-8. Perhaps over-hastily, I've bitten the bullet and tightened the default tolerances to xtol_abs=1e-8, ftol_abs=1e-8
in the development version now.
This is from before that change.
Comparisons:
L0 <- lmerControl(
optimizer="nloptwrap",
optCtrl=list(xtol_abs=1e-6,ftol_abs=1e-6), ## original
check.conv.grad=.makeCC("warning", tol=0))
update(fit.lme0,control=L0) ## 2.2e-3
L1 <- lmerControl(
optimizer="bobyqa",
check.conv.grad=.makeCC("warning", tol=0))
fit.lme1 <- update(fit.lme0,control=L2) ## 2.819e-7
L2 <- lmerControl(
optimizer="nloptwrap",
optCtrl=list(xtol_abs=1e-8,ftol_abs=1e-8),
check.conv.grad=.makeCC("warning", tol=0))
fit.lme2 <- update(fit.lme0,control=L2) ## 1e-4
L3 <- lmerControl(
optimizer="nloptwrap",
optCtrl=list(xtol_abs=1e-8,ftol_abs=1e-6),
check.conv.grad=.makeCC("warning", tol=0))
fit.lme3 <- update(fit.lme0,control=L3) ## 2.2e-3
## speed? not much difference
library(rbenchmark)
benchmark(
nlorig=update(fit.lme0),
nl2=update(fit.lme0,
control=lmerControl(optimizer="nloptwrap",
optCtrl=list(xtol_abs=1e-8,ftol_abs=1e-8))),
bobyqa=update(fit.lme0,
control=lmerControl(optimizer="bobyqa")))
The source of the test errors seen in lmerTest with the new lme4 package on Windows are also seen on Mac and evolves around basically 3 model fits as described below.
The first fit is an old friend:
No warnings are issued, but looking at the gradients I'm wondering not?
As I read
?lmerControl
the defaulttol
ongrad
is2e-3
, so I should see a warning, no? Maybe its the computation of the gradient...which is very much the same...
Is the warning-system defunct?
A similar situation occurs with the following fit (statistical nonsense, but it merely illustrates a point):
Here I do get a warning, but the gradient in the reported warning is far from this:
and doesn't reflect how bad the optimizer actually did.
Another model where I see much the same problems is:
Here the gradients are all really far from zero contrary to what the warning message would have me believe. And it seems the gradients in
optinfo
are accurate enough:Note that the old-default optimizer did significantly better:
The last model that I saw problems with in the lmerTest test-suite was the following where much the same issue surfaced:
And again the old-default optimizer does much better:
Note, the test-suite in lmerTest doesn't actually test these gradients, but it does test that the right denominator
DF
are calculated for the Satterthwaite t- and F-tests, and since theseDF
s are ratios of gradients and Hessians these wildly inaccurate fits affect theDF
computations.All in all I'm not sure I prefer the new default over the old default optimizer and to be honest I think it's a problem that the user is not accurately notified when the optimizer does not terminate near the optimum. I haven't looked deeper into these things on the
lmer
side of things but thought I would report here as soon as possible.