lme4 / lme4

Mixed-effects models in R using S4 classes and methods with RcppEigen

Experiences with the new default lmer optimizer from lmerTest #501

Open runehaubo opened 5 years ago

runehaubo commented 5 years ago

The source of the test errors seen in lmerTest with the new lme4 package on Windows is also present on Mac and revolves around basically three model fits, as described below.

The first fit is an old friend:

library(lme4)
m0 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)

No warnings are issued, but looking at the gradients I'm wondering why not:

m0@optinfo$derivs$gradient
[1] -0.005552805  0.007729215  0.028316701

As I read ?lmerControl, the default tol on grad is 2e-3, so I should see a warning, no? Maybe it's the computation of the gradient...

(g <- numDeriv::grad(func = update(m0, devFunOnly=TRUE),
                      x = m0@theta, method = "Richardson"))
[1] -0.005552651  0.007730982  0.028335562

which is very much the same...

max(abs(g)) > 2e-3
[1] TRUE

Is the warning-system defunct?

A similar situation occurs with the following fit (statistical nonsense, but it merely illustrates a point):

m <- lmer(Reaction ~ Days + -1 + (Days | Subject), sleepstudy)
Warning message:
In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge with max|grad| = 0.00416642 (tol = 0.002, component 1)

Here I do get a warning, but the gradient in the reported warning is far from this:

m@optinfo$derivs$gradient
[1] -0.001290930 -0.005710308  0.025144469

and doesn't reflect how badly the optimizer actually did.

Another model where I see much the same problems is:

data("carrots", package = "lmerTest")
m1 <- lmer(Preference ~ sens1 + sens2 + (1 + sens1 + sens2 | Consumer) +
             (1 | Product), data=carrots)
Warning message:
In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge with max|grad| = 0.00519832 (tol = 0.002, component 1)
m1@optinfo$derivs$gradient
[1] 0.12056390 0.03195508 0.09487334 0.26335720 0.15152168 0.06937019 0.13223993

Here the gradients are all really far from zero, contrary to what the warning message would have me believe. And it seems the gradients in optinfo are accurate enough:

(g <- numDeriv::grad(func = update(m1, devFunOnly=TRUE),
                     x = m1@theta, method = "Richardson"))
[1] 0.12057208 0.03189465 0.09487857 0.26317500 0.15134787 0.06916931 0.13225885

Note that the old-default optimizer did significantly better:

m2 <- lmer(Preference ~ sens1 + sens2 + (1 + sens1 + sens2 | Consumer) +
                    (1 | Product), data=carrots,
                  control=lmerControl(optimizer="bobyqa"))
singular fit
m2@optinfo$derivs$gradient
[1] -2.626939e-04 -2.829267e-03 -1.023607e-03 -6.612026e-06  4.177809e-04  4.247840e-04
[7]  2.031697e-04

The last model that I saw problems with in the lmerTest test-suite was the following where much the same issue surfaced:

data("TVbo", package = "lmerTest")
m3 <- lmer(Sharpnessofmovement ~ TVset * Picture + (1 | Assessor) +
             (1 | Assessor:TVset) + (1 | Assessor:Picture), data = TVbo)
Warning message:
In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge with max|grad| = 0.00244572 (tol = 0.002, component 1)
m3@optinfo$derivs$gradient
[1] 0.023146131 0.013681076 0.000035011

And again the old-default optimizer does much better:

m3 <- lmer(Sharpnessofmovement ~ TVset * Picture + (1 | Assessor) +
             (1 | Assessor:TVset) + (1 | Assessor:Picture), data = TVbo,
           control=lmerControl(optimizer="bobyqa"))
m3@optinfo$derivs$gradient
[1] -2.695515e-06  1.148237e-06  5.042011e-07

Note that the test suite in lmerTest doesn't actually test these gradients, but it does test that the right denominator DF are calculated for the Satterthwaite t- and F-tests, and since these DFs are ratios of gradients and Hessians, these wildly inaccurate fits affect the DF computations.
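To see this concretely, here is a minimal sketch (assuming lmerTest 3.x, where anova() on an lmerTest fit reports Satterthwaite F-tests and denominator DF by default) that refits the carrots model with both optimizers and compares the reported DenDF:

## Sketch: compare Satterthwaite denominator DF across optimizers (lmerTest 3.x assumed)
library(lmerTest)
data("carrots", package = "lmerTest")
m_default <- lmer(Preference ~ sens1 + sens2 + (1 + sens1 + sens2 | Consumer) +
                    (1 | Product), data = carrots)                        ## default nloptwrap
m_bobyqa  <- lmer(Preference ~ sens1 + sens2 + (1 + sens1 + sens2 | Consumer) +
                    (1 | Product), data = carrots,
                  control = lmerControl(optimizer = "bobyqa"))            ## old default
anova(m_default)  ## Satterthwaite F-tests and DenDF from the inaccurate fit
anova(m_bobyqa)   ## same tests from the bobyqa fit; DenDF may differ noticeably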

All in all, I'm not sure I prefer the new default optimizer over the old one, and to be honest I think it's a problem that the user is not accurately notified when the optimizer does not terminate near the optimum. I haven't looked deeper into these things on the lmer side but thought I would report here as soon as possible.

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.2

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] lmerTest_3.1-0 lme4_1.1-20    Matrix_1.2-15  fortunes_1.5-4

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0        rstudioapi_0.9.0  magrittr_1.5      bindr_0.1.1      
 [5] splines_3.5.2     MASS_7.3-51.1     tidyselect_0.2.5  munsell_0.5.0    
 [9] colorspace_1.4-0  lattice_0.20-38   R6_2.3.0          rlang_0.3.1      
[13] minqa_1.2.4       dplyr_0.7.8       plyr_1.8.4        tools_3.5.2      
[17] grid_3.5.2        gtable_0.2.0      nlme_3.1-137      assertthat_0.2.0 
[21] yaml_2.2.0        lazyeval_0.2.1    tibble_2.0.1      numDeriv_2016.8-1
[25] crayon_1.3.4      bindrcpp_0.2.2    purrr_0.3.0       nloptr_1.2.1     
[29] ggplot2_3.1.0     glue_1.3.0        compiler_3.5.2    pillar_1.3.1     
[33] scales_1.0.0      pkgconfig_2.0.2  
bbolker commented 5 years ago

Quick answer: I think we're evaluating the scaled gradients.

with(m0@optinfo$derivs, solve(chol(Hessian), gradient))
[1] -0.0005101626 -0.0000507276  0.0010575099

It is true that with default settings this doesn't get as close as the minqa bobyqa:

m1 <- update(m0, control=lmerControl(optimizer="bobyqa"))
with(m1@optinfo$derivs, solve(chol(Hessian), gradient))
[1] -1.159426e-06  1.100789e-08 -4.972763e-07

Lowering the tolerances does help:

m2 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy, 
   control=lmerControl(optimizer="nloptwrap", optCtrl=list(xtol_abs=1e-8, ftol_abs=1e-8)))
with(m2@optinfo$derivs, solve(chol(Hessian), gradient))
[1] 4.566692e-05 1.265605e-05 5.696546e-06
bbolker commented 5 years ago

@runehaubo , did you have any further comments on this? We could change the default tolerances (coded here) for nloptwrap to make them a little tighter ... ? (I'm not crazy about the way these are coded -- they can be overridden in optCtrl(), but I might prefer global options ...)

mmaechler commented 5 years ago

I find it interesting that nloptwrap only uses 36 function evaluations whereas bobyqa uses 98; "no wonder" it gets closer. I'm almost sure that the tolerances have different meanings in nloptwrap and bobyqa ... and for that reason, yes, we should probably change the gradient tolerance for nloptwrap, or change the "wrap" part of nloptwrap such that the gradient tolerances do correspond.
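For reference, a small sketch of how to read those evaluation counts off the fitted objects (assuming, as in lme4 fits generally, that @optinfo$feval holds the count reported back by the optimizer):

## Sketch: compare function-evaluation counts for the sleepstudy fit under both optimizers
m0_nlopt  <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)              ## default nloptwrap
m0_bobyqa <- update(m0_nlopt, control = lmerControl(optimizer = "bobyqa"))     ## old default
c(nloptwrap = m0_nlopt@optinfo$feval, bobyqa = m0_bobyqa@optinfo$feval)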

mmaechler commented 5 years ago

@bbolker is of course right about the scaled gradients we use ... but that is not documented properly, and the scaling itself also needs to be specified carefully, in a way such that @runehaubo and notably less expert users can recompute these scaled gradients using things such as numDeriv plus parts of the lmer() object.

Really, we should have updated the content of help(lmerControl) much more thoroughly with the change to "nloptwrap". The average user currently has no chance to understand what's going on; notably, the examples (and their comments) still assume the "bobyqa" default.

Then we also have ?nloptwrap, which is nice, but its examples also assume the "bobyqa" default. I'm happy we have the good ol' sleepstudy example; we should really see what's going on with convergence there.

Current feeling: we stay with "nloptwrap" but must change some of its tuning parameters (or just the meaning of the gradient tolerance?), and we really should explain the tolerances available from nloptr. Currently it's not even easy to find out how to let "nloptwrap" iterate longer. As there are simultaneously several convergence criteria, I'd also want the <obj>@optinfo list to return information about which kind of convergence was achieved (for that we may have to talk to the nloptr package authors; I haven't tried finding out yet).
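Until something like that exists, the closest one can get is to poke at what optinfo already stores; a minimal sketch for, e.g., the sleepstudy fit m0 above (the exact components, such as conv and message, may vary by optimizer and lme4 version):

## Sketch: inspect what the optimizer reported back to lme4
names(m0@optinfo)      ## e.g. optimizer, control, derivs, conv, feval, warnings, val
m0@optinfo$optimizer   ## which optimizer was used
m0@optinfo$conv        ## convergence codes from the optimizer and from lme4's own checks
m0@optinfo$message     ## optimizer status message, if the back-end provides one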

runehaubo commented 5 years ago

Thanks, Martin, for your follow-up on this one.

I think it would be great to have the documentation updated in light of the new optimizer. On the scaled versus absolute gradient: I think the reason I didn't think of the scaled gradient was in part the warning saying max|grad| in:

m <- lmer(Reaction ~ Days + -1 + (Days | Subject), sleepstudy)
Warning message:
In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge with max|grad| = 0.00416642 (tol = 0.002, component 1)

thinking that grad means the gradient of the log-likelihood or deviance function with respect to the parameters, and not its scaled counterpart. Could this perhaps be changed to max|scaled_gradient| or similar, to indicate that this is not just the gradient?

Another thing is that ?lmerControl contains

check.conv.grad = .makeCC("warning", tol = 1e-3, relTol = NULL)

describing tol and relTol only as numerical tolerances, which is (1) inconsistent with tol = 0.002 and (2) at first sight seems to indicate that the tolerance on the (unscaled/absolute) gradient is 1e-3.
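For illustration, a minimal sketch of how that tolerance is actually set per fit, using .makeCC() as it appears in ?lmerControl (the value 2e-3 here just mirrors the warnings above):

## Sketch: set the gradient convergence-check threshold explicitly for a single fit
ctrl  <- lmerControl(check.conv.grad = .makeCC("warning", tol = 2e-3, relTol = NULL))
m_tol <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy, control = ctrl)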

In relation to "what to check", I can only support checking the relative/scaled gradient as is currently being done, but I also think the unscaled/original/true gradient is a relevant quantity to check. I suppose you will mostly encounter the combination of a small scaled gradient but a non-small unscaled gradient when the precision is also non-small. You could argue that it is not relevant to estimate the parameters accurately (in absolute terms) in such cases because the statistical uncertainty will be large anyway. On the other hand, there may sometimes be reasons to fit such "over-parameterized" models. Some people like to "keep it maximal" and fit rather complex variance-covariance structures for which there is often not much support in the data. In other cases it is relevant to include model terms (fixed or random) due to the experimental design (e.g. split-plot or "classical" repeated-measures designs) in order to evaluate the relevant F-tests. In cases like these, I think it is relevant to require that even parameters with small statistical precision are still estimated accurately, i.e., with a small unscaled gradient.
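To make the two notions of gradient concrete, here is a sketch of a hypothetical helper (not part of lme4) that reports both the unscaled gradient stored in @optinfo and the Cholesky-scaled gradient that the convergence check apparently uses:

## Sketch of a hypothetical helper: max absolute unscaled and scaled gradients for a fit.
## Uses the derivatives stored in @optinfo; may fail if the Hessian is not positive definite.
check_gradients <- function(m, tol = 2e-3) {
  d        <- m@optinfo$derivs
  unscaled <- d$gradient
  scaled   <- solve(chol(d$Hessian), d$gradient)   ## as in bbolker's comment above
  c(max_abs_unscaled = max(abs(unscaled)),
    max_abs_scaled   = max(abs(scaled)),
    tol              = tol)
}
check_gradients(m1)  ## large unscaled gradients, but a much smaller scaled gradient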

Cheers Rune

bbolker commented 5 years ago

For the record, a note (via e-mail) from Phillip Alday (in support of reducing the default xtol_abs?):

For psycholinguistics datasets (especially EEG and eye movements), I have noticed that nloptwrap sometimes reports convergence failure when "bobyqa" does not. I think the "problem" is with xtol_abs -- changing maxeval has no impact, but allowing for smaller x-step changes does. This is in itself perhaps indicative of other problems with the model and data, but could once again lead to previously "converged" models not converging in a new lme4 version. The non-converged nloptwrap models are definitely faster to compute, though.

I haven't tested this systematically (neither in terms of comparing fits of bobyqa vs. non-converged vs. converged nloptwrap, nor in terms of datasets), and there is no free lunch when it comes to optimizers, but it might nonetheless be convenient to list some of these optimizer-specific parameters in ?convergence or ?lmerControl beyond their use in the examples. If not, this message on a public mailing list might still be a useful hint for posterity. :)
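A minimal sketch of the kind of adjustment the note describes, applied here to the carrots fit m1 from above (the optCtrl names follow the nloptwrap defaults; maxeval is assumed to be passed through to nloptr, and whether any of this helps is of course model-dependent):

## Sketch: allow smaller x-steps vs. simply allowing more evaluations for nloptwrap
m1_tight <- update(m1, control = lmerControl(
    optimizer = "nloptwrap",
    optCtrl   = list(xtol_abs = 1e-8, ftol_abs = 1e-8)))   ## smaller x-step tolerance
m1_long  <- update(m1, control = lmerControl(
    optimizer = "nloptwrap",
    optCtrl   = list(maxeval = 1e5)))                       ## more evaluations alone reportedly does not help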

bbolker commented 5 years ago

documentation improved (hopefully) in e762742 . @runehaubo , I see tol=2e-3 in the documentation, not tol=1e-3 ?

meierluk commented 4 years ago

Here is another example where the new optimizer "fails" on a classical dataset:

## This is the triglyceride data from Kuehl (2000, Ex. 7.1)

y <- c(142.3, 144.0, 148.6, 146.9, 142.9, 147.4, 133.8, 133.2,
       134.9, 146.3, 145.2, 146.3, 125.9, 127.6, 108.9, 107.5,
       148.6, 156.5, 148.6, 153.1, 135.5, 138.9, 132.1, 149.7,
       152.0, 151.4, 149.7, 152.0, 142.9, 142.3, 141.7, 141.2)

trigly <- data.frame(y = y,
                     day = factor(rep(1:4, each = 8)),
                     machine = factor(rep(rep(1:4, each = 2), 2)))
library(lme4)
fit.lme0 <- lmer(y ~ (1 | day) + (1 | machine) + (1 | machine:day), 
                data = trigly)
## Warning message:
##   In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
##                  Model failed to converge with max|grad| = 0.002216 (tol = 0.002, component 1)

summary(fit.lme0)
## Random effects:
##   Groups      Name        Variance Std.Dev.
## machine:day (Intercept) 34.69    5.890   
## machine     (Intercept) 57.77    7.601   
## day         (Intercept) 44.76    6.690   
## Residual                17.90    4.230   

fit.lme <- lmer(y ~ (1 | day) + (1 | machine) + (1 | machine:day), 
                control = lmerControl(optimizer = "bobyqa"), 
                data = trigly)
summary(fit.lme)
## Random effects:
##   Groups      Name        Variance Std.Dev.
## machine:day (Intercept) 34.72    5.892   
## machine     (Intercept) 57.72    7.597   
## day         (Intercept) 44.69    6.685   
## Residual                17.90    4.230  

## Classical approach using method-of-moments-like estimators:
fit <- aov(y ~ day * machine, data = trigly)
MS <- summary(fit)[[1]][,"Mean Sq"]

## Residual:
MS[4]
## 17.89531

## machine:day
(MS[3] - MS[4]) / 2
## 34.72097

## machine
(MS[2] - MS[3]) / 8
## 57.71944

## day
(MS[1] - MS[3]) / 8
## 44.68549
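For completeness, a small sketch pulling the REML variance components into a data frame (as.data.frame(VarCorr(.)) is standard lme4) so they sit next to the method-of-moments values above; the row order is assumed to match the summaries shown (machine:day, machine, day, Residual):

## Sketch: bobyqa variance components alongside the method-of-moments estimates
vc <- as.data.frame(VarCorr(fit.lme))
cbind(vc[, c("grp", "vcov")],
      MoM = c((MS[3] - MS[4]) / 2,   ## machine:day
              (MS[2] - MS[3]) / 8,   ## machine
              (MS[1] - MS[3]) / 8,   ## day
              MS[4]))                ## Residual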
bbolker commented 4 years ago

"Fails" meaning a false-positive convergence warning? I agree that this is a case where bobyqa does better. More support also for changing the default xtol_abs to 1e-8. Perhaps over-hastily, I've bitten the bullet and tightened the default tolerances to xtol_abs=1e-8, ftol_abs=1e-8 in the development version now.

This is from before that change.

Comparisons:

L0 <- lmerControl(
    optimizer="nloptwrap",
    optCtrl=list(xtol_abs=1e-6,ftol_abs=1e-6), ## original
    check.conv.grad=.makeCC("warning", tol=0))
update(fit.lme0,control=L0) ## 2.2e-3
L1 <- lmerControl(
    optimizer="bobyqa",
    check.conv.grad=.makeCC("warning", tol=0))
fit.lme1 <- update(fit.lme0, control=L1)  ## 2.819e-7
L2 <- lmerControl(
    optimizer="nloptwrap",
    optCtrl=list(xtol_abs=1e-8,ftol_abs=1e-8),
    check.conv.grad=.makeCC("warning", tol=0))
fit.lme2 <- update(fit.lme0,control=L2)  ## 1e-4
L3 <- lmerControl(
    optimizer="nloptwrap",
    optCtrl=list(xtol_abs=1e-8,ftol_abs=1e-6),
    check.conv.grad=.makeCC("warning", tol=0))
fit.lme3 <- update(fit.lme0,control=L3)  ## 2.2e-3

## speed? not much difference
library(rbenchmark)
benchmark(
    nlorig=update(fit.lme0),
    nl2=update(fit.lme0,
                 control=lmerControl(optimizer="nloptwrap",
                           optCtrl=list(xtol_abs=1e-8,ftol_abs=1e-8))),
    bobyqa=update(fit.lme0,
                 control=lmerControl(optimizer="bobyqa")))