kaskr / adcomp

AD computation with Template Model Builder (TMB)

For large random-effects models, obj$fn() and obj$gr() return NaN after many iterations #345

Closed (njhenry closed this issue 3 years ago)

njhenry commented 3 years ago

Description:

For TMB models with large numbers of random effects, obj$fn() and obj$gr() begin returning NaN after they have been called many times, even when evaluated at starting parameters that originally yielded valid results. I first encountered this in a TMB model with 148,000 random effects, where the NaNs consistently appeared after 142 iterations of my outer optimizer (both nlminb and L-BFGS-B, called via optimx), causing the optimization to fail. As a workaround, I saved the most recent fixed effects, obj$env$last.par.best[1:length(obj$par)], restarted my R session, assigned these values as the starting values in place of obj$par, and restarted the optimizer; the model then converged.
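For reference, the restart workaround looked roughly like this (a minimal sketch: the saved-file name is illustrative, and nlminb is called directly here rather than through optimx as in my actual runs):

## Before restarting: save the best fixed effects seen so far
best_fixed <- obj$env$last.par.best[1:length(obj$par)]
saveRDS(best_fixed, "best_fixed.rds")  # illustrative file name

## After restarting R and rebuilding `obj` with MakeADFun() as before
best_fixed <- readRDS("best_fixed.rds")
opt <- nlminb(start = best_fixed, objective = obj$fn, gradient = obj$gr)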

I have reproduced this behavior using a simple mixed-effects model below. My apologies if this is a problem with my implementation rather than a bug; I wasn't able to find any examples of this behavior on the Google Group or in previous issues, and I would appreciate any suggestions you might have for addressing it.

Reproducible Steps:

I reproduced the issue using a mixed-effects extension of the linreg.cpp template from the TMB book, with 1E6 data observations and 2E5 random effects. The objective function is created with random.start = expression(rep(0, length(random))) so that the starting values for both fixed and random effects remain constant across calls to the likelihood and gradient functions. I call obj$fn(obj$par) and obj$gr(obj$par) 2000 times, recording the results and execution time for each iteration.

library(TMB); library(glue); library(tools)

## 1) TMB C++ template
template_code <- "
  // linreg.cpp from TMB book + random effect and hyperprior
  #include <TMB.hpp>
  template<class Type>
  Type objective_function<Type>::operator() ()
  {
    DATA_VECTOR(Y); DATA_VECTOR(x); DATA_IVECTOR(group_ids);

    PARAMETER(a); PARAMETER(b); PARAMETER_VECTOR(group_res);
    PARAMETER(logSigma_group); PARAMETER(logSigma_obs);

    Type nll = 0.0;
    nll -= sum(dnorm(group_res, Type(0.0), exp(logSigma_group), true));
    for(int ii = 0; ii < x.size(); ii++){
      nll -= dnorm(
        Y(ii), a + b*x(ii) + group_res(group_ids(ii)), exp(logSigma_obs), true
      );
    }
    return nll;
  }
"

## 2) Define input data and starting parameters
set.seed(123)
x_vals <- 1:1e6
a <- 0; b <- 1E-3; sigma_obs <- 1; sigma_group <- .5
group_ids <- ceiling(x_vals/5)
num_groups <- max(group_ids)
group_res <- sigma_group * rnorm(num_groups)

data_list <- list(
  Y = a + b * x_vals + group_res[group_ids] + sigma_group * rnorm(length(x_vals)),
  x = x_vals,
  group_ids = group_ids - 1 # Convert from 1-index (R) to 0-index (C++)
)
params_list <- list(
  a = 0, b = 0, group_res = rep(0, num_groups), logSigma_group = 0, logSigma_obs = 0
)

## 3) Compile template and load ADFunction
fp <- tempfile(fileext = ".cpp"); writeLines(template_code, fp); TMB::compile(fp)
dyn.load(TMB::dynlib(tools::file_path_sans_ext(fp)))
obj <- TMB::MakeADFun(
  data = data_list, parameters = params_list, random = 'group_res',
  random.start = expression(rep(0, length(random))),
  DLL = basename(tools::file_path_sans_ext(fp)),
  inner.control = list(trace=TRUE, maxit=1000)
)
obj$env$tracemgc <- 1

## 4) Call obj$fn() and obj$gr() repeatedly using the same starting parameters
max_iterations <- 2000
fn_results <- gr_results <- fn_exec_time <- vector('list', length=max_iterations)
for(ii in 1:max_iterations){
  start_time <- proc.time()
  fn_results[[ii]] <- obj$fn(obj$par)
  gr_results[[ii]] <- obj$gr(obj$par)
  fn_exec_time[[ii]] <- proc.time() - start_time
  message(glue::glue(
    "--> Iter {ii}: {format(fn_results[[ii]], digits=10)}",
    "    Exec time: {format(fn_exec_time[[ii]][1], digits=5)}s",
    "==========================\n",
    .sep = '\n'
  ))
}

Expected Output:

For iterations 1-1340, the output looks as expected, with the same likelihood and gradient values returned across iterations:

iter: 1  value: 27779156995 mgc: 5005.364 ustep: 1 
iter: 2  mgc: 2.273737e-12 
iter: 1  value: 27779156995 mgc: 5005.364 ustep: 1 
iter: 2  mgc: 2.273737e-12 
outer mgc:  5.555575e+13 
--> Iter 2: 27779152383
    Exec time: 2.089s
==========================
iter: 1  value: 27779156995 mgc: 5005.364 ustep: 1 
iter: 2  mgc: 2.273737e-12 
iter: 1  value: 27779156995 mgc: 5005.364 ustep: 1 
iter: 2  mgc: 2.273737e-12 
outer mgc:  5.555575e+13 
--> Iter 3: 27779152383
    Exec time: 2.02s
==========================
...
> fn_results_vec <- unlist(lapply(fn_results, function(x) x[1]))
> range(fn_results_vec[1:1340])                                                           
[1] 27779152383 27779152383

For iterations 1-1340, the median execution time for each paired call to obj$fn() and obj$gr() was 2.23 seconds.

Current Output:

Starting at iteration 1341, calls to both obj$fn() and obj$gr() consistently return NaN:

iter: 1  value: NaN mgc: 5005.364 ustep: 0 
Error in par[random] <- par.random : replacement has length zero
iter: 1  value: NaN mgc: 5005.364 ustep: 0 
Error in par[random] <- par.random : replacement has length zero
outer mgc:  NaN 
--> Iter 1341: NaN
    Exec time: 72.445s
==========================
iter: 1  value: NaN mgc: 5005.364 ustep: 0 
Error in par[random] <- par.random : replacement has length zero
iter: 1  value: NaN mgc: 5005.364 ustep: 0 
Error in par[random] <- par.random : replacement has length zero
outer mgc:  NaN 
--> Iter 1342: NaN
    Exec time: 72.252s
==========================
...
> any(is.na(fn_results[1:1340]))                                                          
[1] FALSE
> any(is.na(gr_results[1:1340]))                                                          
[1] FALSE
> all(is.na(fn_results[1341:2000]))                                                       
[1] TRUE
> all(is.na(gr_results[1341:2000]))                                                       
[1] TRUE

For iterations 1341-2000, the median execution time for each loop iteration was 69.90 seconds.
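The medians above were computed from the timings recorded in the loop, roughly as follows (a sketch; it summarises the first, "user.self", component of proc.time(), which is what the per-iteration messages print):

## Summarise the per-iteration timings stored in fn_exec_time
exec_time_vec <- sapply(fn_exec_time, function(t) t[1])
median(exec_time_vec[1:1340])     # ~2.23 s while the calls still succeed
median(exec_time_vec[1341:2000])  # ~69.90 s once NaN values appear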

TMB Version:

1.7.19

R Version:

4.0.3, installed and managed using conda 4.9.2 (linux-64 version)

Operating System:

CentOS Linux 7

kaskr commented 3 years ago

@njhenry Thanks for taking the time to file this accurate report.

Are you by any chance using Matrix package version 1.3-0, 1.3-1, or 1.3-2?
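You can check the installed version from the same R session, e.g.:

packageVersion("Matrix")  # installed Matrix package version
packageVersion("TMB")     # TMB version, for reference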

njhenry commented 3 years ago

Hi @kaskr, thanks for your response. I am using Matrix package version 1.3-2, last updated on 6 Jan 2021.

kaskr commented 3 years ago

You are probably seeing the same issue as https://github.com/kaskr/adcomp/issues/340 and https://github.com/glmmTMB/glmmTMB/issues/665.

Solution: Upgrade TMB to version 1.7.20.
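For example, assuming the updated package is available from CRAN (otherwise, install the TMB subdirectory of this repository from GitHub):

## Reinstall TMB and confirm the version
install.packages("TMB")
packageVersion("TMB")  # should report 1.7.20 or later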

njhenry commented 3 years ago

Fantastic—after updating TMB to 1.7.20, both my toy example and my original model finished successfully. Thanks very much for your help.