glmmTMB / glmmTMB

crash (memory allocation error) with odd Tweedie example #962

Closed bbolker closed 8 months ago

bbolker commented 8 months ago

This must be tickling some bug in our Tweedie code. From Stack Overflow, a Tweedie model fit that reliably crashes R with

cannot create std::vector larger than max_size()

set.seed(111)
dd <- data.frame(STAND = rep(LETTERS[1:4], c(5, 3, 6, 4)),
                 DATE = c("2022-01-01", "2022-02-12", "2022-03-01", "2022-04-05",
                          "2022-06-01", "2022-01-01", "2022-02-12", "2022-03-01",
                          "2022-01-01", "2022-02-12", "2022-03-01", "2022-04-05",
                          "2022-06-01", "2022-06-20", "2022-01-01", "2022-02-12",
                          "2022-03-01", "2022-04-05"))
dd$B2_MAX <- runif(n = nrow(dd))
dd_sub <- subset(dd, STAND == "D")
dd_sub$DATE_TIME <- as.numeric(difftime(dd_sub$DATE, as.Date("2022-06-30"), units = "days"))

The next code chunk will crash your R session

library(glmmTMB)
glmmTMB(B2_MAX ~ poly(DATE_TIME, 3) + (1 | DATE_TIME),
        data = dd_sub,
        family = tweedie(link = "log"))

Let's set up the model piecewise so that we can enable tracing from within TMB:

g0 <- glmmTMB(B2_MAX ~ poly(DATE_TIME, 3) + (1 | DATE_TIME),
              data = dd_sub,
              family = tweedie(link = "log"),
              control = glmmTMBControl(optCtrl = list(trace = 1)),
              doFit = FALSE)
g1 <- fitTMB(g0, doOptim = FALSE)
g1$env$tracepar <- TRUE

Now run the optimization by hand; this also crashes:

    with(g1, nlminb(par, fn, gradient = gr, control = list(trace = 1)))

The last parameters printed are:

       beta        beta        beta        beta       betad       theta 
 -1.1155322   1.1282692   0.3635176   0.9100788 -16.3756661 -10.2536210 
        psi 
 -2.2664551 

but just calling the function with these approximate (rounded) parameters is not close enough to cause the crash:

b3 <- c(-1.1155322, 1.1282692, 0.3635176, 0.9100788, -16.3756661, -10.2536210,  -2.2664551)
g1$fn(b3)

I can get better precision this way:

wfun <- function(par) {
    op <- options(digits = 22); on.exit(options(op))
    cat(par, "\n")
    g1$fn(par)
}
with(g1, nlminb(par, wfun, gradient = gr, control = list(trace = 1)))

but this doesn't help me identify "bad" parameters.

It seems like there's either some kind of leak/state-dependence (i.e., the crash depends on the full trajectory of the optimization/history of calls to the objective function, not just the last call), or some delicate numerical criterion that triggers the crash, and these rounded parameters are only approximately there ...

A couple more pieces of information:

       beta        beta        beta        beta       betad       theta 
 -1.1151115   1.1284065   0.3635438   0.9102538 -16.3497491 -10.2380749 
        psi 
 -2.2343998 

These values are close to, but not identical to, the values from a previous run, which suggests some kind of undefined behaviour (e.g. accessing an uninitialized memory location??)

@kaskr, sorry to bother you ... maybe the next step is to run this within valgrind etc.? Here is the backtrace from running the code inside gdb: backtrace.txt

These seem to be the most useful bits ... ??

#14 std::vector<double, std::allocator<double> >::vector (this=<optimized out>, __n=<optimized out>, __a=...)
    at /usr/include/c++/11/bits/stl_vector.h:511
#15 0x00007fffd3a7f92b in atomic::tweedie_utils::tweedie_logW<double> (y=<optimized out>, phi=<optimized out>, 
    p=<optimized out>) at /usr/include/c++/11/ext/new_allocator.h:79
#16 0x00007fffd3a7fae2 in atomic::tweedie_logWEval<0, 3, 1, 9l>::operator()<double, double> (ty=0x5555612d3ee0, 
    tx=0x7ffffff9d050, this=<optimized out>) at /usr/local/lib/R/site-library/TMB/include/tiny_ad/atomic.hpp:62
#17 atomic::tweedie_logWOp<0, 3, 1, 9l>::Rf_eval<double, double> (ty=0x5555612d3ee0, tx=0x7ffffff9d050, 
    this=<optimized out>) at /usr/local/lib/R/site-library/TMB/include/tiny_ad/atomic.hpp:62
#18 atomic::tweedie_logWOp<0, 3, 1, 9l>::forward (args=..., this=<optimized out>)
    at /usr/local/lib/R/site-library/TMB/include/tiny_ad/atomic.hpp:62

I think if I were going to obsess further over this I would probably instrument the Tweedie code in TMB to tell me what size vector it thought it was allocating here; could nterms be a negative value that integer-overflows to something gigantic ... ???

kaskr commented 8 months ago

Thanks for the debugging @bbolker - your suspicion is correct. There's an integer overflow in jh = ceil(j), where jh is an int and j = 3012750346.5651932 is a double greater than the maximum representable int.

bbolker commented 8 months ago

Thanks @kaskr!