pbs-assess / sdmTMB

:earth_americas: An R package for spatial and spatiotemporal GLMMs with TMB
https://pbs-assess.github.io/sdmTMB/
182 stars 26 forks source link

Computational scaling with large datasets, especially for Tweedie #302

Closed seananderson closed 3 months ago

seananderson commented 5 months ago

See discussion here: https://github.com/pbs-assess/sdmTMB/discussions/301

My example code:

library(sdmTMB)
library(tictoc)

# match glmmTMB:
ctl <- sdmTMBcontrol(newton_loops = 0, multiphase = FALSE)

N <- 3e4
set.seed(1)
dat <- data.frame(
  y = fishMod::rTweedie(N, 1, 1, 1.5)
)

tic()
m2 <- sdmTMB(
  y ~ 1,
  data = dat,
  spatial = 'off',
  family = tweedie("log"),
  control = ctl
)
toc()
#> 6.429 sec elapsed

tic()
m1 <- glmmTMB::glmmTMB(
  y ~ 1,
  data = dat,
  family = glmmTMB::tweedie("log")
)
toc()
#> 1.433 sec elapsed

# Gaussian ----------------------------

N <- 3e4
set.seed(1)
dat <- data.frame(
  y = rnorm(N, 0, 1)
)

tic()
m2 <- sdmTMB(
  y ~ 1,
  data = dat,
  spatial = 'off',
  control = sdmTMBcontrol(newton_loops = 1, multiphase = FALSE)
)
toc()
#> 0.116 sec elapsed

tic()
m1 <- glmmTMB::glmmTMB(
  y ~ 1,
  data = dat
)
toc()
#> 0.219 sec elapsed

Created on 2024-02-15 with reprex v2.1.0

I'm not sure what's going on. My first thought was that maybe a single set of epsilon/omega random effects that are left mapped off at zero might be responsible, but when I try shrinking those to have dimensions of 0 it doesn't change much. I don't see a similar scaling problem with the Gaussian. Actually, maybe it's there but not until much larger data sizes and not as badly:

library(sdmTMB)
library(tictoc)

N <- 3e6
set.seed(2)
dat <- data.frame(
  y = rnorm(N, 0, 1)
)
ctl <- sdmTMBcontrol(newton_loops = 0, multiphase = FALSE)

tic()
m2 <- sdmTMB(
  y ~ 1,
  data = dat,
  spatial = 'off',
  control = ctl
)
#> Warning: The model may not have converged. Maximum final gradient:
#> 0.0594763641440051.
toc()
#> 13.274 sec elapsed

tic()
m1 <- glmmTMB::glmmTMB(
  y ~ 1,
  data = dat
)
toc()
#> 7.489 sec elapsed

Created on 2024-02-15 with reprex v2.1.0

There are gradient issues with the Gaussian big data example there.

I wonder if the optimizer settings are different? Or starting values are better? Or if an internal parameter transformation is different? Or if it's because of extra parameters that are mapped off?