pbs-assess / sdmTMB

:earth_americas: An R package for spatial and spatiotemporal GLMMs with TMB
https://pbs-assess.github.io/sdmTMB/

NA/NaN gradient evaluation error encountered when running sdmTMB function with spatial `on` #288

Open davjfish opened 8 months ago

davjfish commented 8 months ago

When working through this demo on a new computer and a fresh install of R (4.3.2), we are running into the following issue:

library(ggplot2)
library(dplyr)
library(sdmTMB)

glimpse(pcod)
mesh <- make_mesh(pcod, c("X", "Y"), cutoff = 10)
plot(mesh)

m <- sdmTMB(
  data = pcod,
  formula = present ~ depth_scaled + depth_scaled2,
  mesh = mesh, # can be omitted for a non-spatial model
  family = binomial(link = "logit"),
  spatial = "on"
)

Produces this error:

Error in stats::nlminb(start = tmb_obj$par, objective = tmb_obj$fn, gradient = tmb_obj$gr,  : 
  NA/NaN gradient evaluation
In addition: Warning message:
In stats::nlminb(start = tmb_obj$par, objective = tmb_obj$fn, gradient = tmb_obj$gr,  :
  NA/NaN function evaluation

When spatial is set to "off", we do not get this error. Originally, we suspected this was a problem with running the library on Linux, but we have since reproduced it on Windows. The error has also been reproduced on R version 4.2.2. The error message is the same on Linux, but we do receive a few extra warnings:

Error in stats::nlminb(start = tmb_obj$par, objective = tmb_obj$fn, gradient = tmb_obj$gr,  : 
  NA/NaN gradient evaluation
In addition: Warning messages:
1: In Cholesky(h.pattern, super = super) :
  Cholmod warning 'matrix not positive definite' at file ../Supernodal/t_cholmod_super_numeric.c, line 911
2: In stats::nlminb(start = tmb_obj$par, objective = tmb_obj$fn, gradient = tmb_obj$gr,  :
  NA/NaN function evaluation
3: In Cholesky(h.pattern, super = super) :
  Cholmod warning 'matrix not positive definite' at file ../Supernodal/t_cholmod_super_numeric.c, line 911

seananderson commented 8 months ago

This is likely due to a mismatch between your installed Matrix and the Matrix version used to build the binary on CRAN. It affects all TMB packages on CRAN. Install from source for now. We'll push a minor update to trigger a rebuild of the binary shortly.
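A quick way to see which versions are actually installed, and when each was built, is to read the package metadata. This is a minimal sketch using only base R; the helper name `report_build` is made up for illustration, and the `Built` field only records the R version, platform, and build time, so it can hint at a stale build but not prove one.

```r
# Hypothetical helper (not part of sdmTMB or TMB): summarise the installed
# version and build metadata for a set of packages.
report_build <- function(pkgs = c("Matrix", "TMB", "sdmTMB")) {
  rows <- lapply(pkgs, function(pkg) {
    # system.file() returns "" when the package is not installed
    if (!nzchar(system.file(package = pkg))) {
      return(data.frame(package = pkg, version = NA_character_,
                        built = NA_character_, stringsAsFactors = FALSE))
    }
    desc <- utils::packageDescription(pkg)
    built <- if (is.null(desc$Built)) NA_character_ else desc$Built
    data.frame(package = pkg, version = desc$Version, built = built,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

report_build()
```

If TMB or sdmTMB shows a build date older than the Matrix install, rebuilding them from source after Matrix is the step suggested above.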

davjfish commented 8 months ago

I wiped out all the installed packages and then ran this script on the Linux box:

install.packages("Matrix", type = "source")
install.packages("TMB", type = "source")
install.packages("sdmTMB", type = "source")
install.packages("ggplot2")
install.packages("dplyr")

library(ggplot2)
library(dplyr)
library(sdmTMB)

mesh <- make_mesh(pcod, c("X", "Y"), cutoff = 10)

m <- sdmTMB(
  data = pcod,
  formula = present ~ depth_scaled + depth_scaled2,
  mesh = mesh, # can be omitted for a non-spatial model
  family = binomial(link = "logit"),
  spatial = "on"
)

Same error but different warnings:

[screenshot of the error output; image not preserved]

seananderson commented 8 months ago

Have you restarted your R session to ensure the latest package installs are the ones loaded?

If that doesn't fix it, does a basic example with glmmTMB that has random effects run?

And if that works but sdmTMB doesn't, does the GitHub version work?
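Whether a restart is still needed can be checked roughly by comparing the version of each loaded namespace with the version now on disk. This is a sketch under the assumption that a reinstalled package keeps its old namespace in memory until R restarts; `stale_namespaces` is a hypothetical helper name, not an sdmTMB function.

```r
# Hypothetical helper: compare the version of each loaded namespace with the
# version currently found on disk for that package.
stale_namespaces <- function(pkgs = loadedNamespaces()) {
  loaded <- vapply(pkgs, function(p) as.character(getNamespaceVersion(p)),
                   character(1))
  on_disk <- vapply(pkgs, function(p) {
    tryCatch(as.character(utils::packageVersion(p)),
             error = function(e) NA_character_)
  }, character(1))
  # Compare as version objects so "1.6-5" and "1.6.5" count as equal
  stale <- vapply(seq_along(pkgs), function(i) {
    !is.na(on_disk[i]) &&
      package_version(loaded[i]) != package_version(on_disk[i])
  }, logical(1))
  data.frame(package = pkgs, loaded = loaded, on_disk = on_disk,
             stale = stale, row.names = NULL, stringsAsFactors = FALSE)
}

res <- stale_namespaces()
res[res$stale, ]
```

An empty result only means the version numbers agree; a package rebuilt at the same version would not show up here, so restarting remains the safer habit.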

davjfish commented 8 months ago

I confirm that we have tried restarting the R session.

Here is the basic glmmTMB example we ran without any issue:

library(glmmTMB)
library(gamlss.dist)
dat <- data.frame(y =c(rZINBI(100, mu = 10, sigma = .6, nu=0.1),
                       rZINBI(100, mu = 5, sigma = .3, nu=.5)),
                  sites =c(rep("a", 100), rep("b", 100)),
                  year = rep(1:4, each = 10, times = 5),
                  trans = rep(1:40, each = 5, times = 1), 
                  area=rNO(200,20))

m1 <- glmmTMB(y ~ sites + (1|trans),
              zi=~0,
              family=nbinom1, data=dat)

Finally, we are getting the same result when installing the package directly from GitHub (R session was also restarted):

install.packages("Matrix", type = "source")
install.packages("TMB", type = "source")
install.packages("remotes")
remotes::install_github("pbs-assess/sdmTMB")
install.packages("ggplot2")
install.packages("dplyr")

library(ggplot2)
library(dplyr)
library(sdmTMB)

mesh <- make_mesh(pcod, c("X", "Y"), cutoff = 10)

m <- sdmTMB(
  data = pcod,
  formula = present ~ depth_scaled + depth_scaled2,
  mesh = mesh, # can be omitted for a non-spatial model
  family = binomial(link = "logit"),
  spatial = "on"
)

[screenshot; image not preserved]

seananderson commented 8 months ago

I'm running out of ideas. I've always seen the 'rebuilding from source with the latest Matrix version'-fix work.

Other information on the Matrix issue:
https://github.com/glmmTMB/glmmTMB/issues/965
https://stat.ethz.ch/pipermail/r-package-devel/2023q4/010054.html
https://stackoverflow.com/a/77504843

One other option would be to install an archived version of Matrix from before the ABI change, such as Matrix_1.6-1.1.tar.gz: https://cran.r-project.org/src/contrib/Archive/Matrix/

install.packages("/path/to/downloads/Matrix_1.6-1.1.tar.gz", type = "source", repos = NULL)

Restart your R session, then try the binary version of sdmTMB:

install.packages("sdmTMB")

I'll get a new version of sdmTMB on CRAN shortly, which should let the binary version work.

Otherwise, maybe it's something about your R linear algebra (BLAS/LAPACK) setup or C++ compiler Makevars? I don't see why glmmTMB would work and sdmTMB wouldn't, though, if both were built from source. The only thing I've seen cause this for models that should otherwise fit is this Matrix issue.

Everything seems to be working across all tested systems with continuous integration, including that basic example.

If you post the output of sessionInfo() (after relevant packages are loaded), it's possible I can recreate it in Docker.
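When posting session details, it can help to pull out the linear algebra libraries explicitly, since they are easy to overlook in the full printout. A small sketch using base R only; the list name `linalg` is arbitrary.

```r
# Collect the pieces of the session most relevant to a BLAS/LAPACK question.
si <- sessionInfo()
linalg <- list(
  r_version      = R.version.string,
  blas           = si$BLAS,       # path to the BLAS library in use (may be empty)
  lapack         = si$LAPACK,     # path to the LAPACK library in use
  lapack_version = La_version()   # LAPACK version reported by R
)
str(linalg)
```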

davjfish commented 8 months ago

Yeah, this is strange. It is surprising that the error was reproduced on our end across two separate installs (Windows and Ubuntu) while the unit tests are running fine.

I tried the above suggestion (i.e., installation of Matrix 1.6-1.1 from zipped tarball) and this did not work either.

Here is the output from sessionInfo:

R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8     LC_MONETARY=en_CA.UTF-8   
 [6] LC_MESSAGES=en_CA.UTF-8    LC_PAPER=en_CA.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] sdmTMB_0.4.1  dplyr_1.1.4   ggplot2_3.4.4

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.12        TMB_1.9.10         nloptr_2.0.3       pillar_1.9.0       compiler_4.2.2     class_7.3-20       tools_4.2.2       
 [8] boot_1.3-28        lme4_1.1-35.1      lifecycle_1.0.4    tibble_3.2.1       nlme_3.1-160       gtable_0.3.4       lattice_0.20-45   
[15] mgcv_1.8-41        pkgconfig_2.0.3    rlang_1.1.3        Matrix_1.6-1.1     cli_3.6.2          DBI_1.2.1          rstudioapi_0.15.0 
[22] mvtnorm_1.2-4      e1071_1.7-14       withr_2.5.2        fmesher_0.1.5      generics_0.1.3     vctrs_0.6.5        classInt_0.4-10   
[29] grid_4.2.2         tidyselect_1.2.0   glue_1.7.0         sf_1.0-15          R6_2.5.1           fansi_1.0.6        sp_2.1-2          
[36] minqa_1.2.6        magrittr_2.0.3     MASS_7.3-58.1      units_0.8-5        scales_1.3.0       emmeans_1.9.0      splines_4.2.2     
[43] assertthat_0.2.1   colorspace_2.1-0   xtable_1.8-4       KernSmooth_2.23-20 utf8_1.2.4         proxy_0.4-27       estimability_1.4.1
[50] munsell_0.5.0     

I'll also see if I can get some of my more R-savvy colleagues here at GFC to try and reproduce the issue.

seananderson commented 8 months ago

It's possible it's related to the libopenblasp here and the more usual Matrix version issue on the Windows machine. I believe I would have the same error on continuous integration without this line: https://github.com/pbs-assess/sdmTMB/blob/59e4072886be44a2848d3d6f0dd7d76c13c10109/.github/workflows/R-CMD-check.yaml#L87

Regardless, the best path forward is for me to bump the version on CRAN to build a new binary, which I will prioritize doing in the next day or so.

If that doesn't solve things, I'll fire up a Docker image and see if I can debug with that BLAS/LAPACK setup.
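If the BLAS/LAPACK setup is the suspect, one rough check that sidesteps TMB and sdmTMB entirely is to run a supernodal sparse Cholesky factorization on a matrix that is positive definite by construction, using the same `Matrix::Cholesky(..., super = TRUE)` call that appears in the warnings above. This is only a sketch: the test matrix is an arbitrary choice, and a passing result does not clear the setup.

```r
library(Matrix)

# Assumption: a strictly diagonally dominant symmetric tridiagonal matrix,
# which is positive definite by construction.
n <- 500
A <- bandSparse(n, k = c(0, 1),
                diagonals = list(rep(4, n), rep(-1, n - 1)),
                symmetric = TRUE)
b <- rep(1, n)

cholmod_ok <- tryCatch({
  ch <- Cholesky(A, super = TRUE)  # supernodal factorization via CHOLMOD
  x <- solve(ch, b)                # solve A x = b using the factor
  max(abs(as.vector(A %*% x) - b)) < 1e-8
}, error = function(e) FALSE, warning = function(w) FALSE)

cholmod_ok
```

A `FALSE` here on a matrix that is clearly positive definite would point away from sdmTMB and toward the Matrix or linear algebra installation.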

davjfish commented 8 months ago

Ok great. Thanks for your help with troubleshooting this.

seananderson commented 8 months ago

OK, version 0.4.2 is now on CRAN. The Mac binaries are built. The Windows binaries will probably be built in the next day or so. It occurs to me now that I don't know how Linux and CRAN interact. Maybe they don't build binaries for you?

davjfish commented 8 months ago

Sorry, still not working.

I tried it on a clean install and installed the packages as follows:

install.packages("ggplot2")
install.packages("dplyr")
install.packages("sdmTMB")

All of the packages are installed from source. I think you are correct that binaries are not built for Linux users; at least not with the way our machine is set up.
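Whether `install.packages()` will build from source by default can be read from the `pkgType` option. A minimal check, assuming the default option has not been overridden in a profile file:

```r
# "source" means install.packages() compiles locally by default;
# other values (e.g. "both" on Windows and macOS) allow CRAN binaries.
pkg_type <- getOption("pkgType")
from_source <- identical(pkg_type, "source")
message("pkgType = ", pkg_type, "; source installs by default: ", from_source)
```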

Here is the session info:

R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8     LC_MONETARY=en_CA.UTF-8   
 [6] LC_MESSAGES=en_CA.UTF-8    LC_PAPER=en_CA.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] sdmTMB_0.4.2  dplyr_1.1.4   ggplot2_3.4.4

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.12        TMB_1.9.10         nloptr_2.0.3       pillar_1.9.0       compiler_4.2.2     class_7.3-20       tools_4.2.2       
 [8] boot_1.3-28        lme4_1.1-35.1      lifecycle_1.0.4    tibble_3.2.1       nlme_3.1-160       gtable_0.3.4       lattice_0.20-45   
[15] mgcv_1.8-41        pkgconfig_2.0.3    rlang_1.1.3        Matrix_1.5-1       cli_3.6.2          DBI_1.2.1          e1071_1.7-14      
[22] withr_3.0.0        fmesher_0.1.5      generics_0.1.3     vctrs_0.6.5        classInt_0.4-10    grid_4.2.2         tidyselect_1.2.0  
[29] glue_1.7.0         sf_1.0-15          R6_2.5.1           fansi_1.0.6        sp_2.1-2           minqa_1.2.6        magrittr_2.0.3    
[36] MASS_7.3-58.1      scales_1.3.0       splines_4.2.2      units_0.8-5        assertthat_0.2.1   colorspace_2.1-0   utf8_1.2.4        
[43] KernSmooth_2.23-20 proxy_0.4-27       munsell_0.5.0     

I will try on my Windows computer once the binaries are available.

davjfish commented 8 months ago

I did a fresh install on Windows and ran into the same error. I also had a colleague do this on their Windows PC and they got the same error. We are both running R 4.2.2.

R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                           LC_TIME=English_United States.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] sdmTMB_0.4.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.12        TMB_1.9.10         nloptr_2.0.3       pillar_1.9.0       compiler_4.2.2     class_7.3-20      
 [7] tools_4.2.2        boot_1.3-28        lme4_1.1-35.1      lifecycle_1.0.4    tibble_3.2.1       nlme_3.1-160      
[13] lattice_0.20-45    mgcv_1.8-41        pkgconfig_2.0.3    rlang_1.1.3        Matrix_1.5-1       DBI_1.2.1         
[19] cli_3.6.2          e1071_1.7-14       fmesher_0.1.5      dplyr_1.1.4        generics_0.1.3     vctrs_0.6.5       
[25] classInt_0.4-10    grid_4.2.2         tidyselect_1.2.0   glue_1.7.0         sf_1.0-15          R6_2.5.1          
[31] fansi_1.0.6        sp_2.1-2           minqa_1.2.6        magrittr_2.0.3     units_0.8-5        splines_4.2.2     
[37] MASS_7.3-58.1      assertthat_0.2.1   KernSmooth_2.23-20 utf8_1.2.4         proxy_0.4-27 

seananderson commented 8 months ago

I just confirmed that the following works on my DFO Windows laptop with several recent Matrix and TMB versions:

library(sdmTMB)
m <- sdmTMB(
  data = pcod,
  formula = present ~ depth_scaled + depth_scaled2,
  mesh = make_mesh(pcod, c("X", "Y"), cutoff = 10),
  family = binomial(link = "logit"),
  spatial = "on"
)

However, the Matrix version above is very old (Matrix_1.5-1, 2022-09-13) and may not be compatible with TMB 1.9.10 (depending on whether it was built from source?). This breaking Matrix ABI change has been a big pain.
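To place a given Matrix version relative to the ABI change, the version strings can be compared directly. The cutoff below is an assumption taken from this thread, which describes Matrix 1.6-1.1 as predating the change; `matrix_era` is a hypothetical helper.

```r
# Assumption from this thread: Matrix 1.6-1.1 is from before the ABI change.
last_pre_change <- package_version("1.6-1.1")

# Hypothetical helper: classify a Matrix version string against that cutoff.
matrix_era <- function(version) {
  if (package_version(version) <= last_pre_change) "pre-change" else "post-change"
}

matrix_era("1.5-1")  # version from the session info above
matrix_era("1.6-5")

# Classify the locally installed Matrix, if there is one
if (nzchar(system.file(package = "Matrix"))) {
  matrix_era(as.character(utils::packageVersion("Matrix")))
}
```

The mismatch described in this thread arises when the installed Matrix falls on one side of the cutoff and TMB was compiled against a Matrix from the other side.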

Can you confirm the following still does not work for you given current Matrix and TMB packages?

install.packages("Matrix")
install.packages("TMB")
install.packages("sdmTMB")

# restart R / RStudio to be safe... then

library(sdmTMB)
m <- sdmTMB(
  data = pcod,
  formula = present ~ depth_scaled + depth_scaled2,
  mesh = make_mesh(pcod, c("X", "Y"), cutoff = 10),
  family = binomial(link = "logit"),
  spatial = "on"
)

CRAN checks seem fine and all binaries (except 'patched' Linux) are built. Hopefully it's an issue with an old Matrix...

davjfish commented 8 months ago

When I do the above, it works on the Windows computer! Unfortunately, still no luck on the Linux computer.

When my colleague first tried this on the DFO computer:

install.packages("sdmTMB")
library(sdmTMB)
m <- sdmTMB(
  data = pcod,
  formula = present ~ depth_scaled + depth_scaled2,
  mesh = make_mesh(pcod, c("X", "Y"), cutoff = 10),
  family = binomial(link = "logit"),
  spatial = "on"
)

it worked because he had several dependencies already installed. However, only after wiping out the C:\Users\USER\AppData\Local\R\win-library\4.2 folder did he get the famous error message.
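Since the behaviour changed with the contents of the user library, it may be worth listing every copy of the key packages across all library paths. The session info earlier in the thread shows Matrix 1.5-1 loaded after a fresh install, which would be consistent with an older copy in another library being picked up, though that is a guess. `find_copies` is a hypothetical helper built on base R.

```r
# Hypothetical helper: list every installed copy of the given packages
# across all libraries on the search path.
find_copies <- function(pkgs = c("Matrix", "TMB", "sdmTMB")) {
  ip <- installed.packages(lib.loc = .libPaths())
  hit <- ip[ip[, "Package"] %in% pkgs,
            c("Package", "LibPath", "Version", "Built"), drop = FALSE]
  rownames(hit) <- NULL  # avoid duplicate row names when copies repeat
  as.data.frame(hit, stringsAsFactors = FALSE)
}

.libPaths()    # libraries are searched in this order
find_copies()
```

More than one row for Matrix would mean the copy that actually loads depends on the order of `.libPaths()`.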

JoleneSutton commented 7 months ago

Hi, just chiming in to add my support for finding a resolution to using sdmTMB on a Linux computer.

seananderson commented 7 months ago

@JoleneSutton can you provide more details? Installed from CRAN? Installed from source or binary? GitHub? Matrix and TMB up to date? Can you post the output of sessionInfo()? Anything in your R Makevars file?

Nothing inherent to Linux systems should cause this. I regularly use the package on Linux systems, it's tested on 3 Linux systems with every push to GitHub, and the CRAN servers test it on many Linux systems.

I'd like to get to the bottom of this! It's likely something about a specific setup and maybe with multiple data points we can track this down.

JoleneSutton commented 7 months ago

Hi @seananderson, yes, sorry, I should have been clearer. It is the same machine, and thus the same error messages, as described by @davjfish. I'm just hoping to be able to switch my scripts to that machine in order to free up my laptop. We still seem to be having issues with Linux, per the post from Jan. 22. Really appreciate all your help with this!

seananderson commented 7 months ago

I just spent a while debugging this with someone (with raw TMB/RTMB code, nothing to do with sdmTMB) who also had R version 4.2.2 installed, and even installing Matrix and TMB from source in that order did not fix it (edit: it did fix it, but TMB had to be built from source and R had to be restarted).

seananderson commented 7 months ago

It is still highly likely that the issue is an old Matrix package install. I see above that the installed version of Matrix is old. Current version is 1.6-5. Even for that person with R 4.2.2 I mentioned earlier today, once they installed the latest Matrix, then installed TMB from CRAN from source, the problem fixed itself. In this case (with an older R), you likely then also have to install sdmTMB from source. I can post some RTMB code that could be run to simplify testing a bit by eliminating the sdmTMB layer.

JoleneSutton commented 7 months ago

We upgraded to R 4.3.2 on the Linux machine and installed the updated packages, but unfortunately we are still having the same issues.

Here's the code:

install.packages("Matrix")
install.packages("TMB")
install.packages("sdmTMB")
# restart R / RStudio to be safe... then
library(sdmTMB)
m <- sdmTMB(
  data = pcod,
  formula = present ~ depth_scaled + depth_scaled2,
  mesh = make_mesh(pcod, c("X", "Y"), cutoff = 10),
  family = binomial(link = "logit"),
  spatial = "on")

Here's the error message:

Error in stats::nlminb(start = tmb_obj$par, objective = tmb_obj$fn, gradient = tmb_obj$gr,  :
  NA/NaN gradient evaluation
In addition: Warning messages:
1: In .local(A, ...) :
  CHOLMOD warning 'matrix not positive definite' at file '../Supernodal/t_cholmod_super_numeric.c', line 911
2: In stats::nlminb(start = tmb_obj$par, objective = tmb_obj$fn, gradient = tmb_obj$gr,  :
  NA/NaN function evaluation
3: In .local(A, ...) :
  CHOLMOD warning 'matrix not positive definite' at file '../Supernodal/t_cholmod_super_numeric.c', line 911

And the session info:

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8
 [4] LC_COLLATE=en_CA.UTF-8     LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                  LC_ADDRESS=C
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

time zone: America/Halifax
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] sdmTMB_0.4.2

loaded via a namespace (and not attached):
 [1] Matrix_1.6-5       dplyr_1.1.4        compiler_4.3.2     tidyselect_1.2.0
 [5] Rcpp_1.0.12        assertthat_0.2.1   splines_4.3.2      boot_1.3-28.1
 [9] lattice_0.21-9     R6_2.5.1           generics_0.1.3     classInt_0.4-10
[13] sf_1.0-15          MASS_7.3-60        tibble_3.2.1       nloptr_2.0.3
[17] fmesher_0.1.5      units_0.8-5        minqa_1.2.6        DBI_1.2.2
[21] TMB_1.9.10         pillar_1.9.0       rlang_1.1.3        utf8_1.2.4
[25] sp_2.1-3           cli_3.6.2          magrittr_2.0.3     mgcv_1.9-0
[29] class_7.3-22       grid_4.3.2         lme4_1.1-35.1      lifecycle_1.0.4
[33] nlme_3.1-163       vctrs_0.6.5        KernSmooth_2.23-22 proxy_0.4-27
[37] glue_1.7.0         fansi_1.0.6        e1071_1.7-14       tools_4.3.2
[41] pkgconfig_2.0.3

seananderson commented 7 months ago

I'm running out of ideas. You can confirm these built from source and were not installed from binaries?

install.packages("Matrix")
install.packages("TMB")
install.packages("sdmTMB")

I wondered if it could be the BLAS/LAPACK setup, but I just found someone with the same versions as you and it works for them. Again, you're sure the above installed from source?

As a troubleshooting exercise, does the following code run for you on this server down to the sdmTMB part? i.e., down to line 93 or so. https://github.com/seananderson/RTMB-TESA-spatial/blob/main/exercises/05-spatiotemporal-spde.R

Then we can isolate if this is an sdmTMB install issue or a more fundamental TMB issue.

stoyelq commented 4 months ago

This issue still persists on a fresh install in Ubuntu 22. I tried installing everything from source and ran into the same NA/NaN gradient / matrix not positive definite errors. I also tried a clean install duplicating the steps in the passing GitHub Actions workflow, without any luck.

I tried the troubleshooting exercise and it crashes out on line 90 with the same type of error:

> opt <- nlminb(obj$par, obj$fn, obj$gr)
Error in .local(A, ...) :
  leading principal minor of order 405 is not positive
In addition: Warning message:
In .local(A, ...) :
  CHOLMOD warning 'matrix not positive definite' at file 'Supernodal/t_cholmod_super_numeric_worker.c', line 1114
Error in .local(A, ...) :
  leading principal minor of order 405 is not positive
In addition: Warning messages:
1: In nlminb(obj$par, obj$fn, obj$gr) : NA/NaN function evaluation
2: In .local(A, ...) :
  CHOLMOD warning 'matrix not positive definite' at file 'Supernodal/t_cholmod_super_numeric_worker.c', line 1114
Error in ff(x, order = 1) :
  inner newton optimization failed during gradient calculation
outer mgc:  NaN
Error in nlminb(obj$par, obj$fn, obj$gr) : NA/NaN gradient evaluation
>

seananderson commented 4 months ago

@stoyelq is this on the same server as above or a different Ubuntu setup? If it's different then maybe we can figure out what's in common?

This shouldn't be a general problem with Ubuntu 22 + sdmTMB or Ubuntu + OpenBLAS + sdmTMB. Both are regularly tested and used without issue (here on GitHub Actions, on CRAN, by me personally, and by many others). There must be something about this specific system setup. Probably the best hope of solving this is with Docker. If someone can reproduce the problem in Docker and point me to the Dockerfile, then I can build it and troubleshoot.

It's also worth confirming if this is something unique to sdmTMB or if this happens with other TMB random effects models built locally. E.g., starting with a basic random effects model such as 'thetalog.R', and if that works, also trying an SPDE spatial model as in 'spde.R'. Both are in this examples folder: https://github.com/kaskr/adcomp/blob/master/tmb_examples/