paul-buerkner / brms

brms R package for Bayesian generalized multivariate non-linear multilevel models using Stan
https://paul-buerkner.github.io/brms/
GNU General Public License v2.0
1.28k stars 184 forks source link

Error in `loo()`, `pp_check()`: mgcv::PredictMat can't find by variable #1547

Closed grcatlin closed 11 months ago

grcatlin commented 1 year ago

Hi there,

Thanks for your quick response the other day and thanks again for this amazing package!

I believe I may be getting a partial presentation of #1473 though I can't be sure how related it is. On the latest CRAN & GitHub versions of brms, using loo() and pp_check() (and possibly more functions) on some models with splines with by = statements will cause the following error:

Error in mgcv::PredictMat(object, data = data, ...) : 
  Can't find by variable

From #1473, I'm guessing this is a cmdstan issue. Rolling back to brms 2.17.0 as suggested in that thread fixes the error.

remotes::install_github("paul-buerkner/brms@a43937c")

reprex

Here is a reprex, I tried to generate data similar to that I'm actually using and was successful in reproducing the error message.

# generate data ----

# seed
set.seed(0)

# response
y <- rpois(1000, 3) |>
    c(rep(0, 250)) |>
    sample()

# by variable
fac <- c("A", "B", "C") |> sample(length(y), replace = TRUE)

# gam sim
dat <- mgcv::gamSim(n = length(y)) |>
    # convert to data.table
    data.table::as.data.table() |>
    # keep x0, x1
    _[, .(x0, x1)]

# combine
dat <- dat |>
    # add in y, fac
    cbind(y, fac) |>
    # reorder
    data.table::setcolorder("y") |>
    # cut off y (mimics my data)
    _[y <= 6] |>
    # back to data.frame for reprex
    as.data.frame()

# model ----

# zip
fit <- brms::brm(
    y ~ 1 + s(x0, by = fac) + s(x1, by = fac),
    family = brms::zero_inflated_poisson(),
    data = dat,
    warmup = 500,
    iter = 1000,
    backend = "cmdstanr",
    silent = 0
)

# pp check (fails)
brms::pp_check(fit)

# conditional effects (works)
brms::conditional_effects(fit)

# loo (fails)
brms::loo(fit)

sessionInfo()

Fails:

> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.0

other attached packages:
[1] data.table_1.14.8 mgcv_1.9-0        nlme_3.1-163      cmdstanr_0.6.1   
[5] brms_2.20.5      Rcpp_1.0.11

> cmdstanr::cmdstan_version()
[1] "2.33.1"

Works:

> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.0

other attached packages:
[1] data.table_1.14.8 mgcv_1.9-0        nlme_3.1-163      cmdstanr_0.6.1   
[5] brms_2.17.0       Rcpp_1.0.11    

> cmdstanr::cmdstan_version()
[1] "2.33.1"
paul-buerkner commented 1 year ago

Does the same thing happen if you load brms via library(brms)?

paul-buerkner commented 1 year ago

Also, I cannot create the data in this form since on R 4.2 I get the error:

dat <- dat |>
  # add in y, fac
  cbind(y, fac) |>
  # reorder
  data.table::setcolorder("y") |>
  # cut off y (mimics my data)
  _[y <= 6] |>
  # back to data.frame for reprex
  as.data.frame()

Error: pipe placeholder can only be used as a named argument
grcatlin commented 1 year ago

Ah, apologies! This should work:

# alt dat gen ----

# seed
set.seed(0)

# response
y <- rpois(1000, 3) |>
    c(rep(0, 250)) |>
    sample()

# by variable
fac <- c("A", "B", "C") |> sample(length(y), replace = TRUE)

# gam sim
dat <- mgcv::gamSim(n = length(y)) |>
    dplyr::select(x0, x1)

# combine
dat <- dat |>
    # add in y, fac
    cbind(y, fac) |>
    # reorder
    dplyr::relocate(y) |>
    # cutoff y (mimics my data)
    dplyr::filter(y <= 6)

Loading library(brms) still produces the error.

# model ----

library(brms)

# zip
fit <- brms::brm(
    y ~ s(x0, by = fac) + s(x1, by = fac),
    family = brms::zero_inflated_poisson(),
    data = dat,
    warmup = 500,
    iter = 1000,
    backend = "cmdstanr",
    silent = 0
)

# pp check (fails)
brms::pp_check(fit)

# conditional effects (works)
brms::conditional_effects(fit)

# loo (fails)
brms::loo(fit)

Interestingly is if I'm using 2.17.0 and fit the above model where no errors for loo() and pp_check() occur, save that fit, restart R, install 2.20.5, restart R, read the fit in, and try either function I get the same error:

# fit between versions? ----

# install 2.17.0
remotes::install_github("paul-buerkner/brms@a43937c")

# restart R

# data & model
library(brms)
# data, model code above

# save fit generated by 2.17.0
saveRDS(fit, "./test_fit.rds")

# restart R

# install brms 2.20.5
remotes::install_github("paul-buerkner/brms")

# restart R

# load brms
library(brms)

# read in fit
fit <- readRDS("./test_fit.rds")

# loo (fails)
brms::loo(fit)
bentrueman commented 11 months ago

In case it's helpful, I was able to get brms::loo() and brms::pp_check() to work in the reprex above by converting fac to a factor:

# seed
set.seed(0)

# response
y <- rpois(1000, 3) |>
  c(rep(0, 250)) |>
  sample()

# by variable
fac <- c("A", "B", "C") |> sample(length(y), replace = TRUE)

# gam sim
dat <- mgcv::gamSim(n = length(y)) |>
  dplyr::select(x0, x1)

# combine
dat <- dat |>
  # add in y, fac
  cbind(y, fac) |>
  # reorder
  dplyr::relocate(y) |>
  # cutoff y (mimics my data)
  dplyr::filter(y <= 6) |>
  dplyr::mutate(fac = as.factor(fac)) # converting "fac" to a factor prevents the error

# zip
fit <- brms::brm(
  y ~ 1 + s(x0, by = fac) + s(x1, by = fac),
  family = brms::zero_inflated_poisson(),
  data = dat,
  warmup = 500,
  iter = 1000,
  backend = "cmdstanr",
  silent = 0
)

# pp check (works)
brms::pp_check(fit)


# loo (works)
brms::loo(fit)
#> 
#> Computed from 2000 by 1216 log-likelihood matrix
#> 
#>          Estimate   SE
#> elpd_loo  -2286.0 17.1
#> p_loo        16.5  0.7
#> looic      4572.1 34.2
#> ------
#> Monte Carlo SE of elpd_loo is 0.4.
#> 
#> Pareto k diagnostic values:
#>                          Count Pct.    Min. n_eff
#> (-Inf, 0.5]   (good)     1215  99.9%   16        
#>  (0.5, 0.7]   (ok)          1   0.1%   13        
#>    (0.7, 1]   (bad)         0   0.0%   <NA>      
#>    (1, Inf)   (very bad)    0   0.0%   <NA>      
#> 
#> All Pareto k estimates are ok (k < 0.7).
#> See help('pareto-k-diagnostic') for details.

sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-apple-darwin20 (64-bit)
#> Running under: macOS Ventura 13.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/Halifax
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>   [1] gridExtra_2.3        inline_0.3.19        rlang_1.1.2         
#>   [4] magrittr_2.0.3       matrixStats_1.0.0    compiler_4.3.1      
#>   [7] mgcv_1.8-42          loo_2.6.0            callr_3.7.3         
#>  [10] vctrs_0.6.4          reshape2_1.4.4       stringr_1.5.0       
#>  [13] pkgconfig_2.0.3      crayon_1.5.2         fastmap_1.1.1       
#>  [16] backports_1.4.1      ellipsis_0.3.2       labeling_0.4.3      
#>  [19] utf8_1.2.4           threejs_0.3.3        cmdstanr_0.6.0      
#>  [22] promises_1.2.1       rmarkdown_2.25       markdown_1.11       
#>  [25] ps_1.7.5             purrr_1.0.2          xfun_0.41           
#>  [28] reprex_2.0.2         jsonlite_1.8.7       later_1.3.1         
#>  [31] styler_1.10.0        parallel_4.3.1       prettyunits_1.2.0   
#>  [34] R6_2.5.1             dygraphs_1.1.1.6     stringi_1.7.12      
#>  [37] StanHeaders_2.26.28  estimability_1.4.1   Rcpp_1.0.11         
#>  [40] rstan_2.32.3         knitr_1.45           zoo_1.8-12          
#>  [43] base64enc_0.1-3      R.utils_2.12.2       bayesplot_1.10.0    
#>  [46] httpuv_1.6.12        Matrix_1.5-4.1       splines_4.3.1       
#>  [49] R.cache_0.16.0       igraph_1.5.1         tidyselect_1.2.0    
#>  [52] rstudioapi_0.14      abind_1.4-5          yaml_2.3.7          
#>  [55] codetools_0.2-19     miniUI_0.1.1.1       curl_5.0.1          
#>  [58] processx_3.8.2       pkgbuild_1.4.2       lattice_0.21-8      
#>  [61] tibble_3.2.1         plyr_1.8.9           shiny_1.7.5.1       
#>  [64] withr_2.5.2          bridgesampling_1.1-2 posterior_1.5.0     
#>  [67] coda_0.19-4          evaluate_0.23        RcppParallel_5.1.7  
#>  [70] xts_0.13.1           pillar_1.9.0         tensorA_0.36.2      
#>  [73] checkmate_2.3.0      DT_0.30              stats4_4.3.1        
#>  [76] shinyjs_2.1.0        distributional_0.3.2 generics_0.1.3      
#>  [79] ggplot2_3.4.4        rstantools_2.3.1.1   munsell_0.5.0       
#>  [82] scales_1.2.1         gtools_3.9.4         xtable_1.8-4        
#>  [85] glue_1.6.2           emmeans_1.8.6        tools_4.3.1         
#>  [88] shinystan_2.6.0      data.table_1.14.8    colourpicker_1.3.0  
#>  [91] fs_1.6.3             mvtnorm_1.2-3        grid_4.3.1          
#>  [94] QuickJSR_1.0.7       crosstalk_1.2.0      colorspace_2.1-0    
#>  [97] nlme_3.1-162         cli_3.6.1            fansi_1.0.5         
#> [100] Brobdingnag_1.2-9    dplyr_1.1.3          V8_4.3.0            
#> [103] gtable_0.3.4         R.methodsS3_1.8.2    digest_0.6.33       
#> [106] brms_2.20.4          htmlwidgets_1.6.2    farver_2.1.1        
#> [109] htmltools_0.5.7      R.oo_1.25.0          lifecycle_1.0.3     
#> [112] mime_0.12            shinythemes_1.2.0

Created on 2023-11-07 with reprex v2.0.2

paul-buerkner commented 11 months ago

Now fixed on github.