paul-buerkner / brms

brms R package for Bayesian generalized multivariate non-linear multilevel models using Stan
https://paul-buerkner.github.io/brms/
GNU General Public License v2.0

kfold returning elpd NAN #1516

Closed jscamac closed 1 year ago

jscamac commented 1 year ago

Hi, I've successfully fitted two brms models in which I model canopy area using different non-linear functions (an example is shown below). Each model has converged with few (if any) divergent transitions, and each parameter has a good effective sample size.

However, when I try to run a group-split kfold validation I get NaNs (see below), and I'm not sure why.

packageVersion("brms")
[1] ‘2.18.0’

packageVersion("loo")
[1] ‘2.5.1’

packageVersion("cmdstanr")
[1] ‘0.5.3’

An example of the model I'm fitting looks like this:

 out <- brms::brm(
                 bf(
                   Canopy ~ log(Asym/(1+ exp(-beta * (Growth_years - Tmax)))),
                   beta ~ 1 + (1|Scientific) + street_tree,
                   Tmax ~ 1 + (1|Scientific) + street_tree,
                   Asym ~ 1 + (1|Scientific) + street_tree,
                   nl = TRUE),
                 prior = 
                   prior(normal(200, 100),lb=0.001, nlpar ="Asym") + 
                   prior(normal(0.01,100), lb=0.001, nlpar="Tmax") +
                   prior(normal(0,1), lb = 0, nlpar="beta"),
                 control = list(adapt_delta = 0.99, max_treedepth = 15),
                 family = lognormal,
                 backend = "cmdstanr", 
                 threads = 2,
                 init = 0,
                 data = eucalyptus_growth_dat,
                 chains = 3,
                 cores = 3)

As stated above, the model seems to fit with minimal issues (apart from taking an age to finish sampling; there are over 40,000 rows of data). Parameter estimates look reasonable. What I want to do is compare the predictive capacity of different model variants using a group-split kfold approach, where the interest is in how well the model predicts to withheld random-effect groups. To do that I use the following:

 brms::kfold(x = out,
             K = 5,
             folds = "grouped",
             group = "Scientific")

When I run the above, I get the following outcome with no error or warning messages:

Based on 5-fold cross-validation

           Estimate SE
elpd_kfold      NaN NA
p_kfold         NaN NA
kfoldic         NaN NA

What I think might be the issue is that there is considerable variability in the number of observations between each fold.

e.g.

    1     2     3     4     5 
 3084  2504 14119   986 20907 

I've looked at each of these folds and there appears to be reasonable variability in the other input variables (e.g. growth_years & street_tree). I've even managed to fit each of these fold subsets individually without running into convergence issues. I'm assuming that under the hood there is a problem with the level of unevenness among folds, though when I've tried to replicate this unevenness in mock datasets I don't run into the issue. Unfortunately I can't share the data. Any tips or advice?
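
Fold sizes like those listed above can be tabulated by hand. The sketch below uses a made-up data frame (`dat`, with invented group sizes) and a simple random assignment of whole groups to folds. It is only an illustration of why grouped folds become uneven when group sizes differ, and it is not the splitting code that brms or loo actually run.

```r
set.seed(1)

# Hypothetical data: 12 species with very different numbers of rows
group_sizes <- c(5, 8, 12, 20, 30, 45, 60, 100, 150, 300, 800, 2000)
dat <- data.frame(
  Scientific = rep(paste0("sp", seq_along(group_sizes)), times = group_sizes),
  stringsAsFactors = FALSE
)

K <- 5
groups <- unique(dat$Scientific)

# Assign each whole group to one of K folds.
# Simple random assignment, assumed here for illustration only.
group_fold <- sample(rep(seq_len(K), length.out = length(groups)))
names(group_fold) <- groups

# Every row inherits the fold of its group
fold_id <- unname(group_fold[dat$Scientific])

# Number of observations per fold
fold_sizes <- table(fold_id)
print(fold_sizes)
```

Because whole groups are kept together, one or two large groups can dominate a fold, which gives the kind of imbalance seen in the counts above.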

Oh, and in case you're interested, this is part of the pointwise output:

        elpd_kfold       p_kfold   kfoldic
    [1,]  -4.539832  3.795695e-01  9.079664
    [2,]  -4.605903  3.846158e-01  9.211806
    [3,]  -5.111670  2.462674e-01 10.223340
    [4,]  -5.059898  3.184598e-01 10.119795
    [5,]  -5.220611  2.450830e-01 10.441223
    [6,]  -5.311285  2.091175e-01 10.622569
    [7,]  -4.737715  4.543159e-01  9.475429
    [8,]  -4.828914  4.463188e-01  9.657829
    [9,]  -4.673482  4.691005e-01  9.346964
   [10,]  -5.298323  2.176782e-01 10.596645
   [11,]  -4.792010  5.030404e-01  9.584020
   [12,]  -4.881061  4.830266e-01  9.762122
   [13,]  -4.788471  5.238636e-01  9.576942
   [14,]  -4.787594  5.329466e-01  9.575187
   [15,]  -4.811155  5.378390e-01  9.622310
   [16,]  -4.788929  5.541308e-01  9.577859
   [17,]  -9.743180 -4.613748e+00 19.486359
   [18,] -10.342949 -4.544968e+00 20.685898
   [19,]  -4.615625 -3.565669e-01  9.231250
   [20,]  -3.992905  3.770351e-01  7.985810
   [21,]  -5.231787  2.939448e-01 10.463575
   [22,]  -5.240638  1.869140e-01 10.481277
   [23,]  -5.210137  2.735241e-01 10.420273
   [24,]  -4.907122  5.067121e-01  9.814244
   [25,]  -5.121841  3.715039e-01 10.243681
   [26,]  -4.253101  4.649373e-01  8.506202
   [27,]  -4.337621  5.543479e-01  8.675243
   [28,]  -4.392189  4.400904e-01  8.784379
   [29,]  -4.517743  5.398474e-01  9.035485
   [30,]  -3.475455  2.589708e-01  6.950910
   [31,]  -4.020786  5.061724e-02  8.041572
   [32,]  -4.666528  3.760590e-01  9.333056
   [33,]  -4.961022  3.242280e-01  9.922045
   [34,]  -4.521691  3.749092e-01  9.043382
   [35,]  -4.541400  3.485298e-01  9.082800
   [36,]  -4.621729  4.402935e-01  9.243458
   [37,]  -5.096600  3.269840e-01 10.193199
   [38,]  -4.766069  4.856513e-01  9.532138
   [39,]  -5.971669 -5.074343e-01 11.943338
   [40,]  -5.367172  7.109470e-02 10.734345
   [41,]  -4.992893  3.610960e-01  9.985787
   [42,]  -4.616530  1.087240e-01  9.233060
   [43,]  -5.919764 -1.681431e-01 11.839528
   [44,]  -6.410071 -4.884728e-01 12.820142
   [45,]  -6.070530 -2.773045e-01 12.141059
   [46,]  -6.957099 -8.632811e-01 13.914197
   [47,]        NaN           NaN       NaN
   [48,]        NaN           NaN       NaN
   [49,]        NaN           NaN       NaN
   [50,]        NaN           NaN       NaN
   [51,]        NaN           NaN       NaN
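
The summary estimates are sums over the pointwise values, so a single `NaN` row is enough to turn all of them into `NaN`. Below is a minimal base-R sketch for locating the affected observations. The matrix `pw` is a small stand-in built from the first two rows of the output above plus two `NaN` rows, and `fold_id` is a hypothetical vector of fold memberships.

```r
# Stand-in for the pointwise matrix (first two rows copied from the output above)
pw <- cbind(
  elpd_kfold = c(-4.539832, -4.605903, NaN, NaN),
  p_kfold    = c(0.3795695, 0.3846158, NaN, NaN),
  kfoldic    = c(9.079664, 9.211806, NaN, NaN)
)

# Hypothetical fold membership of each observation
fold_id <- c(1, 1, 3, 3)

# One NaN is enough to make the total NaN
total <- sum(pw[, "elpd_kfold"])
print(total)

# Indices of the observations with NaN pointwise elpd
bad <- which(is.nan(pw[, "elpd_kfold"]))
print(bad)

# Which folds the NaN observations belong to
print(table(fold_id[bad]))
```

If the `NaN` rows cluster in particular folds or groups, that points to predictions for those held-out groups rather than to the fold sizes themselves.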

jscamac commented 1 year ago

I've reposted this on the Stan Discourse, as I'm not sure whether this is a brms bug, an issue with the model, or potentially an issue with dependency packages such as loo or rstantools.

https://discourse.mc-stan.org/t/grouped-kfold-return-nan/31843/3

jscamac commented 1 year ago

This has been resolved. The problem was mostly associated with weak priors on non-linear parameters that need to be positive. The solution was to have the submodels estimate the non-linear parameters on the log scale and then exponentiate them in the main model formula.
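
A minimal sketch of what such a reparameterisation could look like for the model in the original post. The names `logbeta`, `logTmax` and `logAsym` and all prior values are placeholders chosen for illustration, not the ones used in the resolved model. Each submodel now describes the logarithm of a parameter, so `exp()` keeps beta, Tmax and Asym positive for every group, including held-out ones, whatever the group-level and `street_tree` effects are.

```r
library(brms)

# Main formula: same curve as before, with each non-linear parameter
# replaced by exp() of a log-scale parameter
form_log <- bf(
  Canopy ~ log(exp(logAsym) /
                 (1 + exp(-exp(logbeta) * (Growth_years - exp(logTmax))))),
  logbeta ~ 1 + (1 | Scientific) + street_tree,
  logTmax ~ 1 + (1 | Scientific) + street_tree,
  logAsym ~ 1 + (1 | Scientific) + street_tree,
  nl = TRUE
)

# Placeholder priors on the log scale (assumed values; tune to the data).
# No lower bounds are needed because exp() enforces positivity.
priors_log <-
  prior(normal(5, 1), nlpar = "logAsym") +
  prior(normal(3, 1), nlpar = "logTmax") +
  prior(normal(-2, 1), nlpar = "logbeta")
```

These objects would be passed to `brm()` as `formula = form_log` and `prior = priors_log`, with the remaining arguments unchanged.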