Export all the required function for multi-core loo

paul-buerkner / brms

brms R package for Bayesian generalized multivariate non-linear multilevel models using Stan

https://paul-buerkner.github.io/brms/

GNU General Public License v2.0

1.27k stars 182 forks source link

Export all the required function for multi-core loo #1225

Closed zfarley closed 2 years ago

zfarley commented 3 years ago

Hi,

This is my first post here so my apologies if I miss something. Windows 10, brms 2.16.1, loo 2.4.1, R 4.0.3, Rstudio 1.4.1106

@martinmodrak suggested creating this post after our discussion here: https://discourse.mc-stan.org/t/brms-loo-moment-match-t-does-not-appear-to-work/24013/6.

When setting moment_match = TRUE with either loo or add_criterion with a brms model with family = "dirichlet", I get the following error (untested for other families):

Error in checkForRemoteErrors(val) : 2 nodes produced errors; first error: could not find function "bind" Error: Moment matching failed. Perhaps you did not set 'save_pars = save_pars(all = TRUE)' when fitting your model?

Here are my sources for creating the bind function: https://discourse.mc-stan.org/t/dirichlet-regresion-using-brms/8591/3 https://github.com/paul-buerkner/brms/issues/1016

As well as a past post regarding issues with moment_match and save_pars being resolved: https://discourse.mc-stan.org/t/error-with-loo-moment-match/18914/18

Here is some reproducible code:

library(DirichletReg)
library(brms)

# Generate values from dirichlet distribution
a <- as.data.frame(rdirichlet(100, c(5,5,10,2,1)))

# Create a predictor variable
set.seed(1234)
a$P1 <- sample(1:100, 100, replace = T)

# Create a grouping variable
a$G1 <- 1:100

# Create bind function
bind <- cbind

# Fit model - run time ~ 1 minute
fit <- brm(bind(V1, V2, V3, V4, V5) ~ P1 + (1|G1), data = a, seed = 1234, iter = 1000,
           chains = 4, cores = 4, family = "dirichlet", save_pars = save_pars(all = TRUE))

# Check loo, 6 bad pareto-k values
loo(fit, cores = 4)

# Check loo with moment matching
# Produces error:
# Error in checkForRemoteErrors(val) : 
#  4 nodes produced errors; first error: could not find function "bind"
# Error: Moment matching failed. Perhaps you did not set 'save_pars = save_pars(all = TRUE)' when fitting your model?
loo(fit, moment_match = T, cores = 4)

# Attempt with add_criterion
# Same error
fit2 <- add_criterion(fit, "loo", moment_match = T, cores = 4)

Any help would be greatly appreciated. Thanks

paul-buerkner commented 3 years ago

without parallelization is should work that is set cores = 1 in add_criterion. exporting all the required functions is quite difficult and I am not sure how to solve this in general.

zfarley @.***> schrieb am Fr., 3. Sept. 2021, 21:14:

Hi,

This is my first post here so my apologies if I miss something. Windows 10, brms 2.16.1, loo 2.4.1, R 4.0.3, Rstudio 1.4.1106

@martinmodrak https://github.com/martinmodrak suggested creating this post after our discussion here: https://discourse.mc-stan.org/t/brms-loo-moment-match-t-does-not-appear-to-work/24013/6 .

When setting moment_match = TRUE with either loo or add_criterion with a brms model with family = "dirichlet", I get the following error (untested for other families):

Error in checkForRemoteErrors(val) : 2 nodes produced errors; first error: could not find function "bind" Error: Moment matching failed. Perhaps you did not set 'save_pars = save_pars(all = TRUE)' when fitting your model?

Here are my sources for creating the bind function: https://discourse.mc-stan.org/t/dirichlet-regresion-using-brms/8591/3

1016 https://github.com/paul-buerkner/brms/issues/1016

As well as a past post regarding issues with moment_match and save_pars being resolved: https://discourse.mc-stan.org/t/error-with-loo-moment-match/18914/18

Here is some reproducible code:

library(DirichletReg) library(brms)

Generate values from dirichlet distribution

a <- as.data.frame(rdirichlet(100, c(5,5,10,2,1)))

Create a predictor variable

set.seed(1234) a$P1 <- sample(1:100, 100, replace = T)

Create a grouping variable

a$G1 <- 1:100

Create bind function

bind <- cbind

Fit model - run time ~ 1 minute

fit <- brm(bind(V1, V2, V3, V4, V5) ~ P1 + (1|G1), data = a, seed = 1234, iter = 1000, chains = 4, cores = 4, family = "dirichlet", save_pars = save_pars(all = TRUE))

Check loo, 6 bad pareto-k values

loo(fit, cores = 4)

Check loo with moment matching

Produces error:

Error in checkForRemoteErrors(val) :

4 nodes produced errors; first error: could not find function "bind"

Error: Moment matching failed. Perhaps you did not set 'save_pars = save_pars(all = TRUE)' when fitting your model?

loo(fit, moment_match = T, cores = 4)

Attempt with add_criterion

Same error

fit2 <- add_criterion(fit, "loo", moment_match = T, cores = 4)

Any help would be greatly appreciated. Thanks

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/paul-buerkner/brms/issues/1225, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCW2AGSHYQD3P5TEOMX4UDUAEGAXANCNFSM5DMHSRDQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

zfarley commented 3 years ago

Thank you for the help.

I have been able to successfully apply moment_match = T to the test dataset by not using parallelization.

Where I am now having trouble is with past models I have run which I have already added loo (and waic) without moment_match = T and saved as .rds files via the following code (picking up from the initial post):

fit <- add_criterion(fit, c("loo", "waic"), cores = 4)
saveRDS(fit, "fit.rds")

I then read the models back in (after having cleared, closed, and restarted R days later):

fit <- readRDS("fit.rds")

# without overwrite = T, loo does not change
new_fit <- add_criterion(fit, "loo", overwrite = T, moment_match = T)

This produces the following error: Error in private:public : argument of length 0

Please let me know if providing a model that produces this error would be helpful.

Thank you

paul-buerkner commented 3 years ago

I have no idea what this error means. Still happens with cores = 1? If yes, then a reprex (oder small fitted model that reproduces this) could help.

zfarley commented 3 years ago

So I uninstalled both R (updated to 4.1.1) and Rstudio and reinstalled.

I am still getting that error for attached model Test1.rds

Error in private:public : argument of length 0

I am also getting this error with a different model (Test2.rds, attached):

Error in object@.MISC$stan_fit_instance$log_prob(upars, adjust_transform,  : 
  Exception: Exception: dirichlet_lpmf: prior sample sizes[5] is 0, but must be > 0!  (in 'model418081d3bea_45d1ae0d89d3bd9953f51a741c66973d' at line 12)
  (in 'model418081d3bea_45d1ae0d89d3bd9953f51a741c66973d' at line 145)

Error: Moment matching failed. Perhaps you did not set 'save_pars = save_pars(all = TRUE)' when fitting your model?

In both models, I fit them, added both loo (without moment_match = T) and waic, then saved them. I then read them back in and attempted to:

fit2 <- brms::add_criterion(fit, "loo", overwrite = T, moment_match = T)

This produces the errors above.

In order to assess if there was an issue with either the model being read into R or how it had been fit previously, I re-ran them and tried to add loo with moment_match = T directly after fitting (cores set to default). This produced the same errors.

Here are my model specifications if it helps

  Test1 <- brms::brm(bind(Cat1, Cat2, Cat3, Cat4, Cat5) ~  P1 + (1|G1) + (1|G2),
                             data = Data1, seed = 1234, control = list(max_treedepth = 15, adapt_delta = 0.99), iter = 8000,
                             chains = 4, cores = 4, family = "dirichlet", save_pars = save_pars(all = TRUE))

  Test2 <- brms::brm(bind(Cat1, Cat2, Cat3, Cat4, Cat5) ~  P1 + (1|G1),
                             data = Data2, seed = 1234, control = list(max_treedepth = 15, adapt_delta = 0.99), iter = 8000,
                             chains = 4, cores = 4, family = "dirichlet", save_pars = save_pars(all = TRUE))

I am trying to attach the models but .rds are not supported. I tried putting them in a zip file, no luck. What is the best way to get you the models?

Thank you for your help.

zfarley commented 3 years ago

Just as an update, I tried finding more about the second error indicating a prior sample size was 0.

Based on these posts: https://discourse.mc-stan.org/t/initializing-simplex-dirichlet/6289 https://discourse.mc-stan.org/t/simplex-regression/1560

I double checked I do not have any true zeros in any of my response categories which I do not.

This vignette on the Dirichlet distribution tells me where the dirichlet_lpmf is derived from but unfortunately, I do not know how to access the prior sample sizes or why they would be 0. https://mc-stan.org/docs/2_27/functions-reference/dirichlet-distribution.html

I tried the following which did not find any values equal to 0:

subset(as_draws_df(Test2)[5],  as_draws_df(Test2)[5] == 0)

I am unsure whether this is related to sampling or something in my data.

zfarley commented 3 years ago

@paul-buerkner is there anything else I can do on my end to try and resolve this?

I very much appreciate your time and help.

paul-buerkner commented 3 years ago

I am not sure at this point. Seems like a general problem that I would hope has a general fix but unfortunately I am not aware of one at the moment.

zfarley commented 2 years ago

@paul-buerkner Would getting you my models help the process at all? Do you have any idea what the error regarding prior sample size being 0 is in relation to?

Do you have any suggestions for alternative brms model validation and comparisons when a large number of pareto-k values are high? If I understand correctly, WAIC will be just as poor of an approximation of cross validation but will not give the warnings such as loo does.

Thank you

paul-buerkner commented 2 years ago

If the models are not too large and you can provide me with the exact code that fails for each model (reproducible example), it might help. moment matching is a quite fragile procedure still and it is hard to get stable through all the involved computation layers. Of course, you can use reloo instead of moment_match but this will take some time because the model will be refit several times.

zfarley commented 2 years ago

@paul-buerkner I wanted to give a little update as I have still been trying to diagnose this a bit.

At your request, I was trying to create a smaller model to give you but was having trouble creating one from simulated data that would produce the error. I ended up just using a subset of my data (random selected based on grouping variable). When I subset my data to 50%, brms::add_criterion(fit, "loo", moment_match = T) worked. When using either 75% or 100%, I get the same error as above regarding prior sample sizes.

50% = 88 observations 75% = 132 observations 100% = 176 observations

I then tried these same subsets (50%, 75%, and 100%) using reduced iterations (8000 down to 4000) and reduced the number of chains (4 to 3) just to make sure it was not related to my model specifications. Also, if there is an issue with moment_match = T in regards to model size, I thought this could help. This time 50% and 100% worked but 75% did not.

I then grabbed different observations randomly for each subset and re-ran everything with the reduced iterations/chains because this runs much quicker. I found that sometimes the 50% subset worked and sometimes it didn't. I did the same with more iterations and an additional chain and the same pattern took place (sometimes 50% worked, sometimes not).

Here is table summarizing my results with Y = worked and N = did not work (note - unfortunately I did not think to pair the datasets so T1 in the large column is not the exact same data as T1 in the small data) git_hub_table

I am not sure what to make of this, especially considering it never worked at the 75% level but consistently did at the 100% for the small data.

As a side note, I came across this worked example: https://avehtari.github.io/modelselection/roaches.html, where it is mentioned that PSIS-LOO can fail for random effect models:

We got serious warnings, which in this case is due to having a “random effect” parameter for each observation. Removing one observation changes the posterior for that random effect so much that importance sampling fails (even with Pareto smoothing). We have a very large number of high k values which indicates a very flexible model. While importance sampling in PSIS-LOO can fail for “random effect” model, we can use K-fold-CV instead to re-fit the model 10 times, each time leaving out 10% of the observations. This shows that cross-validation itself is not infeasible for “random effect” models.

Although this does not explain the issue at hand, I think it is important information for others regarding the impetus for using moment_match =T in the first place (i.e., to reduce the number of high pareto - k values which, if using a model with random effects, may be a poor decision compared to using k-fold-CV).

paul-buerkner commented 2 years ago

Thank you so much for your detailed analysis! I really appreciate it! The results are confusing to me too and I am not yet sure what to make out of it. I will have to go into more detail but I am a bit overwhelmed with a lot of stuff at the moment so don't know when I have the time to actually work on some sort of fix; If it is fixable from the brms side at all.

paul-buerkner commented 2 years ago

I have now checked with my new mac and I can confirm this error. However, the error happens inside loo::loo_moment_match.default so I cannot fix this from the brms side unfortunately. I suggest you open an issue at https://github.com/stan-dev/loo if that error persists for you. I am sorry that I cannot do more for you at this point.