mjskay / tidybayes

Bayesian analysis + tidy data + geoms (R package)
http://mjskay.github.io/tidybayes
GNU General Public License v3.0
725 stars 59 forks source link

Request to help understand an apparent discrepancy between tidybayes::add_predicted_draws and brms::posterior_predict #280

Closed svats2k closed 3 years ago

svats2k commented 3 years ago

I am using the Howell1 dataset from rethinking package.

require(cmdstanr)
require(brms)
require(tidybayes)
require(rethinking)

data("Howell1")
d <- Howell1
d2 <- d[d$age > 18,]

d2$hs <- (d2$height - mean(d2$height))/ sd(d2$height)
d2$ws <- (d2$weight - mean(d2$weight))/ sd(d2$weight)

Building a simple brms model using one numeric and one categorical predictor

priors <- c(prior(normal(0,2), class = "Intercept"),
            prior(normal(0,2), class = 'b'),
            prior(cauchy(0,2), class = "sigma"))

m4.4 <- brm(formula = hs ~ 1 + ws + male, data = d2, family = gaussian,
          backend = "cmdstanr", prior = priors
          iter = 2000, warmup = 1000, chains = 4, cores = 4)

I am trying to understand how add_fitted_draws and add_predicted_draws work.

Considering add_fitted_draws:

i <- 4 # looking at the results for a particular row of the input dataset
y <- posterior_epred(m4.4)
x <- d2 %>% add_fitted_draws(model = m4.4, value = "epred")
x %>% as_tibble() %>% filter(.row ==i) %>% dplyr::select(epred) %>% cbind(fitdr = y[,i]) %>% mutate(diff = fitdr - epred)

Based on the documentation add_fitted_draw internally uses posterior_epred or its equivalent in brms and the results exactly match.

image

Now when I move on to do exactly the same between add_predicted_draws and posterior_predict, the results don't match

yp <- posterior_predict(m4.4)
xp <- d2 %>% add_predicted_draws(model = m4.4, prediction = "pred")
xp %>% as_tibble() %>% filter(.row ==i) %>% dplyr::select(pred) %>% cbind(preddr = yp[,i]) %>% mutate(diff = preddr - pred)

image

I am pretty sure there is a gap in my understanding, please advice.

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

other attached packages:
 [1] stringr_1.4.0        readr_1.4.0          tibble_3.0.4         tidyverse_1.3.0      MASS_7.3-53          bayesplot_1.8.0      cmdstanr_0.1.3       rethinking_2.13     
 [9] loo_2.4.1            gganimate_1.0.7      RColorBrewer_1.1-2   ggrepel_0.9.0        brms_2.14.4          Rcpp_1.0.5           rstan_2.21.2         StanHeaders_2.21.0-7
[17] cowplot_1.1.1        ggplot2_3.3.3        tidybayes_2.3.1      ggdist_2.4.0         modelr_0.1.8         tidyr_1.1.2          forcats_0.5.0        purrr_0.3.4         
[25] dplyr_1.0.2          magrittr_2.0.1      
mjskay commented 3 years ago

I suspect this would be because new draws from the posterior predictive distribution are taken from a random number generator when you call each of those functions. Thus you would need to set the same seed for the random number generator before each function in order to compare them (e.g. by using set.seed() or the seed argument to add_predicted_draws())

svats2k commented 3 years ago

Thanks @mjskay for the prompt response. I have tried to set the seed as well.

set.seed(1234)
yp <- posterior_predict(m4.4)
set.seed(1234)
xp <- add_predicted_draws(model = m4.4, prediction = "pred", newdata = d2)
xp %>% as_tibble() %>% 
  filter(.row ==i) %>% dplyr::select(pred) %>% cbind(preddr = yp[,i]) %>% 
  mutate(diff = preddr - pred) %>% 
  ggplot(aes(x = diff)) + geom_density()

The density plot of 'diff' is as follows:

image

If setting the seed were the issue, shouldn't I have noticed some difference in the add_fitted_draws check as well, though the latter only checks for conditional averages. They seem to match exactly to the 4th decimal place, while add_predicted_draws shows noticeable difference.

While this is highly unlikely to be a bug, I am mainly seeking clarity in my understanding.

mjskay commented 3 years ago

I just tried it and did not see any difference between the two outputs. Can you try it without backend = "cmdstanr"? That is the only difference between what I ran and the code you provided (I don't have cmdstan on this machine at the moment)

svats2k commented 3 years ago

I have tried rstan based runs as well

priors <- c(prior(normal(0,2), class = "Intercept"),
            prior(normal(0,2), class = 'b'),
            prior(cauchy(0,2), class = "sigma"))

m4.4 <- brm(formula = hs ~ 1 + ws, data = d2, family = gaussian, prior = priors, backend = "rstan")

i <- 4
set.seed(1234)
yp <- posterior_predict(m4.4)
set.seed(1234)
xp <- add_predicted_draws(model = m4.4, prediction = "pred", newdata = d2)
xp %>% 
  as_tibble() %>% filter(.row ==i) %>% 
  dplyr::select(pred) %>% cbind(preddr = yp[,i]) %>% 
  mutate(diff = preddr - pred) %>% 
  ggplot(aes(x = diff)) + geom_density()

The results don't match again image

I am wondering if there is any difference at all in the code that you've used versus mine. Can you please share it so that I can test it out on my system?

mjskay commented 3 years ago

I copied and pasted your code and ran it verbatim. The output plot is empty:

require(brms)
require(tidybayes)
require(rethinking)

data("Howell1")
d <- Howell1
d2 <- d[d$age > 18,]

d2$hs <- (d2$height - mean(d2$height))/ sd(d2$height)
d2$ws <- (d2$weight - mean(d2$weight))/ sd(d2$weight)

priors <- c(prior(normal(0,2), class = "Intercept"),
            prior(normal(0,2), class = 'b'),
            prior(cauchy(0,2), class = "sigma"))

m4.4 <- brm(formula = hs ~ 1 + ws, data = d2, family = gaussian, prior = priors, backend = "rstan")

i <- 4
set.seed(1234)
yp <- posterior_predict(m4.4)
set.seed(1234)
xp <- add_predicted_draws(model = m4.4, prediction = "pred", newdata = d2)
xp %>% 
  as_tibble() %>% filter(.row ==i) %>% 
  dplyr::select(pred) %>% cbind(preddr = yp[,i]) %>% 
  mutate(diff = preddr - pred) %>% 
  ggplot(aes(x = diff)) + geom_density()

image

And if I follow it up with a call to summary(), the difference is 0:

xp %>% 
  as_tibble() %>% filter(.row ==i) %>% 
  dplyr::select(pred) %>% cbind(preddr = yp[,i]) %>%
  mutate(diff = preddr - pred) %>%  
  with(summary(diff))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       0       0       0       0       0 

I wonder if this is a brms issue on your setup? Have you checked to see if multiple calls to posterior_predict yield identical results with the same seed? Something like this:

set.seed(1234)
yp1 <- posterior_predict(m4.4)
set.seed(1234)
yp2 <- posterior_predict(m4.4)

summary(as.vector(yp1 - yp2))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       0       0       0       0       0 

If that's not 0 I'd say it's a brms issue

svats2k commented 3 years ago

image

You are right. it does seem a brms issue. One last request before I close the issue, can you please share the output from sessionInfo()? Just so that I try to install the same versions that you have and confirm I observe the same as well.

The versions that I am using are:

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

other attached packages:
 [1] gridExtra_2.3        bayesplot_1.8.0      cmdstanr_0.1.3       rethinking_2.13      gganimate_1.0.7      RColorBrewer_1.1-2  
 [7] ggrepel_0.9.1        brms_2.14.4          Rcpp_1.0.6           rstan_2.21.2         StanHeaders_2.21.0-7 cowplot_1.1.1       
[13] ggplot2_3.3.3        tidybayes_2.3.1      ggdist_2.4.0         modelr_0.1.8         tidyr_1.1.2          forcats_0.5.0       
[19] purrr_0.3.4          dplyr_1.0.3          magrittr_2.0.1  
mjskay commented 3 years ago

Sure, here. I'm running the dev version of tidybayes but that shouldn't be an issue (given that this looks like a brms problem).

R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rethinking_2.13      rstan_2.21.2         ggplot2_3.3.3        StanHeaders_2.21.0-7 tidybayes_2.3.1.9000
[6] brms_2.14.4          Rcpp_1.0.5          

loaded via a namespace (and not attached):
  [1] minqa_1.2.4          colorspace_2.0-0     ellipsis_0.3.1       ggridges_0.5.3       rsconnect_0.8.16    
  [6] estimability_1.3     markdown_1.1         base64enc_0.1-3      rstudioapi_0.13      farver_2.0.3        
 [11] svUnit_1.0.3         DT_0.17              fansi_0.4.1          mvtnorm_1.1-1        bridgesampling_1.0-0
 [16] codetools_0.2-18     splines_4.0.3        knitr_1.30           shinythemes_1.1.2    bayesplot_1.8.0     
 [21] projpred_2.0.2       jsonlite_1.7.2       nloptr_1.2.2.2       ggdist_2.4.0         shiny_1.5.0         
 [26] compiler_4.0.3       emmeans_1.5.3        backports_1.2.1      assertthat_0.2.1     Matrix_1.3-2        
 [31] fastmap_1.0.1        cli_2.2.0            later_1.1.0.1        htmltools_0.5.1      prettyunits_1.1.1   
 [36] tools_4.0.3          igraph_1.2.6         coda_0.19-4          gtable_0.3.0         glue_1.4.2          
 [41] reshape2_1.4.4       dplyr_1.0.2          tinytex_0.28         V8_3.4.0             vctrs_0.3.6         
 [46] nlme_3.1-151         crosstalk_1.1.1      xfun_0.20            stringr_1.4.0        ps_1.5.0            
 [51] lme4_1.1-26          mime_0.9             miniUI_0.1.1.1       lifecycle_0.2.0      gtools_3.8.2        
 [56] statmod_1.4.35       MASS_7.3-53          zoo_1.8-8            scales_1.1.1         colourpicker_1.1.0  
 [61] promises_1.1.1       Brobdingnag_1.2-6    inline_0.3.17        shinystan_2.5.0      gamm4_0.2-6         
 [66] yaml_2.2.1           curl_4.3             gridExtra_2.3        loo_2.4.1            stringi_1.5.3       
 [71] dygraphs_1.1.1.6     boot_1.3-25          pkgbuild_1.2.0       shape_1.4.5          rlang_0.4.10        
 [76] pkgconfig_2.0.3      matrixStats_0.57.0   distributional_0.2.1 evaluate_0.14        lattice_0.20-41     
 [81] purrr_0.3.4          rstantools_2.1.1     htmlwidgets_1.5.3    processx_3.4.5       tidyselect_1.1.0    
 [86] plyr_1.8.6           magrittr_2.0.1       R6_2.5.0             generics_0.1.0       pillar_1.4.7        
 [91] withr_2.3.0          mgcv_1.8-33          xts_0.12.1           abind_1.4-5          tibble_3.0.4        
 [96] crayon_1.3.4         arrayhelpers_1.1-0   rmarkdown_2.6        grid_4.0.3           callr_3.5.1         
[101] forcats_0.5.0        threejs_0.3.3        digest_0.6.27        xtable_1.8-4         tidyr_1.1.2         
[106] httpuv_1.5.5         RcppParallel_5.0.2   stats4_4.0.3         munsell_0.5.0        shinyjs_2.0.0  
svats2k commented 3 years ago

@mjskay I had posted this issue in the brms repo and the issue has been root caused and I thought I will paste the response here.

"If you have set options(mc.cores = <more than 1>), posterior_predict will evaluate in parallel by default, unless you change the core argument. On windows, parallel execution is done via parallel::parLapply and I don't know how that function respects seeds, if at all. When executing the code in serial (with 1 core) the results are reproducible."

Once I set the mc.cores to 1, I no longer see the discrepancy between posterior_predict and add_predicted_draws.

mjskay commented 3 years ago

Thanks for posting that back --- that's good to know!