paul-buerkner / brms

brms R package for Bayesian generalized multivariate non-linear multilevel models using Stan
https://paul-buerkner.github.io/brms/
GNU General Public License v2.0
1.28k stars 184 forks source link

posterior_epred with splines results in different results on different machines #1386

Closed HaydenSchilling closed 1 year ago

HaydenSchilling commented 2 years ago

Hi,

I've come across a difference in results on two machines running the same version of brms (2.17.0) and I'm wondering if you have any insight. This model is needed to run the code (apolagises for the size ~450mb).

If i run the script below on two different machines the results from posterior_epred are very different and I'm not sure which one is correct.

library(brms)

f4 <- readRDS("Golden perch/Golden perch_ZINB_overall_boat.rds")

summary(f4) # same on both machines

new_data <- data.frame(
  days_since = seq(0, 10100, 50),
  Method = "Boat Electrofishing",
  Sampling_duration = 90) 

xx <- posterior_epred(f4, newdata = new_data, re_formula = NA) 
quantile(xx[,1], probs = c(0.025,0.5, 0.975)) # 0.0015 - 0.0271 on Machine 1, 0.0082 - 0.41 on Machine 2
quantile(xx[,203], probs = c(0.025,0.5, 0.975)) # 0.1249 - 0.729 on Machine 1, 0.218 - 1.57 on Machine 2

The session info for the two machines are: Machine 1

R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.utf8  LC_CTYPE=English_Australia.utf8   
[3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] brms_2.17.0  Rcpp_1.0.8.3

Machine 2

R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /apps/intel/Composer/compilers_and_libraries_2020.0.166/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] brms_2.17.0 Rcpp_1.0.5 

Thank you for any insight!

paul-buerkner commented 2 years ago

How does the brm() call of the model look like?

HaydenSchilling commented 2 years ago

The original model was:

b2 <- brm(Golden ~ s(days_since) + (1|SiteID) + (1|fDate) +
            offset(log(Sampling_duration)),
          cores=4,
          family="zero_inflated_negbinomial",
          data=mydata_wide,
          iter = 5000)

A similar thing occurs for family = "negbinomial" so I don't think it is an issue with the zero-inflation family.

paul-buerkner commented 2 years ago

I think it is a problem with the spline. In a project, collaborators of mine and I have had similar problems when analysing a model on local laptops that was originally fit in on a cluster. I have yet to find the origin of this problem, but I suspect it lies deep in some matrix algebra (perhaps in base R or deeper) that is somehow (I have no idea why) system dependent. I see that one of your machines is windows while the other one is linux. Is the latter machine a cluster?

Back then, we managed to solve the problem by using R 3.5 or 3.6 (I think) on the cluster instead of R 4.0.x, but we don't know why it helped.

HaydenSchilling commented 2 years ago

You are correct, the linux machine is a cluster which originally fit the model. I have been looking at the results on my laptop and the cluster and noticed differences.

Do you suggest I attempt to redo the models with R 3.6 on the cluster? Assuming if the results then agree when looked at on both the laptop and cluster it's OK?

paul-buerkner commented 2 years ago

If you have responsive cluster admins, you may also ask them if they have seen something like that before, where presumably, different linear algebra setups imply different results.

I think iwe got it to work with R 3.5 back then but perhaps also versions later then R 4.0 work. I can only suggest to try out different settings, sorry. If results agree on both machines I think it is safe to say that the results are trustworthy.

HaydenSchilling commented 2 years ago

Thank you, that sounds like a good way forward! I really appreciate the help. I'll talk to the cluster admins and if they have any insight post it here.

paul-buerkner commented 2 years ago

Could the following twitter thread provide the reason for the problem you encountered? https://twitter.com/dan_p_simpson/status/1571705560634105857

HaydenSchilling commented 2 years ago

Interesting idea - maybe. I tried running the suggested export commands on the HPC before opening R and re-running but the predictions did not change. (full disclosure - I don't really understand what exactly the change in MKL does)

paul-buerkner commented 2 years ago

thank you for testing it out! still my assumption is some problems in linear algebra operations. some people in this thread on Twitter reported that the same deterministic operation performed multiple times may give different results on specific awkward circumstances. to me it looks as if something similar may happen in the here discussed case. perhaps someone else has ideas what exactly is happening?

HaydenSchilling @.***> schrieb am Sa., 24. Sept. 2022, 04:32:

Interesting idea - maybe. I tried running the suggested export commands on the HPC before opening R and re-running but the predictions did not change. (full disclosure - I don't really understand what exactly the change in MKL does)

— Reply to this email directly, view it on GitHub https://github.com/paul-buerkner/brms/issues/1386#issuecomment-1256838186, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCW2AGJZHI2U6JPSQKMFEDV7ZR4RANCNFSM55NEPZJQ . You are receiving this because you commented.Message ID: @.***>

paul-buerkner commented 1 year ago

The cause of this problem has now been isolate in #1465 although we still need to figure out how to actually fix it reliably.

pmhogan commented 1 year ago

@HaydenSchilling -- this issue may be due to a mismatch between your BLAS/LAPACK library implementations. From your session info's, it appears that Machine 1 is (probably) using the default, R-internal version but Machine 2 is using Intel's Math Kernel Library implementation. In issue #1465 I describe a similar problem for spline models which I was able to resolve by ensuring that either the default, R-internal libraries, or the same version of external libraries, are used in both training and prediction environments 👍

paul-buerkner commented 1 year ago

I will actually close this issue here to make the new isse the new hub for dicussions around this problem and hopefully its eventual fix. you can of course continue writing here. I just want to make sure to keep the issue tracker clean and avoid duplication.