steve-the-bayesian / BOOM

A C++ library for Bayesian modeling, mainly through Markov chain Monte Carlo, but with a few other methods supported. BOOM = "Bayesian Object Oriented Modeling". It is also the sound your computer makes when it crashes.
GNU Lesser General Public License v2.1

Different results from BSTS models in Windows v/s Linux environments #55

Closed ankbaras closed 3 years ago

ankbaras commented 3 years ago


Hi Steve,

We are using BSTS models for a forecasting project. While the models work quite well on the local Windows systems we used to prototype the code, we noticed, on average, degraded performance when deploying them to a Linux environment (Ubuntu 20.04). To diagnose what might be going wrong, we ran a simple experiment with R's iris dataset. Attached are the results from varying the environment, the version of base R, and the version of the library itself.

[Image: table of results comparing environments, base R versions, and package versions]

Please find below the code used to arrive at the above results:

```r
library(bsts)
library(datasets)

train <- iris[1:100, 1:3]
test <- iris[101:150, 1:3]

state <- list()
state <- bsts::AddLocalLevel(state, train$Sepal.Length)
model <- bsts::bsts(Sepal.Length ~ ., state.specification = state,
                    niter = 1000, data = train, seed = 2020, ping = 0)

burn <- 100
preds <- bsts::predict.bsts(model, newdata = test, seed = 2020, burn = burn)
preds$mean

colMeans(model$coefficients[-(1:burn), ])
```

A deeper dive into the source code leads me to believe this behaviour might arise from the use of srand() to set the seed in seed_rng_from_R.cpp. In most cases the differences in results are not large, but since we are modeling on log(x) rather than x, they tend to be exaggerated when evaluating prediction accuracy on the actual data.

steve-the-bayesian commented 3 years ago

Hi Ankit. BOOM uses a modern random number generator from the C++ standard library. As far as I'm aware, that RNG guarantees the same seed will produce the same trained models, and the same predictions, as long as you stick to the same platform. Different platforms are free to implement parts of the C++ standard library as they see fit, which means you might get different streams of pseudo-random numbers on different platforms. If you are passing the 'seed' argument, then seed_rng_from_R should not be getting called. I will close this issue, but please feel free to re-open it if you think 'seed' is being ignored.