stan-dev / posteriordb

Database with posteriors of interest for Bayesian inference

posteriordb: a database of Bayesian posterior inference

What is posteriordb?

posteriordb is a set of posteriors, i.e. Bayesian statistical models and data sets, reference implementations in probabilistic programming languages, and reference posterior inferences in the form of posterior samples.

Why use posteriordb?

posteriordb is designed to test inference algorithms across a wide range of models and data sets. Applications include testing for accuracy, speed, and scalability. posteriordb can be used to test new algorithms under development, or it can be deployed as part of continuous integration for ongoing regression testing of algorithms in probabilistic programming frameworks.

posteriordb also makes it easy for students and instructors to access various pedagogical and real-world examples with precise model definitions, well-curated data sets, and reference posteriors.

posteriordb is framework agnostic and easily accessible from R and Python.

For more details regarding the use cases of posteriordb, see doc/use_cases.md.
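
As a rough illustration of the accuracy-testing use case above, the sketch below compares draws from a candidate algorithm against reference draws for a single scalar parameter. It is not part of posteriordb: the function name `mean_within_tolerance`, the threshold `z`, and the simulated draws are assumptions made for this example, and the check treats the draws as approximately independent.

```python
import math
import random
import statistics


def mean_within_tolerance(candidate, reference, z=4.0):
    """Check whether two sets of draws agree on the posterior mean.

    Returns True if the difference in means is at most z combined
    standard errors. Assumes roughly independent draws of one scalar.
    """
    se_candidate = statistics.stdev(candidate) / math.sqrt(len(candidate))
    se_reference = statistics.stdev(reference) / math.sqrt(len(reference))
    combined_se = math.hypot(se_candidate, se_reference)
    difference = abs(statistics.fmean(candidate) - statistics.fmean(reference))
    return difference <= z * combined_se


# Simulated stand-ins for reference draws and for two candidate algorithms.
rng = random.Random(0)
reference_draws = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
good_draws = [rng.gauss(0.0, 1.0) for _ in range(2_000)]
biased_draws = [rng.gauss(0.5, 1.0) for _ in range(2_000)]

print(mean_within_tolerance(good_draws, reference_draws))
print(mean_within_tolerance(biased_draws, reference_draws))
```

In a continuous-integration setting, a check of this kind could run once per posterior and parameter. Draws from MCMC are usually autocorrelated, so a real test would replace the naive standard error with one based on the effective sample size.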

Content

See DATABASE_CONTENT.md for the detailed content of the posterior database.

Contributing

We welcome any help in adding posteriors, data, and models to the database! See CONTRIBUTING.md for details on how to contribute.

Using posteriordb

To simplify the use of posteriordb, there are convenience functions both in R and in Python.

Citing posteriordb

Developing and maintaining open-source software is an important yet often underappreciated contribution to scientific progress. Thus, please make sure to cite it appropriately so that developers get credit for their work. Information on how to cite posteriordb can be found in the CITATION.cff file. Use the “cite this repository” button under “About” to get a simple BibTeX or APA snippet.

As posteriordb relies heavily on Stan, please consider also citing Stan:

Carpenter B., Gelman A., Hoffman M. D., Lee D., Goodrich B., Betancourt M., Brubaker M., Guo J., Li P., and Riddell A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software. 76(1). 10.18637/jss.v076.i01

Design choices (so far)

The main focus of the database is simplicity, both in understanding and in use.

The following are the current design choices for the posterior database.

  1. Priors are hardcoded in model files as changing the prior changes the posterior. Create a new model to test different priors.
  2. Data transformations are stored as different datasets. Create new data to test different data transformations, subsets, and variable settings. This design choice makes the database larger/less memory efficient but simplifies the analysis of individual posteriors.
  3. Models and data have (model/data).info.json files with model- and data-specific information.
  4. Templates for different JSONs can be found in content/templates and schemas in schemas. (Note: these don’t exist right now and will be added later.)
  5. The prefix ‘syn_’ marks synthetic data, for which the generative process is known and can be found in content/data-raw.
  6. All data preprocessing is included in content/data-raw.
  7. Specific information for different PPL representations of models is included in the PPL syntax files as comments, not in the model.info.json files.
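
The naming conventions in the list above can be sketched in a few lines of Python. This is only an illustration: the helper names `is_synthetic` and `info_filename` and the example names are hypothetical, and the assumption that an info file is named after its model or data set with an `.info.json` suffix is a reading of item 3, not a documented API.

```python
def is_synthetic(data_name):
    """Return True if a data set name carries the 'syn_' prefix (item 5)."""
    return data_name.startswith("syn_")


def info_filename(name):
    """Build the assumed info file name for a model or data set (item 3)."""
    return f"{name}.info.json"


# Hypothetical names, used only to exercise the helpers.
print(is_synthetic("syn_example_data"))
print(is_synthetic("example_data"))
print(info_filename("example_model"))
```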

Versioning of models

We might update models included in posteriordb over time. However, the models will only have the same name in posteriordb if the log density is the same (up to a normalizing constant). Otherwise, we will include a new model in the database.
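
To make "the same up to a normalizing constant" concrete, the sketch below checks whether two log-density functions differ by the same offset at a handful of test points. The function names and the test points are assumptions made for this example; agreement at finitely many points is a necessary condition only, not a proof that two models define the same posterior.

```python
import math


def same_up_to_constant(logp_a, logp_b, points, tol=1e-9):
    """Return True if logp_a - logp_b is (numerically) constant over points."""
    offsets = [logp_a(x) - logp_b(x) for x in points]
    return max(offsets) - min(offsets) <= tol


def normal_logpdf(x):
    """Normalized standard normal log density."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)


def normal_logkernel(x):
    """Standard normal log density with the normalizing constant dropped."""
    return -0.5 * x * x


def wider_normal_logkernel(x):
    """Unnormalized log density of a normal with standard deviation 2."""
    return -0.5 * (x / 2.0) ** 2


points = [-2.0, -0.5, 0.0, 1.0, 3.0]
print(same_up_to_constant(normal_logpdf, normal_logkernel, points))        # True
print(same_up_to_constant(normal_logpdf, wider_normal_logkernel, points))  # False
```

Under the versioning rule above, the first pair could share a model name, while the second pair would need two distinct names.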