stan-dev / posteriordb

Database with posteriors of interest for Bayesian inference

Add "virtual posteriors" #101

Open MansMeg opened 4 years ago

MansMeg commented 4 years ago

The following is not a posterior as such and should hence be handled a little differently.

https://github.com/stan-dev/stat_comp_benchmarks/tree/master/benchmarks/low_dim_corr_gauss

eerolinna commented 4 years ago

Is this issue essentially just a reminder, meaning you already know what needs to be done and don't need input from others? Or do we still need to think about how virtual posteriors should be handled?

If it is the latter, I would appreciate it if you could elaborate a bit on why low_dim_corr_gauss is not a posterior and how we might want to handle it, so we can get the discussion started.

MansMeg commented 4 years ago

Sure. The low_dim_corr_gauss model essentially just simulates draws from a distribution using dynamic HMC. There is no data, and hence, by definition, it is not a posterior, although it may still be useful for testing algorithms. @avehtari and I discussed that this could still be of interest, but it is not clear whether it should be in posteriordb.

eerolinna commented 4 years ago

Ah yes, so in this case the parameters of the distribution are given directly, and the Stan model only draws samples from that distribution rather than estimating a posterior distribution.

Here are some ideas I had; I'm open to alternatives too.

Approach 1

We could fit this use case into the current structure quite easily. We could change

transformed data {
  vector[2] mu;
  real sigma1;
  real sigma2;
  real rho;

  matrix[2,2] Sigma;

  mu[1] = 0.0;
  mu[2] = 3.0;

  rho = 0.5;
  sigma1 = 1.0;
  sigma2 = 2.0;

  Sigma[1][1] = sigma1 * sigma1;
  Sigma[1][2] = rho * sigma1 * sigma2;
  Sigma[2][1] = rho * sigma1 * sigma2;
  Sigma[2][2] = sigma2 * sigma2;
}

to

data {
  vector[2] mu;
  real sigma1;
  real sigma2;
  real rho;
}
transformed data {
  matrix[2,2] Sigma;

  Sigma[1][1] = sigma1 * sigma1;
  Sigma[1][2] = rho * sigma1 * sigma2;
  Sigma[2][1] = rho * sigma1 * sigma2;
  Sigma[2][2] = sigma2 * sigma2;
}

along with a data file

{
  "mu": [0.0, 3.0],
  "rho": 0.5,
  "sigma1": 1.0,
  "sigma2": 2.0
}

Sigma could also be moved to a data file if needed.
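If Sigma were moved into the data file, its entries could be computed once from the same values and written out alongside mu. A minimal Python sketch, assuming the parameter values from the data file above; the helper name `build_sigma` and the `"Sigma"` key are hypothetical and not part of posteriordb:

```python
import json


def build_sigma(sigma1, sigma2, rho):
    """Return the 2x2 covariance matrix as nested lists, row by row."""
    cov = rho * sigma1 * sigma2  # off-diagonal entry, same in both positions
    return [[sigma1 * sigma1, cov], [cov, sigma2 * sigma2]]


# Values copied from the proposed data file.
params = {"mu": [0.0, 3.0], "rho": 0.5, "sigma1": 1.0, "sigma2": 2.0}

# Hypothetical data file content with Sigma precomputed.
data = {
    "mu": params["mu"],
    "Sigma": build_sigma(params["sigma1"], params["sigma2"], params["rho"]),
}
data_json = json.dumps(data, indent=2)
```

With this variant the Stan model would declare Sigma in its data block and the transformed data block would no longer be needed.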

For the gold standard we could use the draws from HMC.

Approach 2

We could also make it so that a posterior does not always require a dataset. In this case providing just the model would be sufficient.
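One way to picture this is an entry whose dataset reference is optional. The sketch below is only an illustration in Python: the class name `PosteriorEntry` and its fields are assumptions chosen for this example, not the actual posteriordb schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PosteriorEntry:
    """Hypothetical posterior record where the dataset may be absent."""

    name: str
    model_name: str
    data_name: Optional[str] = None  # None means "model only, no dataset"

    def has_data(self) -> bool:
        return self.data_name is not None


# A data-free entry such as low_dim_corr_gauss would simply omit data_name.
entry = PosteriorEntry(name="low_dim_corr_gauss", model_name="low_dim_corr_gauss")
```

Any code that loads a dataset would then need to check `has_data()` first, which is the main cost of this approach.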

Approach 3

In both of these cases the virtual posterior is really just a distribution, and calling it a posterior might be misleading. It would be better to also have a concept of a standalone distribution.

In this case we might have a file distributions/low_dim_corr_gauss.json:

{
  "name": "low_dim_corr_gauss",
  "keywords": ["stan_benchmark"],
  "model_name": "low_dim_gauss",
  "gold_standard_name": "low_dim_corr_gauss",
  "params_name": "low_dim_corr_gauss"
}

where params_name would point to a parameter file

{
  "mu": [0.0, 3.0],
  "rho": 0.5,
  "sigma1": 1.0,
  "sigma2": 2.0
}

Alternatively, these parameters could be inlined with the model if needed. In this case model_name might be better called simulator_name.
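The two options for supplying parameters (a separate file referenced by params_name, or values inlined in the entry) could be handled by one small lookup. This is a rough Python sketch; the function `resolve_params`, the inline `"params"` key, and the in-memory `params_files` mapping are all hypothetical stand-ins for whatever storage layout is chosen.

```python
import json


def resolve_params(entry, params_files):
    """Return the parameter dict for a distribution entry.

    Inline parameters (hypothetical "params" key) take precedence;
    otherwise the file named by "params_name" is parsed.
    """
    if "params" in entry:
        return entry["params"]
    return json.loads(params_files[entry["params_name"]])


# Distribution entry as proposed in this thread.
entry = json.loads(
    """
    {
      "name": "low_dim_corr_gauss",
      "keywords": ["stan_benchmark"],
      "model_name": "low_dim_gauss",
      "gold_standard_name": "low_dim_corr_gauss",
      "params_name": "low_dim_corr_gauss"
    }
    """
)

# Stand-in for parameter files on disk, keyed by params_name.
params_files = {
    "low_dim_corr_gauss": '{"mu": [0.0, 3.0], "rho": 0.5, "sigma1": 1.0, "sigma2": 2.0}'
}

params = resolve_params(entry, params_files)
```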

We could have a separate pdb_distribution class in the API and make, for example, gold_standard_draws(x) a generic method that accepts either a posterior or a distribution.
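To make the idea of a generic method concrete, here is a sketch in Python rather than in the R API discussed here. The classes `Posterior` and `Distribution`, their fields, and the in-memory `store` are assumptions for illustration; the only point is that the lookup depends on gold_standard_name and not on the kind of object.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


class HasGoldStandard(Protocol):
    """Anything that names a gold standard, posterior or distribution."""

    gold_standard_name: str


@dataclass
class Posterior:
    name: str
    model_name: str
    data_name: str
    gold_standard_name: str


@dataclass
class Distribution:
    name: str
    model_name: str
    params_name: str
    gold_standard_name: str


Draws = Dict[str, List[float]]


def gold_standard_draws(x: HasGoldStandard, store: Dict[str, Draws]) -> Draws:
    """Look up stored draws by gold_standard_name for either kind of object."""
    try:
        return store[x.gold_standard_name]
    except KeyError:
        raise KeyError(f"no gold standard named {x.gold_standard_name!r}") from None


# Placeholder store: an empty dict stands in for real stored draws.
store: Dict[str, Draws] = {"low_dim_corr_gauss": {}}

dist = Distribution(
    name="low_dim_corr_gauss",
    model_name="low_dim_gauss",
    params_name="low_dim_corr_gauss",
    gold_standard_name="low_dim_corr_gauss",
)
draws = gold_standard_draws(dist, store)
```

Keeping the two classes separate preserves the distinction between posteriors and plain distributions, while shared functions only rely on the fields both have in common.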

MansMeg commented 4 years ago

I think your idea of a pdb_distribution is probably the best way forward. That way we avoid confusing posteriors with distributions. The model file could then just generate a distribution for a given model, without data.