stan-dev / posteriordb

Database with posteriors of interest for Bayesian inference

Add expectations and variances in gold standard #108

Open MansMeg opened 4 years ago

MansMeg commented 4 years ago

We may be interested in gold-standard expectations without the actual gold-standard draws. Hence we should add them as a separate slot that can be accessed.

This should include: means, variances, and covariances for all parameters, based on 100 000 draws. Hence a new gold-standard slot should be created.

eerolinna commented 4 years ago

Is the idea here that we can get more accurate expectations by using more samples? So we might have a gold standard with 10 000 draws but expectations computed with 100 000 draws?

Or is it more that for some posteriors we might have a way of computing accurate expectations but not accurate draws? I guess simulated posteriors also fall under this case as we can have ground truth expectations while the expectations computed from draws are only estimates.
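
For the simulated case, here is a tiny sketch (a hypothetical conjugate normal-normal model, not anything from posteriordb) where the ground-truth posterior mean is available in closed form while any finite set of draws only estimates it:

```python
import numpy as np

# Hypothetical conjugate model: prior mu ~ N(0, 1), data y_i ~ N(mu, 1).
rng = np.random.default_rng(seed=1)
y = rng.normal(loc=0.5, scale=1.0, size=50)

# Exact posterior: mu | y ~ N(sum(y) / (n + 1), 1 / (n + 1)),
# so the ground-truth expectation is known exactly.
n = len(y)
exact_mean = y.sum() / (n + 1)

# A finite set of posterior draws only estimates that expectation.
draws = rng.normal(exact_mean, np.sqrt(1 / (n + 1)), size=10_000)
print(exact_mean, draws.mean())  # they differ by Monte Carlo error
```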

MansMeg commented 4 years ago

Yes, that's the reason. We could have much more accurate estimates of the expectations and the covariance.

Good point. We should also add how the expectations were calculated.

eerolinna commented 4 years ago

So `posteriors/8_schools.json` might be something like this?


```json
{
  "name": "eight_schools-eight_schools_noncentered",
  "keywords": ["stan_benchmark"],
  "model_name": "eight_schools_noncentered",
  "reference_draws_name": "eight_schools-eight_schools_noncentered",
  "reference_expectations_name": "something",
  "data_name": "eight_schools",
  "dimensions": {"theta": 8, "mu": 1, "tau": 1},
  "added_by": "Mans Magnusson",
  "added_date": "2019-08-12"
}
```

(I'm using reference instead of gold standard here)

I guess the `"something"` in `"reference_expectations_name": "something"` could also be `eight_schools-eight_schools_noncentered` if we have separate folders for reference draw files and reference expectation files.
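
If so, the shared name could resolve against two parallel folders. A minimal sketch of loading by shared name (the folder names here are hypothetical, not an agreed layout):

```python
import json
from pathlib import Path

# Hypothetical layout (names assumed, not confirmed in this thread):
#   reference_draws/eight_schools-eight_schools_noncentered.json
#   reference_expectations/eight_schools-eight_schools_noncentered.json
def load_reference(kind: str, name: str, root: Path = Path(".")) -> dict:
    """Load reference draws or reference expectations by their shared name."""
    return json.loads((root / kind / f"{name}.json").read_text())

draws = load_reference("reference_draws", "eight_schools-eight_schools_noncentered")
expectations = load_reference("reference_expectations", "eight_schools-eight_schools_noncentered")
```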

MansMeg commented 4 years ago

Exactly!

eerolinna commented 4 years ago

Here are some things that popped into my mind.

1

Let's say we have a posterior where the expectations computed with 100 000 draws are more accurate than the ones computed from a sample of 10 000 draws. Do we even want to expose the smaller sample?

One could argue that if the expectations computed from the smaller set of draws are less accurate, then those draws do not represent the posterior well and should never be used.

1.5

Let's continue the previous case. We know that 100 000 draws gave a better result. Should we also try 1 000 000 draws to see if that gives an even better result?

2

Let's say we have a small posterior where storing many draws (say 100 000) takes less space than storing few draws (say 10 000) of a "normal-sized" posterior. For the small posterior, the larger set of draws gives a more accurate estimate than the smaller one. Should we just store the larger set (so 100 000 draws) in this case?

In other words, do we want to have a fixed number of draws in the first place?

3

How can we recognize that one estimate is more accurate than another? Sure, a larger sample is more likely to give a more accurate estimate, but it is still possible that the smaller sample sometimes gives a better one. Or can we consider the chance of this small enough to be ignored?

MansMeg commented 4 years ago

1) There are different use cases. If you only want to check that you get the expectations right, then the larger sample is better. Storing expectations from 100 000 draws is also less costly (the cost only depends on the dimension). Others may still want draws from the posterior, for example to compute `log_lik` values for a subset of observations. @avehtari is writing down use cases now.

1.5) The more draws, the better. We need to set the bar for reference draws somewhere, since there is a computational cost, especially for larger models.

2) Yes, to keep it simple and straightforward.

3) Using the MC error.
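
To make point 3 concrete, here is a minimal sketch of comparing two estimates via the Monte Carlo standard error. It assumes roughly independent draws; for correlated MCMC output one would divide by the effective sample size rather than the raw draw count:

```python
import numpy as np

def mcse_mean(draws: np.ndarray) -> float:
    """Monte Carlo standard error of the sample mean, assuming independent
    draws; with MCMC output, use the ESS in place of len(draws)."""
    return draws.std(ddof=1) / np.sqrt(len(draws))

rng = np.random.default_rng(seed=2)
small = rng.standard_normal(10_000)
large = rng.standard_normal(100_000)
# The larger sample's mean has a roughly sqrt(10)x smaller MCSE, which is
# how one estimate can be judged more accurate than another.
print(mcse_mean(small), mcse_mean(large))
```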

eerolinna commented 4 years ago

I probably explained 1. a bit poorly. What I mean is that

As for 2, perhaps we should have a simple and straightforward guideline (10 000 draws are preferred) that we can deviate from if there are good reasons to do so. Maybe this is what you had in mind as well?

MansMeg commented 4 years ago

1. The difference will (mostly) only be a difference in MC error. And with 10 000 samples the MC error is still very small.
2. Yes.

eerolinna commented 4 years ago

This is what I'm essentially hearing: 10 000 samples have a small error compared to 100 000, and thus it's fine to use the smaller sample to compute `log_lik` etc. Yet 10 000 samples have too large an error to compute expectations, so we need to use the larger sample for that.

This sounds like a contradiction.

MansMeg commented 4 years ago

10 000 samples are good enough in most situations; 100 000 would be better but would take up 10x the space. There is no on/off here. Computing the expectations and covariance from 100 000 draws gives a slightly better estimate with no additional storage cost. Using 1 000 000 would be even better, but we need to draw the line somewhere.
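
As a back-of-the-envelope illustration of the storage point (assuming a hypothetical 10-dimensional posterior stored as 8-byte floats), the draws grow linearly with their number while the summary does not:

```python
# Hypothetical 10-dimensional posterior stored as 8-byte floats.
n_draws, dim = 100_000, 10
draws_bytes = n_draws * dim * 8        # raw draws: 8 000 000 bytes (~8 MB)
summary_bytes = (dim + dim * dim) * 8  # means + covariance: 880 bytes
print(draws_bytes, summary_bytes)
```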

MansMeg commented 4 years ago

Expectations and covariances, and the MCSE for the expectations and covariances.
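
A minimal sketch of what computing that summary could look like (the field names and the independent-draws MCSE are assumptions, not a settled posteriordb schema; the MCSE for the covariance entries is omitted for brevity):

```python
import json
import numpy as np

def reference_expectations(draws: np.ndarray) -> dict:
    """Summarize an (n_draws, dim) array of gold-standard draws.
    Field names are assumptions, not an agreed file format."""
    n = draws.shape[0]
    return {
        "mean": draws.mean(axis=0).tolist(),
        "cov": np.cov(draws, rowvar=False).tolist(),
        "mcse_mean": (draws.std(axis=0, ddof=1) / np.sqrt(n)).tolist(),
    }

rng = np.random.default_rng(seed=3)
summary = reference_expectations(rng.standard_normal((100_000, 3)))
print(json.dumps(summary)[:80])
```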