stan-dev / posteriordb

Database with posteriors of interest for Bayesian inference

Documentation and structure #79

Closed MansMeg closed 4 years ago

MansMeg commented 4 years ago

Here is the documentation after today's discussion. It would be great to get your comments, @avehtari and @paul-buerkner, especially regarding the gold standard definition.

codecov-io commented 4 years ago

Codecov Report

Merging #79 into master will decrease coverage by 5.05%. The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #79      +/-   ##
==========================================
- Coverage   99.31%   94.25%   -5.06%     
==========================================
  Files           5       17      +12     
  Lines         145      418     +273     
==========================================
+ Hits          144      394     +250     
- Misses          1       24      +23
Impacted Files Coverage Δ
python/src/posteriordb/posterior.py 100% <0%> (ø) ↑
python/src/posteriordb/posterior_database.py 100% <0%> (ø) ↑
python/src/posteriordb/model.py 100% <0%> (ø) ↑
python/src/posteriordb/__init__.py 100% <0%> (ø) ↑
rpackage/R/gold_standard.R 100% <0%> (ø)
rpackage/R/data_info.R 100% <0%> (ø)
rpackage/R/utils.R 88.88% <0%> (ø)
rpackage/R/posterior_fit.R 100% <0%> (ø)
rpackage/R/utils_tests.R 80% <0%> (ø)
rpackage/R/posterior.R 100% <0%> (ø)
... and 7 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 6b2adfa...6718709.

MansMeg commented 4 years ago

@paul-buerkner and @avehtari It would be great to get your comments on the documentation.

MansMeg commented 4 years ago

Thanks, Paul. Any thoughts on the gold standard definition?

paul-buerkner commented 4 years ago

I have looked at the gold standard doc again and have a few comments:

avehtari commented 4 years ago

> For what purpose do we have an upper bound on the effective sample size? Wouldn't it be sufficient to provide a lower bound?

We want almost independent draws. If the effective sample size is larger than the nominal sample size, there is dependency between draws, which makes further analysis more difficult.
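
The two-sided check Aki describes could be sketched as follows (a hypothetical helper with illustrative thresholds, not posteriordb's actual criteria):

```python
def ess_within_bounds(ess, n_draws, lower_frac=0.5, upper_frac=1.1):
    """Check that the effective sample size is neither too low
    (high autocorrelation) nor above the nominal number of draws
    (dependency between draws, e.g. antithetic correlation).

    The fractions here are illustrative, not posteriordb's thresholds.
    """
    return lower_frac * n_draws <= ess <= upper_frac * n_draws
```

For example, with 10,000 nominal draws, an ESS of 9,500 would pass, while an ESS of 12,000 would fail the upper bound even though it looks "better" at first glance.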

> In addition to divergent transitions, there should be no draws exceeding the maximum treedepth

No. Exceeding max treedepth doesn't invalidate the Markov chain; it just indicates potential performance issues. Even then, limiting max treedepth to lower values can give us improved ESS per log density evaluation.
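
The distinction above could be expressed as a simple triage (an illustrative sketch, not posteriordb code; the function and labels are hypothetical):

```python
def gold_standard_diagnostics(n_divergences, n_treedepth_hits):
    """Illustrative triage of HMC sampler diagnostics, following the
    discussion above: divergent transitions invalidate a gold standard,
    while max-treedepth hits only signal a performance concern."""
    if n_divergences > 0:
        # Divergences indicate biased exploration: reject the chain.
        return "reject"
    if n_treedepth_hits > 0:
        # Draws are still valid; ESS per log density evaluation may suffer.
        return "accept-with-warning"
    return "accept"
```

The key design point, per Aki's comment, is that only divergences invalidate the chain; treedepth saturation is a performance signal, not a correctness one.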

paul-buerkner commented 4 years ago

Ok thanks for the clarifications!

MansMeg commented 4 years ago

Great! I'm currently fixing the gold standards to conform to this, by including both chains when stan_sampling has been used and

Also: we would like to flag funnel posteriors and multimodal posteriors, and maybe even convex posteriors. Ideally we would detect these from the posterior samples rather than annotating them manually. My guess is that it should be possible to detect funnels from 10,000 samples? The same for multimodal posteriors? What do you say?

paul-buerkner commented 4 years ago

If we have no divergent transitions, then chances are we have captured the funnel, provided the chains ever came remotely close to the funnel.

For multimodality there is no guarantee that we capture all the modes; it may well be that we simply missed some. Ideally, we should work with multimodal distributions where we know how many modes there are, so we can check whether we found all of them. In many cases, though, multimodality indicates some form of model misspecification (for instance, if we forgot to identify mixture components).
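
One crude way to estimate a mode count from draws, in the spirit of the discussion, is to bin a 1-D marginal and count local maxima of the histogram. This is a heuristic sketch only (the function name and bin count are illustrative), and as noted above it can never prove that all modes of the posterior were found:

```python
def count_histogram_modes(draws, n_bins=50):
    """Crude 1-D mode count: bin the draws into a histogram and count
    strict local maxima. A heuristic only; it cannot detect modes the
    sampler never visited."""
    lo, hi = min(draws), max(draws)
    width = (hi - lo) / n_bins or 1.0  # guard against all-equal draws
    counts = [0] * n_bins
    for x in draws:
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    modes = 0
    for i in range(n_bins):
        left = counts[i - 1] if i > 0 else 0
        right = counts[i + 1] if i < n_bins - 1 else 0
        if counts[i] > left and counts[i] > right:
            modes += 1
    return modes
```

In practice one would smooth the histogram (or use a kernel density estimate) before counting peaks, since raw bin noise can create spurious local maxima.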

MansMeg commented 4 years ago

Alright, changes added. I'll merge later today if no one has any comments.

eerolinna commented 4 years ago

> We would like to flag funnel posteriors and multimodal posteriors, and maybe even convex posteriors. But the best would be to just flag this from the posterior samples rather than manually annotating them. My guess is that it should be possible to estimate funnels based on 10,000 samples? The same with multimodal posteriors? What do you say?

Could we use the multimodality/funnel information to help validate the gold standards? For example, suppose we know that some posterior should have 3 distinct modes, but the estimate says there are only two; this might mean the samples are not a valid gold standard (or it might mean the estimate is wrong). Let's say we have 100 manual annotations: if we can catch even 1 bad gold standard with them, it would probably be worth the effort.

Of course this requires that we actually know the true number/location of modes etc so I don't know how useful this would be in practice.
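
That cross-check could be sketched as follows (hypothetical helper and data shapes, not part of posteriordb):

```python
def flag_suspect_gold_standards(annotated_modes, estimated_modes):
    """Hypothetical cross-check: compare manually annotated mode counts
    against counts estimated from the gold-standard draws, and flag any
    mismatch for human review. A mismatch could mean the draws missed
    a mode, or that the estimate itself is wrong."""
    return [name for name, known in annotated_modes.items()
            if name in estimated_modes and estimated_modes[name] != known]
```

For example, if `annotated_modes = {"eight_schools": 1, "mixture": 3}` and the estimate found only 2 modes for `"mixture"`, the check flags `"mixture"` for review.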

MansMeg commented 4 years ago

No, we do not need this. I spoke with Aki, and he thought it would be good to have this functionality, although only as suggestions: we want humans to add all keywords manually.