bob-carpenter opened this issue 5 years ago
@bob-carpenter Thanks for your interest in mcmcr.
I will add more documentation as to what the columns return.
FWIW the `lower` and `upper` are quantiles with the confidence level set by the user - by default they are the 2.5% and 97.5% quantiles.
```r
coef(object, conf_level = 0.95, estimate = stats::median, ...)
```
The `estimate` is defined by the user and by default is the median, although the user could of course define it to be the mean or mode.
The zscore is sd/mean.
As you surmised the pvalue is not the standard normal pdf for the zscore. In this context it's the smallest credible interval that includes 0. I'm considering replacing it with the equivalent `svalue`.
I find it useful as a quick and dirty model selection tool for nuisance parameters.
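For concreteness, here is a minimal sketch of how such a row could be assembled from a vector of MCMC draws for a single term. This is purely illustrative and not the mcmcr implementation; the zscore and pvalue columns are discussed further below.

```r
# illustrative only; not the mcmcr source code
coef_row <- function(x, conf_level = 0.95, estimate = stats::median) {
  alpha <- (1 - conf_level) / 2
  bounds <- stats::quantile(x, probs = c(alpha, 1 - alpha))
  data.frame(
    estimate = estimate(x),        # median by default, but user-definable
    sd       = stats::sd(x),
    lower    = unname(bounds[1]),  # 2.5% quantile by default
    upper    = unname(bounds[2])   # 97.5% quantile by default
  )
}

coef_row(rnorm(1000, mean = 2))
```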
I understand that information is lost through the use of `lower` and `upper` rather than `lower95` and `upper95`, and `estimate` rather than `median`. However this is intentional. I want my code to tabularize or plot estimates to be independent of the method used to construct the credible intervals or estimate. Also I don't want my coefficient tables to have an overwhelming number of columns.
I'm a big fan of (R)Stan and the team's work, including your own.
The estimate is defined by the user and by default is the median although the user could of course define it to be the mean or mode.
We're struggling with mean vs. median in reporting.
The zscore is sd/mean.
Why print this?
As you surmised the pvalue is not the standard normal pdf for the zscore. In this context it's the smallest credible interval that includes 0. I'm considering replacing it with the equivalent `svalue`. I find it useful as a quick and dirty model selection tool for nuisance parameters.
How can there be a smallest credible interval that contains 0? If 0 is in the posterior support and it's continuous, there's no smallest posterior interval containing 0.
I understand that information is lost through the use of lower and upper rather than lower95 and upper95 and estimate rather than median. However this is intentional. I want my code to tabularize or plot estimates to be independent of the method used to construct the credible intervals or estimate.
Any 95% interval should be the same in terms of coverage if the model's well calibrated. Which one has highest density, etc., will depend on parameterization. I would think for anything other than a symmetric central interval, it'd be important to indicate what's being used, as readers would otherwise have to guess.
Also I don't want my coefficient tables to have an overwhelming number of columns.
We're struggling with just this issue.
We're struggling with mean vs. median in reporting.
I prefer the median due to its insensitivity to skew in the posterior distribution and because choosing the 50% quantile is logically similar to using the 2.5% and 97.5% quantiles for the lower and upper bounds. I used to use the mean by default until I encountered a posterior probability distribution that was so skewed the mean was greater than the 97.5% quantile! Now obviously this model had serious problems but it was nonetheless enlightening.
The zscore is sd/mean.
Why print this?
So the user can get the mean if they are so inclined (as the sd is provided), plus it's provided in a form that may have some meaning to users who are more familiar with maximum likelihood based methods.
Also I should add that I use the same coefficients table for both Bayesian and Maximum-Likelihood based methods and the ML users typically want the zscore. This is another reason why I use lower instead of lower95.
How can there be a smallest credible interval that contains 0? If 0 is in the posterior support and it's continuous, there's no smallest posterior interval containing 0.
We may be talking at cross purposes here but to my mind the interval 0 - 1 includes 0.
On Jul 23, 2019, at 12:15 PM, Joe Thorley notifications@github.com wrote:
We're struggling with mean vs. median in reporting.
I prefer the median due to its insensitivity to skew in the posterior distribution and because choosing the 50% quantile is logically similar to using the 2.5% and 97.5% quantiles for the lower and upper bounds. I used to use the mean by default until I encountered a posterior probability distribution that was so skewed the mean was greater than the 97.5% quantile! Now obviously this model had serious problems but it was nonetheless enlightening.
I like to think of this more decision-theoretically. The posterior median is the point estimate that minimizes expected absolute error, whereas the posterior mean minimizes expected squared error (just like an average minimizes the sum of squared distances to all points). The MLE doesn't have a simple decision-theoretic interpretation that I know.
In some cases, we need the posterior mean. For example, if I want to compute an event probability, it's defined as the posterior mean of an indicator function for the event, e.g.,
Pr[theta > 0 | y] = INTEGRAL I[theta > 0] * p(theta | y) d.theta
which is the posterior expectation of the indicator I[theta > 0] under p(theta | y). This is why I've been pushing back against those in the Stan project who want to go all in on medians.
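A quick numerical illustration (my own sketch; `theta_draws` is a hypothetical vector of posterior draws for theta): the event probability is just the posterior mean of the indicator, i.e. the proportion of draws satisfying the event.

```r
# Pr[theta > 0 | y] estimated as the posterior mean of the indicator
theta_draws <- rnorm(4000, mean = 0.3, sd = 1)  # stand-in for real MCMC draws
mean(theta_draws > 0)                           # proportion of draws above zero
```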
Medians are also tricky for other discrete parameters.
The zscore is sd/mean.
Why print this?
So the user can get the mean if they are so inclined (as the sd is provided), plus it's provided in a form that may have some meaning to users who are more familiar with maximum likelihood based methods.
What's this number mean to an MLE method? I went from an ML world where nobody quantified uncertainty to a Bayesian world, so I don't know what a frequentist might do with sd/mean, which is why I was asking. A reference is fine---I don't need a long explanation.
Also I should add that I use the same coefficients table for both Bayesian and Maximum-Likelihood based methods and the ML users typically want the zscore. This is another reason why I use lower instead of lower95.
How can there be a smallest credible interval that contains 0? If 0 is in the posterior support and it's continuous, there's no smallest posterior interval containing 0.
We may be talking at cross purposes here but to my mind the interval 0 - 1 includes 0.
The interval [0, 1] = { x : 0 <= x <= 1 } contains 0, but (0, 1) = { x : 0 < x < 1 } does not. My point was just that if [0, 1] is an interval that contains 0, there's no smallest such interval because [0, epsilon] contains 0 for epsilon = 1, 1/2, 1/4, 1/8, ... There's no smallest such epsilon.
I like to think of this more decision-theoretically. The posterior median is the point estimate that minimizes expected absolute error, whereas the posterior mean minimizes expected squared error (just like an average minimizes the sum of squared distances to all points). The MLE doesn't have a simple decision-theoretic interpretation that I know.
Fair enough however I also prefer the median over the mean due to its insensitivity to transformation of the posterior probability distribution. This is a more important property to me than the decision-theoretic interpretation.
In some cases, we need the posterior mean. For example, if I want to compute an event probability, it's defined as the posterior mean of an indicator function for the event, e.g.,
Pr[theta > 0 | y] = INTEGRAL I[theta > 0] * p(theta | y) d.theta
which is the posterior expectation of the indicator I[theta > 0] under p(theta | y). This is why I've been pushing back against those in the Stan project who want to go all in on medians.
I agree that in this instance the mean is required - hence the z-score (see below). However, in my experience such situations are relatively rare and therefore do not justify exclusive use of the mean. Also I think it's important to distinguish between primary parameters that are being estimated and derived parameters that can be determined from the posterior probability distributions for the primary parameters.
https://github.com/poissonconsulting/mcmcderive
I suspect event probabilities are always derived parameters but would be interested to hear of a counter example.
Medians are also tricky for other discrete parameters.
I suspect the same arguments as above apply.
What's this number mean to an MLE method? I went from an ML world where nobody quantified uncertainty to a Bayesian world, so I don't know what a frequentist might do with sd/mean, which is why I was asking. A reference is fine---I don't need a long explanation.
Sorry, it's mean/sd (not sd/mean as I erroneously stated above)! It was what I came up with (no reference) as a Bayesian equivalent to the frequentist z-score (or z-statistic). I know that the mode/sd might be closer to the frequentist form, however mean/sd has the advantage of allowing the user to quickly extract the mean estimate given the sd.
A reference for the z-score/z-statistic is of course
Millar, R.B. 2011. Maximum likelihood estimation and inference: with examples in R, SAS, and ADMB. Wiley, Chichester, West Sussex.
The interval [0, 1] = { x : 0 <= x <= 1 } contains 0, but (0, 1) = { x : 0 < x < 1 } does not.
True - I was being imprecise.
My point was just that if [0, 1] is an interval that contains 0, there's no smallest such interval because [0, epsilon] contains 0 for epsilon = 1, 1/2, 1/4, 1/8, ... There's no smallest such epsilon.
Point taken. I should have said something like the smallest percentile-based credible interval that includes 0.
On Jul 23, 2019, at 4:47 PM, Joe Thorley notifications@github.com wrote:
I like to think of this more decision-theoretically. The posterior median is the point estimate that minimizes expected absolute error, whereas the posterior mean minimizes expected squared error (just like an average minimizes the sum of squared distances to all points). The MLE doesn't have a simple decision-theoretic interpretation that I know.
Fair enough however I also prefer the median over the mean due to its insensitivity to transformation of the posterior probability distribution.
It's not insensitive to transformation. If I have y = (1, 2, 3), median(y) = 2, median(y^2) = 4, and median(y + 1) = 3.
The bigger point is that depending on the decision, the scale matters. So it's a good thing it's not insensitive to transformation.
In some cases, we need the posterior mean. For example, if I want to compute an event probability, ...
I agree that in this instance the mean is required - hence the z-score (see below). However, in my experience such situations are relatively rare and therefore do not justify exclusive use of the mean.
I didn't mean to urge exclusive use of the mean---I was suggesting that which one you want will depend on the decision-theoretic (or summarization) task at hand.
Also I think it's important to distinguish between primary parameters that are being estimated and derived parameters that can be determined from the posterior probability distributions for the primary parameters.
Why? Often the quantities of interest are derived. For example, I may build a model where the overall risk score for a disease is a derived parameter. In Stan, we have probabilistically derived predictive parameters (in the generated quantities block) that may depend on parameters plus some randomness, but do not affect estimation.
Also, we do a lot of reparameterization, like non-centering multilevel priors or time-series models, the result of which is that the parameter becomes alpha_std = (alpha - mu) / sigma, when we really want to know alpha in our application (it might, say, indicate student ability in an educational testing model, whereas alpha_std is just a kind of z-score).
https://github.com/poissonconsulting/mcmcderive
I suspect event probabilities are always derived parameters but would be interested to hear of a counter example.
They're usually derived. But if you have theta ~ beta(5, 2) and y ~ bernoulli(theta), then Pr[y = 1] = theta.
Medians are also tricky for other discrete parameters.
I suspect the same arguments as above apply.
I'm thinking of discrete parameters like cut points in a change-point model or responsibilities in a mixture model or latent counts in a missing-discrete-data model. All of these look like parameters in general. They'll be derived in expectation in Stan because the parameters are marginalized out, but could also be coded inefficiently as parameters in JAGS or PyMC3.
What's this number mean to an MLE method? I went from an ML world where nobody quantified uncertainty to a Bayesian world, so I don't know what a frequentist might do with sd/mean, which is why I was asking. A reference is fine---I don't need a long explanation.
Sorry, it's mean/sd (not sd/mean as I erroneously stated above)! It was what I came up with (no reference) as a Bayesian equivalent to the frequentist z-score (or z-statistic). I know that the mode/sd might be closer to the frequentist form, however mean/sd has the advantage of allowing the user to quickly extract the mean estimate given the sd.
A reference for the z-score/z-statistic is of course
Millar, R.B. 2011. Maximum likelihood estimation and inference: with examples in R, SAS, and ADMB. Wiley, Chichester, West Sussex.
I'm not quite curious enough to buy a book. A quick web search revealed something called the "coefficient of variation":
https://en.wikipedia.org/wiki/Coefficient_of_variation
which it describes as a "standardized measure of dispersion". I'm more surprised that any one person would be familiar with R, SAS, and ADMB.
The interval [0, 1] = { x : 0 <= x <= 1 } contains 0, but (0, 1) = { x : 0 < x < 1 } does not.
True - I was being imprecise.
My point was just that if [0, 1] is an interval that contains 0, there's no smallest such interval because [0, epsilon] contains 0 for epsilon = 1, 1/2, 1/4, 1/8, ... There's no smallest such epsilon.
Point taken. I should have said something like the smallest percentile-based credible interval that includes 0.
Is there a worked example somewhere where I don't have to read R code?
It's not insensitive to transformation. If I have y = (1, 2, 3), median(y) = 2, median(y^2) = 4, and median(y + 1) = 3.
OK I'm being imprecise again. What I mean is that median(log(x)) = log(median(x)) but this is not the case with the mean. I think this property is highly desirable.
```r
> median(log(1:100))
[1] 3.921924
> log(median(1:100))
[1] 3.921973
> mean(log(1:100))
[1] 3.637394
> log(mean(1:100))
[1] 3.921973
```
I didn't mean to urge exclusive use of the mean---I was suggesting that which one you want will depend on the decision-theoretic (or summarization) task at hand.
I agree - I just find that in general the median seems to be the estimate that I want.
Why? Often the quantities of interest are derived. For example, I may build a model where the overall risk score for a disease is a derived parameter. In Stan, we have probabilistically derived predictive parameters (in the generated quantities block) that may depend on parameters plus some randomness, but do not affect estimation.
Because they slow model fitting and make models harder to understand. Also after fitting a model that takes a long time to converge the user might think of a derived parameter they want that they forgot to include in the model. Finally, one might want to combine parameters from separate models.
They're usually derived. But if you have theta ~ beta(5, 2) and y ~ bernoulli(theta), then Pr[y = 1] = theta.
Although in this case doesn't the median (or mean) on theta give you what you want with the advantage of informative credible intervals (without having to get involved with y)?
I'm thinking of discrete parameters like cut points in a change-point model or responsibilities in a mixture model or latent counts in a missing-discrete-data model. All of these look like parameters in general. They'll be derived in expectation in Stan because the parameters are marginalized out, but could also be coded inefficiently as parameters in JAGS or PyMC3.
These are good examples with which I am familiar. However, my depth of understanding of STAN is insufficient to fully understand your statement "They'll be derived in expectation in Stan because the parameters are marginalized out, but could also be coded inefficiently as parameters in JAGS or PyMC3."
I'm not quite curious enough to buy a book. A quick web search revealed something called the "coefficient of variation":
The coefficient of variation is the inverse of the z-score (also known as the z-statistic or standard score) evaluated at x = 0: the z-score at 0 is (0 - mean)/sd = -mean/sd, so its magnitude is the reciprocal of sd/mean.
https://en.wikipedia.org/wiki/Standard_score
which it describes as a "standardized measure of dispersion". I'm more surprised that any one person would be familiar with R, SAS, and ADMB.
I don't use SAS, but ADMB is close to being the frequentist equivalent of STAN (it uses C++ to define models), though it also does MCMC sampling. I prefer the newer TMB, which is an R package. I have written prototypes of a family of packages that use the same functions to interface with STAN, JAGS and TMB. We use them all the time in our work.
https://github.com/poissonconsulting/mbr
You might also be interested in the following package (I've never used it)
https://cran.r-project.org/web/packages/tmbstan/index.html
Is there a worked example somewhere where I don't have to read R code?
No but the R function is relatively straightforward (where x is a vector of the MCMC samples for a single term)
```r
pvalue <- function(x) {
  n <- length(x)
  d <- sum(x >= 0)          # draws on the non-negative side of zero
  p <- min(d, n - d) * 2    # two-sided tail count
  p <- max(p, 1)            # never report exactly zero
  round(p / n, 4)
}
```
See https://github.com/poissonconsulting/mcmcr/blob/master/R/utils.R
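A quick usage sketch (simulated draws rather than real MCMC output; exact values depend on the seed):

```r
set.seed(42)
pvalue(rnorm(1000, mean = 2))  # roughly 0.05: few draws fall below zero
pvalue(rnorm(1000, mean = 0))  # close to 1: draws split roughly evenly around zero
```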
On Jul 23, 2019, at 8:30 PM, Joe Thorley notifications@github.com wrote:
It's not insensitive to transformation. If I have y = (1, 2, 3), median(y) = 2, median(y^2) = 4, and median(y + 1) = 3.
OK I'm being imprecise again. What I mean is that median(log(x)) = log(median(x)) but this is not the case with the mean. I think this property is highly desirable.
That's only strictly true if size(x) is odd. There's a lot of latitude in how medians are calculated if there are even numbers of entries. Here's the default in R:
```r
median(log(c(3, 7, 110, 111)))  # 3.3
log(median(c(3, 7, 110, 111)))  # 4.1
```
Because they slow model fitting and make models harder to understand. Also after fitting a model that takes a long time to converge the user might think of a derived parameter they want that they forgot to include in the model. Finally, one might want to combine parameters from separate models
Purely derived quantities don't slow fitting unless you have a very simple model and a lot of derived quantities. We don't need to differentiate through them and they only get executed once per iteration (unlike the log density and gradient, which are evaluated at every leapfrog step during the Hamiltonian simulation).
Some generated quantities are actually used in the model, like the centered version of non-centered parameters.
To try to deal with the other issues (complex models, deciding to do other posterior analysis later), we just added a feature in 2.19 that lets you take the parameter draws from one run and then rerun them to recompute generated quantities. It's in CmdStan now and should be out soon in CmdStanPy and maybe even in RStan or PyStan.
They're usually derived. But if you have theta ~ beta(5, 2) and y ~ bernoulli(theta), then Pr[y = 1] = theta.
Although in this case doesn't the median (or mean) on theta give you what you want with the advantage of informative credible intervals (without having to get involved with y)?
The posterior mean of theta is Pr[y = 1]. The median is different because it's an asymmetric beta distribution.
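A quick numerical check of the distinction (my own sketch, using simulated draws):

```r
draws <- rbeta(1e5, 5, 2)
mean(draws)    # ~ 0.714 = 5 / (5 + 2); this is Pr[y = 1]
median(draws)  # ~ 0.736; different because beta(5, 2) is skewed
```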
I'm thinking of discrete parameters like cut points in a change-point model or responsibilities in a mixture model or latent counts in a missing-discrete-data model. All of these look like parameters in general. They'll be derived in expectation in Stan because the parameters are marginalized out, but could also be coded inefficiently as parameters in JAGS or PyMC3.
These are good examples with which I am familiar. However, my depth of understanding of STAN is insufficient to fully understand your statement "They'll be derived in expectation in Stan because the parameters are marginalized out, but could also be coded inefficiently as parameters in JAGS or PyMC3."
There's a nice paper by Cole Monnahan where he shows how to do the marginalizations in JAGS, too, where it greatly speeds up JAGS models. Take a joint posterior p(theta, z | y) with continuous parameters theta and discrete parameters z. Then marginalize out z to derive p(theta | y). Then sample theta from the posterior and calculate E[z | theta, y] just as in the EM algorithm, then average those over the draws to get the posterior expectation E[z | y]. With that, you have much more accurate estimates than you'd get with discrete sampling (the technical details follow from the Rao-Blackwell theorem and the whole process I'm describing is often called Rao-Blackwellization).
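To make that recipe concrete, here is a small sketch (mine, not from the paper) for a two-component normal mixture: compute Pr[z_i = 1 | theta, y_i] for each posterior draw of the continuous parameters, then average over the draws to get E[z_i | y]. The `draws` matrix of posterior samples is a made-up stand-in.

```r
# Rao-Blackwellized estimate of a mixture responsibility E[z_i | y],
# assuming `draws` holds posterior samples of (lambda, mu1, mu2) from a
# sampler that has already marginalized z out; sigma taken as known
responsibility <- function(y_i, draws, sigma = 1) {
  p1 <- draws[, "lambda"] * dnorm(y_i, draws[, "mu1"], sigma)
  p2 <- (1 - draws[, "lambda"]) * dnorm(y_i, draws[, "mu2"], sigma)
  mean(p1 / (p1 + p2))  # average of Pr[z_i = 1 | theta, y_i] over draws
}

# toy stand-in for real posterior draws
draws <- cbind(lambda = rbeta(1000, 2, 2),
               mu1    = rnorm(1000, -2, 0.1),
               mu2    = rnorm(1000, 2, 0.1))
responsibility(1.5, draws)
```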
I'm not quite curious enough to buy a book. A quick web search revealed something called the "coefficient of variation":
https://en.wikipedia.org/wiki/Coefficient_of_variation
The coefficient of variation is the inverse of the z-score (also known as the z-statistic or standard score) evaluated at x = 0
https://en.wikipedia.org/wiki/Standard_score
which it describes as a "standardized measure of dispersion". I'm more surprised that any one person would be familiar with R, SAS, and ADMB.
I don't use SAS, but ADMB is close to being the frequentist equivalent of STAN (it uses C++ to define models), though it also does MCMC sampling. I prefer the newer TMB, which is an R package.
I went to an ADMB conference a couple years ago in Seattle and then helped them add HMC and NUTS to ADMB and TMB and evaluate the results (Michael Betancourt on our side helped a lot too). Cole's one of the people involved and he's the one who wrote the paper I'm talking about:
Cole C. Monnahan, James T. Thorson, Trevor A. Branch. 2016. Faster estimation of Bayesian models in ecology using Hamiltonian Monte Carlo https://besjournals.onlinelibrary.wiley.com/doi/abs/10.1111/2041-210X.12681
If you're familiar with TMB (or with lme4) in R, those systems marginalize out the coefficients in a hierarchical regression in the same way in order to optimize (typically) or sample the hierarchical parameters. So if beta are the regression coefficients and phi the parameters of their hierarchical prior, we take the posterior p(phi, beta | y) and marginalize out the coefficients to get p(phi | y), which we can then either optimize or sample (you can't jointly optimize p(phi, beta | y) because the likelihood is unbounded). If you optimize the hierarchical parameters in this way (as lme4 and TMB do) and then plug them back in, some people call it "empirical Bayes" (others don't like the implication that full Bayes isn't empirical, so just use terms like max marginal likelihood or the like).
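For readers coming from R, a familiar instance of this is lme4 (a sketch with simulated data; nothing here is tied to mcmcr):

```r
library(lme4)

# simulate grouped data for illustration
set.seed(1)
n_groups <- 10
group <- factor(rep(seq_len(n_groups), each = 20))
x <- rnorm(length(group))
y <- 1 + 0.5 * x + rep(rnorm(n_groups, sd = 0.7), each = 20) + rnorm(length(group))
df <- data.frame(y, x, group)

# lmer integrates (marginalizes) the group intercepts out of the likelihood
# and optimizes only the fixed effects and variance components
fit <- lmer(y ~ x + (1 | group), data = df)
ranef(fit)$group  # conditional modes of the marginalized intercepts
```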
I have written prototypes of a family of packages that use the same functions to interface with STAN, JAGS and TMB. We use them all the time in our work.
Thanks for the link. I just got the pun in your name now that I'm thinking about fisheries models.
You might also be interested in the following package (I've never used it)
I met Kasper at that workshop. There's been a lot of this kind of interoperability/translation stuff going on with lots of packages these days.
Is there a worked example somewhere where I don't have to read R code?
No but the R function is relatively straightforward (where x is a vector of the MCMC samples for a single term)
```r
pvalue <- function(x) {
  n <- length(x)
  d <- sum(x >= 0)
  p <- min(d, n - d) * 2
  p <- max(p, 1)
  round(p / n, 4)
}
```
See https://github.com/poissonconsulting/mcmcr/blob/master/R/utils.R
Thanks.
That's only strictly true if size(x) is odd. There's a lot of latitude in how medians are calculated if there are even numbers of entries. Here's the default in R:
```r
median(log(c(3, 7, 110, 111)))  # 3.3
log(median(c(3, 7, 110, 111)))  # 4.1
```
You are correct, although in the case of MCMC samples, which tend to have very similar values around the 50% quantile, the problem is likely to be negligible. Plus users generally tend to generate even numbers of samples.
To try to deal with the other issues (complex models, deciding to do other posterior analysis later), we just added a feature in 2.19 that lets you take the parameter draws from one run and then rerun them to recompute generated quantities. It's in CmdStan now and should be out soon in CmdStanPy and maybe even in RStan or PyStan.
Good to know.
The posterior mean of theta is the Pr[y = 1]. The median is different because it's an asymmetric beta distribution.
OK I understand.
There's a nice paper by Cole Monahaan where he shows how to do the marginalizations in JAGS, too, where it greatly speeds up JAGS models.
Thanks for the reference.
Thanks for the link. I just got the pun in your name now that I'm thinking about fisheries models.
It amuses me.
Thanks for the interesting and informative discussion.
I'm thinking that the best way forward will be for me to introduce a function `coef_stan()` that returns a coefficient table that is consistent with the table output by STAN?
On Jul 24, 2019, at 11:38 AM, Joe Thorley notifications@github.com wrote:
That's only strictly true if size(x) is odd. There's a lot of latitude in how medians are calculated if there are even numbers of entries. Here's the default in R:
```r
median(log(c(3, 7, 110, 111)))  # 3.3
log(median(c(3, 7, 110, 111)))  # 4.1
```
You are correct although in the case of MCMC samples which tend to have very similar values around the 50% quantile the problem is likely to be negligible.
Right. And if you have multimodal posteriors where that's not true, you have bigger problems than interpolating medians.
Plus users generally tend to generate even numbers of samples.
The even-sized sequences are where you need interpolation for medians.
Thanks for the interesting and informative discussion.
This stuff's important or I wouldn't still be here. I like having these discussions in the open where we can all learn together. It's really useful for us to get feedback from thoughtful MCMC devs and users outside of the Stan community.
I'm thinking that the best way forward will be for me to introduce a function coef_stan() that returns a coefficient table that is consistent with the table output by STAN?
I think it'd make more sense for your package to make it self-consistent rather than consistent with Stan. Now if it's possible to have a mode of output that looks like our output, that might be helpful for users of all interfaces.
The obstacle is that we haven't even settled on a unified output design. All of our interfaces provide different default output. We haven't even been able to do that among just the Stan developers yet.
The biggest development we want to make going forward around this is to use ragged chains of different lengths. Right now, all of our R-hat and ESS calculations rely on the chains being the same length. The trick's going to be weighting them, with obvious extremes being by-draw and by-chain. We need to avoid the problem of just throwing away a chain if it gets stuck, as that can introduce bias into the posterior draws. The motivation's running a bunch of parallel chains for a fixed time, rather than a fixed number of iterations.
P.S. "Stan" isn't an acronym---it was named after Stanislaw Ulam with a wink at the Eminem song.
The even-sized sequences are where you need interpolation for medians.
Face palm.
I think it'd make more sense for your package to make it self-consistent rather than consistent with Stan. Now if it's possible to have a mode of output that looks like our output, that might be helpful for users of all interfaces.
What I am proposing is to keep the current `coef.mcmcr()` function but add a second one, `coef_stan.mcmcr()`, that includes the mean and differently labelled quantiles. I would make it as close as possible to the Stan output.
The obstacle is that we haven't even settled on a unified output design. All of our interfaces provide different default output. We haven't even been able to do that among just the Stan developers yet.
I guess I would try and make it as close as possible to what I considered most Stanish.
The biggest development we want to make going forward around this is to use ragged chains of different lengths. Right now, all of our R-hat and ESS calculations rely on the chains being the same length. The trick's going to be weighting them, with obvious extremes being by-draw and by-chain. We need to avoid the problem of just throwing away a chain if it gets stuck, as that can introduce bias into the posterior draws. The motivation's running a bunch of parallel chains for a fixed time, rather than a fixed number of iterations.
Interesting... I guess mcmcr could be expanded to pad out shorter chains with missing values... I need to think this over.
P.S. "Stan" isn't an acronym---it was named after Stanislaw Ulam with a wink at the Eminem song.
Nice
The example in the README.md shows output columns `term`, `estimate`, `sd`, `zscore`, `lower`, `upper`, and `pvalue`. I could not find doc for them other than pvalue, which has doc in R/utils.R that says, "@return A number indicating the p-value."

I hate to double-up issues like this, but I would also suggest changing some of these column names to be more informative at first glance.

- `lower` and `upper` are presumably quantiles, but the column headers don't say which ones. In my experience, users want to be able to control the quantiles, for instance to access the median or follow Richard McElreath's quirky lead in reporting prime number intervals in Statistical Rethinking.
- `estimate` doesn't say which estimate the package is using: is it a posterior mean (minimize expected square error), posterior median (minimize expected absolute error), or something else? I'm guessing it would be posterior mean since sd is one of the other columns.
- `pvalue` and `zscore` don't mean anything to me in this context, other than for use in posterior predictive checks. I would have guessed that the pvalue is just the standard normal cdf of the zscore, but that's not the case from looking at the values. This is the proximate cause for me opening the issue---to ask what these meant.

P.S. I actually came here because we're looking for general output analysis packages to recommend for use with (R)Stan.