paul-buerkner / brms

brms R package for Bayesian generalized multivariate non-linear multilevel models using Stan
https://paul-buerkner.github.io/brms/
GNU General Public License v2.0
1.28k stars 183 forks source link

"Marginal" hurdle models #641

Open rpsychologist opened 5 years ago

rpsychologist commented 5 years ago

I think adding "marginal" versions of the current hurdle models supported in brms would be extremely useful. Where "marginal" means that we model the effect of covariates on the overall outcome E(Y | b) instead of the conditionally positive part E(Y | Y > 0, b). This can be done simply by solving for the conditional mean as a function of the overall mean (cf., Smith et al https://doi.org/10.1177/0962280215592908). Here's a hackish example where the hurdle_gamma lpdf is changed to represent the marginal version using the custom_family function.

# Custom marginal gamma two-part ------------------------------------------
hurdle_gamma_mtp <- custom_family(
    "hurdle_gamma_mtp",
    dpars = c("shape","mu", "hu"),
    lb = c(0, NA, 0),
    ub = c(NA, NA, 1),
    links = c("log", "identity", "logit"),
    type = "real"
)

scode <- "
/* Marginal hurdle gamma log-PDF of a single response
* logit parameterization of the hurdle part
* Args:
*   y: the response value
*   alpha: shape parameter of the gamma distribution
*   mu: overall mu
*   hu: linear predictor for the hurdle part
* Returns:
*   a scalar to be added to the log posterior
*/
real hurdle_gamma_mtp_lpdf(real y, real alpha, real mu, real hu) {
    if (y == 0) {
        return bernoulli_logit_lpmf(1 | hu);
    } else {
        real beta;
        beta = alpha * exp(-(mu - log1m_inv_logit(hu)));
        return bernoulli_logit_lpmf(0 | hu) +
        gamma_lpdf(y | alpha, beta);
}
}
"
hurdle_gamma_mtp_stanvars <- stanvar(scode = scode,
                                     block = "functions")

As you can see the only real difference is solving for the conditionally positive mean using exp(-(mu - log1m_inv_logit(hu)));

Although the models are highly similar, I'm not sure if this is an easy change with regards to brms internal code.

paul-buerkner commented 5 years ago

I like the idea a lot and it wouldn't actually be too complicated to change the internals.

However, we can't just change the existing families because this would be a huge backwards compatibility break.

One option would be to introduce a new set of families, named for instance hurdle_gamma2 etc. that implements the mean parameterization. What would you think of that option?

rpsychologist commented 5 years ago

Adding new families might make it clear that they are separate but related models with different estimands, as compared to doing hurdle_gamma(marginal = TRUE).

A potential problem with Smith et al specification posted above is that Y | Y > 0 can sometimes be predicted to be unreasonably large for participants with a very low probability of Y > 0, e.g., someone who only report zeros. Maybe there's a smart way to improve the parameterization to avoid this.

paul-buerkner commented 5 years ago

What's the reason behind these unrealistic predictions? Does the standard parameterization not tend to make such predictions (not a rethorical question)?

rpsychologist commented 5 years ago

It happens when Pr(Y = 0) is estimated to be really high. So if the overall mean is small say $2, we can get extreme expected values for the gamma dist since we solve for the conditional positive value, e.g. 2/plogis(-20) = 970 330 393. Even if we estimate this value as highly unlikely, and it might have little effect on the estimates for the overall effect, this behavior still feels undesirable. The standard model is less affected by problems with very large hurdle probs, as the gamma params are not directly solved from the hurdle probs and the prior is on the conditional part.

paul-buerkner commented 5 years ago

I see, so it is good in any case to have both paramaterizations of zero-inflated or hurdle models.

rpsychologist commented 5 years ago

Absolutely, both are highly useful.