paul-buerkner / brms

brms R package for Bayesian generalized multivariate non-linear multilevel models using Stan
https://paul-buerkner.github.io/brms/
GNU General Public License v2.0
1.27k stars 183 forks source link

incorrect mixture probabilities returned when mixtures include hurdle families #1552

Closed jsocolar closed 10 months ago

jsocolar commented 11 months ago

While tinkering with https://github.com/paul-buerkner/brms/issues/1551, I realized that mixture() misbehaves as follows:

If I do

mix <- mixture(gaussian, poisson)

I error with

Error: Cannot mix families with real and integer support.

This makes sense.

But if I do:

mix <- mixture(gaussian, zero_one_inflated_beta)

brms doesn't complain. Moreover, if I do

set.seed(2)
dat <- data.frame(
  y = c(0, 1, runif(10))
)
mix <- mixture(gaussian, zero_one_inflated_beta)
fit <- brm(y ~ 1, dat, family = mix)
pp_mixture(fit)[1:3,,]

I get (in addition to a bunch of divergences) the final output

> pp_mixture(fit)[1:3,,]
, , P(K = 1 | Y)

   Estimate  Est.Error      Q2.5     Q97.5
1 0.9318899 0.10426227 0.6236484 0.9997673
2 0.9506989 0.07524888 0.7287995 0.9998412
3 0.9974165 0.02822344 0.9884541 1.0000000

, , P(K = 2 | Y)

     Estimate  Est.Error          Q2.5     Q97.5
1 0.068110140 0.10426227  2.326594e-04 0.3763516
2 0.049301144 0.07524888  1.587643e-04 0.2712005
3 0.002583492 0.02822344 1.282092e-162 0.0115459

This is incorrect; rows 1 and 2 are structurally guaranteed to arise from the zero-one-inflated-beta component which puts probability mass rather than probability density over zero and one.

I think the solution is to prohibit hurdle families as arguments to mixture().

paul-buerkner commented 11 months ago

I see the problem. However, I would not want to prevent these models from being excluded from mixture models because of this. Perhaps a warning could be sufficient.

jsocolar commented 11 months ago

I defer to you (obviously), but want to make sure that it's clear that this isn't a problem in pp_mixture() but rather a problem with the generated stancode.

set.seed(2)
dat <- data.frame(
  y = c(rep(0, 100), runif(10) rnorm(5, .5, .1))
)
mix <- mixture(gaussian, zero_inflated_beta)
make_stancode(y ~ 1, dat, family = mix)

Yields stan code including

    for (n in 1:N) {
      array[2] real ps;
      ps[1] = log(theta1) + normal_lpdf(Y[n] | mu1[n], sigma1);
      ps[2] = log(theta2) + zero_inflated_beta_lpdf(Y[n] | mu2[n], phi2, zi2);
      target += log_sum_exp(ps);
    }

But it's not correct to log_sum_exp through a mixture of probability densities and probability masses. The correct code here would be

    for (n in 1:N) {
      array[2] real ps;
      if(y[n] == 0){
        target += log(theta2) + log(zi2);
      } else {
        ps[1] = log(theta1) + normal_lpdf(Y[n] | mu1[n], sigma1);
        ps[2] = log(theta2) + log1m(zi2) + beta_lpdf(Y[n] | mu2[n], phi2);
        target += log_sum_exp(ps);
      }
    }

Here's a simple example that shows the consequences for parameter recovery:

library(brms)
set.seed(2)
dat <- data.frame(
  y = c(rep(0, 100), runif(20), rnorm(20, .5, .1))
)
mix <- mixture(gaussian, zero_inflated_beta)
prior <- c(
  prior(normal(.5, .05), Intercept, dpar = mu1),
  prior(normal(0, .2), Intercept, dpar = mu2),
  prior(lognormal(1,1), sigma1, lb = 0.01)
)
fit <- brm(y ~ 1, dat, family = mix, 
           prior = prior,
           backend = "cmdstanr", 
           cores = 4)
summary(fit)

I included the prior lb on sigma1 because this model desperately wants to estimate sigma1 on the boundary, which leads to very poor sampling when the boundary is zero.

Population-Level Effects: 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
mu1_Intercept     0.00      0.00    -0.00     0.00 1.00     4359     2987
mu2_Intercept     0.06      0.12    -0.17     0.30 1.00     3651     2576

Family Specific Parameters: 
       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma1     0.01      0.00     0.01     0.01 1.00     2869     1761
phi2       3.47      0.68     2.28     4.88 1.00     3765     2475
zi2        0.02      0.02     0.00     0.09 1.00     2789     1638
theta1     0.71      0.04     0.63     0.78 1.00     3981     2731
theta2     0.29      0.04     0.22     0.37 1.00     3981     2731

Even with the lb on sigma1 (not to mention the zero-avoiding prior), this model estimates mu1 as zero; i.e. in the ten-sigma region of its marginal prior. But the normal part of the likelihood should never even see any of the zeros in the data--in the correct Stan code, they would all go straight to the zero-inflation term via the control flow.

I think that pp_mixture() is in fact returning the correct mixture weights given the model that was fitted; i.e. treating the probability of a structural zero via zi as a probability density rather than a probability mass.

jsocolar commented 11 months ago

Note that the Stan code for mixtures including zero_inflated_poisson, zero_inflated_binomial etc is unproblematic, because these families don't mix probability masses and probability densities, and mixture()'s requirement not to mix integer and continuous families will prevent them from being mixed with probability densities later on.

paul-buerkner commented 11 months ago

Ah, that makes sense. So the problem are the continuoues zi/hu families, such as zero_inflated_lognormal?

jsocolar commented 11 months ago

Yup!

paul-buerkner commented 11 months ago

Got it. Will change accordingly.

paul-buerkner commented 10 months ago

Fixed :-)