stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org

Sampler appears to avoid zero in truncated hierarchical priors #2375

Open akerrigan opened 7 years ago

akerrigan commented 7 years ago

Summary:

Samples generated from a simple hierarchical normal prior (no model, just the prior) don't agree with the theoretical distribution, or with samples generated numerically, near the truncation point (zero).

Description:

This is from a Discourse thread: http://discourse.mc-stan.org/t/sampler-avoiding-zero-in-hierarchical-prior/

I'm comparing the results of using HMC-NUTS to sample directly from a (hierarchical) prior with using normal_rng to generate samples. What I've found is that for the scale parameter of the normal distribution, the sampled prior doesn't agree with the generated one - and for some reason the sampled prior avoids zero.

Here's the code for a minimal example:

parameters {
  real<lower=0> sigma_bar;
  real<lower=0> sigma_sig;
  real<lower=0> sigma;  
}

model {
  sigma_bar ~ normal(0.5,0.5) T[0,];
  sigma_sig ~ normal(0.5,0.5) T[0,];
  sigma ~ normal(sigma_bar,sigma_sig) T[0,]; 
}

generated quantities {
  real<lower=0> s_bar;
  real<lower=0> s_sig;
  real<lower=0> s;
  s_bar = -1;
  s_sig = -1;
  s = -1;
  while (s_bar < 0)
    s_bar = normal_rng(0.5, 0.5);
  while (s_sig < 0)
    s_sig = normal_rng(0.5, 0.5);
  while (s < 0)
    s = normal_rng(s_bar, s_sig);
}

Plotting the output, I find that the distribution of s_sig (the generated quantity) matches the theoretical PDF for the truncated normal perfectly, but sigma_sig (the sampled quantity) has a discrepancy near zero.

[Figure: sample_vs_generate_updatenormal, comparing the generated (s_sig) and sampled (sigma_sig) histograms against the true truncated-normal PDF]

Reproducible Steps:

I'm running MatlabStan; this is my code to generate that figure, more or less (stanfile.stan is the code listed above). Note that to be really confident in these histograms, I generate a large number of samples (2e7); this takes about ten minutes to run on my machine and creates CSV files close to a gigabyte.

dat = struct;
fit = stan('file','stanfile.stan','data',dat,'iter',2e7,'chains',4);
fit.verbose = true;
fit.block();
z = fit.extract('pars',{'sigma_bar','sigma_sig','sigma','s','s_bar','s_sig'});

% histograms of the generated (s_sig) and sampled (sigma_sig) draws,
% normalized to integrate to one
[n1,x1] = hist(z.s_sig,100);
[n2,x2] = hist(z.sigma_sig,100);
plot(x1,n1/trapz(x1,n1),'LineWidth',2);
hold on
plot(x2,n2/trapz(x2,n2),'LineWidth',3,'color',[0.7 0.7 0.7]);

% true pdf of normal(mu,sigma) truncated below at zero
phi = @(z) (1/sqrt(2*pi))*exp(-0.5*z.^2);  % standard normal pdf
PHI = @(x) (1/2)*(1+erf(x/sqrt(2)));       % standard normal cdf
mu = 0.5; sigma = 0.5;                     % params for the truncated normal
pdf_trunc = phi((x2-mu)/sigma)/(sigma*(1-PHI(-mu/sigma)));
plot(x2,pdf_trunc,'r--','LineWidth',2);

set(gca,'FontSize',18);
xlabel('sigma_sig','interpreter','none');
ylabel('Probability density');
legend({'Generated (s_sig)','Sampled (sigma_sig)','True PDF'},'interpreter','none');

Current Version:

I'm doing this with v2.14.0 (post sampler-bug-fix), and I don't think the sampler has changed since then. It shouldn't matter, hopefully, but I'm also using MatlabStan.

bob-carpenter commented 7 years ago

Thanks for posting a very clear issue. This one's tricky.

@betanalpha and @andrewgelman should take a look, but I don't think the fundamental problem can be fixed.

The problem is that Stan uses a log transform of the value constrained to (0, infinity), which maps zero to negative infinity and values just above zero to very large-magnitude negative values. Stan has trouble moving toward negative infinity in floating point. Were you getting divergence warnings from the sampler? Did the R-hats make it look like it had converged to the right posterior? (I'm looking for a clean example where R-hat gives false positives on convergence.)
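For concreteness, here is a minimal R sketch (my illustration, not Stan code) of the transform described above; for a parameter constrained to (0, infinity), Stan samples on the unconstrained scale y = log(x):

constrain   <- function(y) exp(y)   # unconstrained -> constrained
unconstrain <- function(x) log(x)   # constrained -> unconstrained

unconstrain(c(1e-1, 1e-4, 1e-8, 1e-16))
# -2.3, -9.2, -18.4, -36.8: the sampler has to travel ever farther in the
# unconstrained space to get closer to the boundary at zero

# log absolute Jacobian of x = exp(y) is just y; it gets added to the
# target log density on the unconstrained scale
log_jacobian <- function(y) y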

This is in general why we try to discourage hard constraints when there is a non-trivial amount of probability mass piled up at the boundary.

The cost of the rejection sampling strategy is inversely proportional to how much of the original mass survives truncation: the expected number of proposals per accepted draw is one over that surviving mass.
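As a quick check of that claim (my numbers, not from the thread), for a normal(0.5, 0.5) truncated below at zero the surviving mass is about 0.84, so the while-loop rejection sampler in the generated quantities block is cheap here:

p_accept <- pnorm(0, mean = 0.5, sd = 0.5, lower.tail = FALSE)  # ~0.84
1 / p_accept  # ~1.19 proposals per accepted draw on average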

This is a case where a custom (0, infinity) to (-infinity, infinity) transform might help get arbitrarily close to zero. It's something we've never looked into, but a positive example could be the basis for a nice short computational-stats case study and arXiv paper (working title: "Boundary issues").

bob-carpenter commented 7 years ago

The second question to ask is what it does to expectations, which is what you really care about computing. Just by eye, it looks like a pretty big discrepancy out to 0.05 or 0.1, so back of the envelope that looks like it could be 5% bias to the high side in estimating the posterior mean.

Another issue you may be having is that the funnel-like density (in the unconstrained space) you're defining is hard to explore down in the neck of the funnel; that can be another source of positive bias and usually produces divergence warnings. Have you tried just sampling a single sigma_bar here?

I'm about to try some of these things out of curiosity, as well as a different transform for positive values that's linear down to a small epsilon.
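To make "linear down to a small epsilon" concrete, here is one hypothetical form such a transform could take (my sketch in R; this is not anything Stan currently implements): exponential below the cutoff so the parameter can still get arbitrarily close to zero, linear above it, with value and first derivative matching at the joint.

# hypothetical positive-constraining transform, with eps an assumed cutoff
pos_constrain <- function(y, eps = 1e-3) {
  ifelse(y < 0, eps * exp(y), eps * (1 + y))
}
# log absolute Jacobian |dx/dy|, needed on the unconstrained scale
pos_log_jac <- function(y, eps = 1e-3) {
  ifelse(y < 0, log(eps) + y, log(eps))
}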

sakrejda commented 7 years ago

Thanks for jumping on this. I saw it on the Discourse list and encouraged @akerrigan to file it, since it seems like a good example of this problem. The fact that the problem starts as high as 0.05/0.1 makes me doubt it's just the transform; I think a tail problem is more likely, so I meant to work on it, just not this weekend.

bob-carpenter commented 7 years ago

OK, I ran three models to see what the results would look like for the simplest case.

reject.stan

generated quantities {
  real<lower = 0> x = -1;
  while (x < 0)
    x = normal_rng(0.5, 0.5);
}

simple.stan

parameters {
  real<lower = 0> x;
}
model {
  x ~ normal(0.5, 0.5);
}

trunc.stan

parameters {
  real<lower = 0> x;
}
model {
  x ~ normal(0.5, 0.5) T[0, ];
}

I ran them all with 4 chains of 500K warmup and 500K sampling iterations, with a small initial stepsize and a high target acceptance rate. Then I printed the results.

library(rstan)

# passing fit object as argument seems to type convert to make it unusable

fit_reject <- stan("reject.stan", algorithm = "Fixed_param", iter = 1e6)

fit_simple <- stan("simple.stan", iter = 1e6,
                   control = list(adapt_delta = 0.99, stepsize = 0.01))

fit_trunc <- stan("trunc.stan", iter = 1e6,
                  control = list(adapt_delta = 0.99, stepsize = 0.01))

print(fit_reject, pars=c("x"), probs=c(1e-4, 0.5, 1 - 1e-4), digits=5)
summary(extract(fit_reject)$x, digits=5)

print(fit_simple, pars=c("x"), probs=c(1e-4, 0.5, 1 - 1e-4), digits=5)
summary(extract(fit_simple)$x, digits=5)

print(fit_trunc, pars=c("x"), probs=c(1e-4, 0.5, 1 - 1e-4), digits=5)
summary(extract(fit_trunc)$x, digits=5)

Here are the results from Stan, in the order the programs were presented.

     mean se_mean      sd   0.01%    50%   99.99% n_eff Rhat
x 0.64404 0.00028 0.39688 0.00018 0.6006  2.38501 2e+06    1
x 0.64303 0.00069 0.39662 0.00017 0.5994  2.38905 333322 1.00001
x 0.64457 0.00068 0.39756 0.00015 0.60021 2.38471 342098 1.00001

This all looks fine: even the extreme quantiles agree, and the posterior means (mean) all match to within MCMC standard error (se_mean).

But if we look at the summary statistics, we get a different story near zero, and I believe this is what is worrying the original poster @akerrigan:

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0000003 0.3327600 0.6006000 0.6440400 0.9028400 3.0804000 
0.0000034 0.3316400 0.5994000 0.6430300 0.9022800 2.9697000 
0.0000023 0.3327700 0.6002100 0.6445700 0.9040800 2.9877000 

The minimum value being drawn by the rejection sampler is a lot lower. But is that behavior consistent? I ran the whole thing 20 times and dumped out a summary.

minz <- c()
for (i in 1:20) {
  fit_reject <- stan("reject.stan", algorithm = "Fixed_param", iter = 1e6,
                     seed = (i * 123 + 13))
  minz[i] <- min(extract(fit_reject)$x)
}
summary(minz)

and got

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
2.493e-08 2.203e-07 4.350e-07 8.368e-07 1.251e-06 2.468e-06 

As is to be expected, there's a huge amount of MC error around the minimum value out of 2M draws (and this isn't even MCMC; these are exact draws): the min and max differ by two orders of magnitude!
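A rough sanity check of that spread (my back-of-the-envelope, not from the thread): near zero the truncated normal(0.5, 0.5) density is roughly constant, so the minimum of N independent draws is approximately exponential with mean 1 / (N * density at zero), which lands right around the simulated mean above.

# density of normal(0.5, 0.5) truncated below at 0, evaluated at 0
dens0 <- dnorm(0, mean = 0.5, sd = 0.5) / pnorm(0.5 / 0.5)  # ~0.58
# approximate mean of the minimum over N = 4 chains x 500K = 2M draws
1 / (2e6 * dens0)  # ~8.7e-07, close to the Mean in the summary above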

In conclusion, I don't think there's anything to worry about here as far as using these for MCMC goes. You're not going to be able to get far out into the tail using MCMC no matter what you do.

andrewgelman commented 7 years ago

We discuss this in BDA, in the simulation chapter: the idea that even direct simulations are noisy!


akerrigan commented 7 years ago

I agree with @bob-carpenter that for a single variable, the error approaching zero is way less than any sensible MC error, and can probably be ignored.

But for the hierarchical combination I ended up using (which I think should be pretty standard), the error is a lot larger. I've added a zoomed-in figure to show that the discrepancy seems to go up to 0.1 - so even at a tenth of the peak value, there is a significant difference between the true distribution and the sampled one. (If you don't truncate explicitly with T[0,], the results are much worse - see the Discourse thread - and even simple linear regression will fail the Cook-Gelman-Rubin test.)

[Figure: bigsamples_zoom, a zoomed-in view of the sampled vs. generated distributions near zero]

In practice, once I had the variables appropriately truncated, my model has been passing the Cook-Gelman-Rubin calibration test, so I don't think even the hierarchical version is a huge problem. But the bias away from zero is definitely present.
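For readers unfamiliar with the Cook-Gelman-Rubin calibration test mentioned above, here is a minimal R sketch of the idea using a conjugate normal-mean model so no MCMC is needed (the conjugate model and all the constants are illustrative choices of mine, not the model from this thread): draw a parameter from the prior, simulate data, fit, and check that the true parameter's quantile among the posterior draws is uniform across replications.

set.seed(1)
n_reps <- 1000; n_obs <- 20; n_draws <- 500
q <- numeric(n_reps)
for (r in 1:n_reps) {
  theta <- rnorm(1, 0, 1)                      # draw parameter from prior
  y <- rnorm(n_obs, theta, 1)                  # simulate data given theta
  post_sd <- sqrt(1 / (1 + n_obs))             # conjugate posterior sd
  post_mean <- sum(y) / (1 + n_obs)            # conjugate posterior mean
  draws <- rnorm(n_draws, post_mean, post_sd)  # "posterior draws"
  q[r] <- mean(draws < theta)                  # quantile of the true value
}
hist(q)  # approximately uniform if the inference is calibrated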

bob-carpenter commented 7 years ago

I think the remaining problem is due to using a centered parameterization. This leads to the notorious "funnel" problem described in the manual, where the posterior scale varies dramatically with sigma. I wrote the obvious non-centered version of your full model, and it matches the simulation well even out to fairly extreme quantiles, though even with a 0.999 target acceptance rate it still throws a lot of divergence warnings. So I think we need @betanalpha's help here.

functions {
  real positive_half_normal_rng(real mu, real sigma) {
    real y = -1;
    while (y < 0)
      y = normal_rng(mu, sigma);
    return y;
  }
}
parameters {
  real<lower = 0> mu;
  real<lower = 0> sigma;
  real<lower = -mu / sigma> y_raw;
}

model {
  mu ~ normal(0.5, 0.5);  // don't need truncation with constant bounds
  sigma ~ normal(0.5, 0.5);
  y_raw ~ normal(0, 1) T[-mu / sigma, ];  // need truncation with variable bounds
}
generated quantities {
  real<lower = 0> y = mu + sigma * y_raw;
  real<lower = 0> mu_sim = positive_half_normal_rng(0.5, 0.5);
  real<lower = 0> sigma_sim = positive_half_normal_rng(0.5, 0.5);
  real<lower = 0> y_sim = positive_half_normal_rng(mu_sim, sigma_sim);
}
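(A note on the constraint: y = mu + sigma * y_raw is positive exactly when y_raw > -mu / sigma, which is why y_raw has the varying lower bound and the matching T[-mu / sigma, ] truncation in the model block.)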

I fit this with

fit_hier <- stan("hier-simple.stan", iter = 1e6,
                 control = list(adapt_delta = 0.999, stepsize = 0.001));

and it produced this warning

Warning messages:
1: There were 18516 divergent transitions after warmup. Increasing adapt_delta above 0.999 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 

That's roughly 20K out of 2M, or about one in a hundred, which is a huge red flag. Nevertheless, the results seem to match the simulation to within MCMC error.

       mean se_mean    sd 0.01%  0.1%   1%   10%   50%   90%   99% 99.9% 99.99%   n_eff Rhat
y     0.900   0.001 0.676     0 0.002 0.02 0.177 0.764 1.781 3.126 4.452  5.811  646447    1
y_sim 0.899   0.000 0.674     0 0.002 0.02 0.176 0.765 1.778 3.116 4.437  5.759 1994882    1