Closed: PeterLi2016 closed this issue 7 years ago.
For whatever subset of models we choose, we want to compare how HMC/NUTS performs against the ensemble samplers (stretch/walk). We want to vary the number of parameters, the number of samples per parameter, correlation among the parameters (not sure how to do this), and the scale of the parameters.
For the number of parameters I was thinking of just using 2^n from n=1 to 10 (?), and then for the number of samples per parameter, I'm not sure--maybe something along the lines of 2^n from n=0 to 7? I don't know how long that would take to run--generating the data files alone has taken quite a while and it's still not done (some of the larger data files are ~300 MB-600 MB).
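As a rough R sketch of the grid being described (the exact ranges are the open question above):
n.params <- 2^(1:10)          # number of parameters: 2, 4, ..., 1024
n.data.per.param <- 2^(0:7)   # samples (data points) per parameter: 1, 2, ..., 128
configs <- expand.grid(params = n.params, data.per.param = n.data.per.param)
nrow(configs)                 # 80 combinations, before also varying correlation and scale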
I've started with a hierarchical logistic regression, and I found that with fewer than 8 samples per parameter, both HMC and the ensemble samplers converge to the wrong values--in some cases some of the parameters converge to the right values, but often they do not.
Also, is anyone familiar with how to measure memory usage? I've been looking online and I'm finding it hard to follow a lot of the suggestions. In any case, we're going to be running everything through a python script (thanks Mitzi!).
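One low-tech option for the memory question, sketched in R for concreteness (this assumes GNU time on Linux; the binary and file names are placeholders, and the same trick works from a Python driver): wrap each run in /usr/bin/time and read the peak resident set size from its log:
system("/usr/bin/time -v ./hier_logistic sample data file=hier_logistic.data.R 2> time_log.txt")
# time_log.txt will then contain a "Maximum resident set size (kbytes)" line, i.e. the peak RAM of that run.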
I'm also confused about how we're going to measure time to converge.
For HMC/NUTS, I believe we are just using the warmup time (assuming it's converged by then), but what if it's converged before warmup is over?
Are we doing the same thing for the ensemble samplers, whose warmup just runs the same algorithm as sampling?
This is exactly the problem I was warning about yesterday — without knowing when we've actually hit stationarity (not that there's a well-defined cutoff anyway) we can't rigorously define "warmup time". My recommendation is to not worry about the details and just record how much time was used.
Exactly --- that's what we were talking about during the meeting and also what Peter and I were talking about yesterday after the meeting.
We'll just run long enough to converge, throw out the warmup altogether for both ensemble and HMC, and then just measure what happens after convergence.
If you have simple model examples that won't converge in NUTS, send them our way (or Michael's way).
For instance, this model
data {
int<lower=0> N;
int<lower=0> n;
matrix[N,n] x;
int<lower=0,upper=1> z[N];
}
parameters {
vector[n] beta;
real<lower=0> tau;
}
model {
tau ~ cauchy(0, 5);
beta ~ normal(0, tau);
z ~ bernoulli_logit(x*beta);
}
fails with 1 data point per parameter (i.e., with this data):
x <-
structure(c(1, 1, 1, -0.0557144621399927, 1.02089488206537, -2.6239499104236,
-0.315010743320061, 0.611820113275082, -1.8583627822769),
.Dim = c(3, 3))
z <-
c(0, 1, 0)
N <- 3
n <- 3
where the correct values are
beta <-
c(2.36082834041899, 0.878581822661834, 0.827042148589953)
tau <- 2
It's only when I get at least 8 data points per parameter that this model gives the right answer.
With HMC or the ensemble method or both? If HMC, what is n_divergent for each run?
With only one data point you’re going to have really pathological posteriors and I wouldn’t be surprised if neither fit using the centered parameterization.
Both. Here is the output for HMC:
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -6.5 5.6e-01 3.9e+00 -1.4e+01 -5.5 -2.4 47 186 1.0
accept_stat__ 0.57 8.5e-02 4.0e-01 5.6e-13 0.70 1.0 22 87 1.0
stepsize__ 0.043 5.8e-16 4.1e-16 4.3e-02 0.043 0.043 0.50 2.0 1.00
treedepth__ 2.8 2.8e-01 1.6e+00 1.0e+00 3.0 6.0 33 131 1.0
n_leapfrog__ 13 2.5e+00 4.2e+01 1.0e+00 5.0 35 299 1192 1.0
n_divergent__ 0.015 7.7e-03 1.2e-01 0.0e+00 0.00 0.00 244 973 1.0
beta[1] -4.4 1.2e+00 2.7e+01 -2.0e+01 -1.0 2.9 484 1930 1.0
beta[2] 10 2.3e+00 3.8e+01 -2.2e+00 2.4 47 279 1110 1.0
beta[3] 13 5.1e+00 8.5e+01 -4.6e+00 2.4 40 285 1136 1.00
tau 15 4.4e+00 5.9e+01 1.2e+00 3.9 48 180 718 1.0
and for stretch move ensemble:
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -12 2.0e-01 1.5e+00 -15 -12 -10 57 19 1.0
accept_stat__ 0.47 5.8e-03 1.4e-01 0.23 0.47 0.71 611 208 1.0
scale__ 2.0 1.1e-15 2.5e-13 2.0 2.0 2.0 50000 17053 1.00
beta[1] -5.7 1.1e+00 8.3e+00 -21 -3.3 0.35 60 20 1.0
beta[2] 15 3.7e+00 2.8e+01 1.2 7.4 46 59 20 1.0
beta[3] 13 3.2e+00 3.2e+01 -0.48 6.6 38 97 33 1.0
tau 18 4.0e+00 3.4e+01 4.2 10 53 71 24 1.0
and for walk move ensemble:
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -12 1.5e-01 1.2e+00 -1.4e+01 -12 -10 57 17 1.0
accept_stat__ 0.13 6.2e-03 1.0e-01 1.1e-03 0.12 0.32 263 78 1.0
scale__ 2.0 1.1e-15 2.5e-13 2.0e+00 2.0 2.0 50000 14831 1.00
beta[1] -3.6 6.7e-01 3.9e+00 -1.1e+01 -2.8 1.2 33 9.9 1.0
beta[2] 9.5 1.1e+00 7.8e+00 1.2e+00 7.4 25 47 14 1.0
beta[3] 10 1.3e+00 8.2e+00 1.3e+00 8.0 28 39 12 1.1
tau 13 1.2e+00 8.1e+00 4.3e+00 11 28 45 13 1.0
The correct values are the ones in the previous message.
HMC is so wonderful! It tells us when it fails!
Let’s start with tau, which has a HUGE tail — the 95% interval is over five standard deviations from the mean! Because there is very little data, we’re essentially sampling from the prior, which is not efficiently parameterized. This also manifests in the number of divergences which should all be at small tau. These divergences indicate that HMC cannot efficiently explore the model as is (which also means that the ensemble sampler shouldn’t be able to, either).
So, how do we fix it? One thing to do would be to tighten the prior a bit, say cauchy(0, 2.5) or even cauchy(0, 1) instead of cauchy(0, 5). But save that for later — what you really want to do is employ a non-centered parameterization,
data {
  int<lower=0> N;
  int<lower=0> n;
  matrix[N,n] x;
  int<lower=0,upper=1> z[N];
}
parameters {
  vector[n] beta_raw;
  real<lower=0> tau;
}
transformed parameters {
  vector[n] beta;
  beta <- tau * beta_raw;
}
model {
  tau ~ cauchy(0, 5);
  beta_raw ~ normal(0, 1);
  z ~ bernoulli_logit(x*beta);
}
This will break the strong correlations between tau and the betas and should dramatically improve sampling (at least for HMC).
Ok, so with the new model, for HMC, I get
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -1.6 7.2e-02 1.5e+00 -4.5 -1.3 0.25 452 5797 1.0
accept_stat__ 0.82 1.0e-02 2.6e-01 0.21 0.95 1.0 637 8179 1.00
stepsize__ 0.34 2.0e-15 1.4e-15 0.34 0.34 0.34 0.50 6.4 1.00
treedepth__ 3.0 2.3e-02 7.2e-01 2.0 3.0 4.0 940 12069 1.0
n_leapfrog__ 7.8 1.2e-01 3.9e+00 3.0 7.0 15 1138 14602 1.00
n_divergent__ 0.038 9.8e-03 1.9e-01 0.00 0.00 0.00 383 4919 1.00
beta_raw[1] 0.94 2.8e-02 7.3e-01 -0.19 0.90 2.2 703 9021 1.0
beta_raw[2] 0.59 2.7e-02 7.3e-01 -0.54 0.54 1.9 725 9300 1.0
beta_raw[3] 0.16 2.7e-02 8.0e-01 -1.1 0.15 1.5 865 11101 1.0
tau 26 7.6e+00 1.3e+02 0.71 5.8 82 286 3667 1.0
beta[1] 26 7.0e+00 1.5e+02 -0.41 4.8 74 434 5573 1.0
beta[2] 16 5.2e+00 1.1e+02 -4.1 2.4 52 446 5721 1.0
beta[3] -4.1 3.2e+00 6.9e+01 -24 0.48 21 464 5959 1.0
and for stretch ensemble:
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -7.2 6.2e-02 5.6e-01 -8.2 -7.1 -6.3 81 540 1.00
accept_stat__ 0.58 3.8e-03 1.3e-01 0.36 0.58 0.80 1219 8088 1.00
scale__ 2.0 3.0e-17 1.3e-15 2.0 2.0 2.0 2000 13271 1.00
beta_raw[1] 0.93 3.9e-02 2.2e-01 0.54 0.93 1.3 33 218 1.1
beta_raw[2] 0.54 6.7e-02 3.0e-01 0.10 0.51 1.1 20 135 1.1
beta_raw[3] 0.16 3.6e-02 2.6e-01 -0.26 0.15 0.63 53 349 1.1
tau 28 4.9e+00 6.6e+01 5.3 12 87 178 1179 1.0
beta[1] 27 5.2e+00 6.8e+01 4.0 12 85 170 1128 1.0
beta[2] 13 2.0e+00 2.9e+01 1.2 6.7 41 209 1384 1.0
beta[3] 2.9 1.1e+00 1.4e+01 -5.0 1.7 16 163 1082 1.0
and for walk move ensemble
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -7.1 9.2e-02 5.6e-01 -8.1 -7.1 -6.3 37 226 1.0
accept_stat__ 0.25 6.4e-03 1.3e-01 0.050 0.24 0.48 414 2516 1.00
scale__ 2.0 3.0e-17 1.3e-15 2.0 2.0 2.0 2000 12148 1.00
beta_raw[1] 0.92 2.4e-02 2.5e-01 0.52 0.91 1.3 109 661 1.0
beta_raw[2] 0.62 1.9e-02 2.2e-01 0.25 0.62 1.00 137 831 1.0
beta_raw[3] 0.19 3.5e-02 2.8e-01 -0.29 0.20 0.63 64 391 1.0
tau 29 5.8e+00 6.9e+01 5.1 12 102 141 859 1.0
beta[1] 27 5.3e+00 6.7e+01 3.4 11 91 159 965 1.0
beta[2] 17 3.3e+00 3.8e+01 2.2 7.3 54 138 838 1.0
beta[3] 7.6 2.4e+00 3.6e+01 -4.4 2.4 27 223 1353 1.00
And even with 2 data points per parameter, I get bad results--but not as bad.
HMC:
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -5.1 3.8e-02 1.6e+00 -8.3 -4.8 -3.2 1858 4239 1.0
accept_stat__ 0.83 6.4e-03 2.6e-01 0.13 0.95 1.0 1659 3787 1.0
stepsize__ 0.36 9.5e-15 6.7e-15 0.36 0.36 0.36 0.50 1.1 1.00
treedepth__ 2.9 1.1e-02 7.2e-01 2.0 3.0 4.0 3998 9122 1.0
n_leapfrog__ 6.9 5.0e-02 3.6e+00 3.0 7.0 15 5303 12101 1.00
n_divergent__ 0.015 1.7e-03 1.2e-01 0.00 0.00 0.00 5494 12537 1.0
beta_raw[1] 0.65 1.2e-02 7.2e-01 -0.62 0.70 1.7 3665 8364 1.0
beta_raw[2] 0.73 1.3e-02 7.5e-01 -0.46 0.72 2.0 3232 7376 1.0
beta_raw[3] 0.66 1.3e-02 8.0e-01 -0.67 0.69 1.9 3555 8112 1.0
tau 4.6 4.2e-01 1.5e+01 0.15 1.5 17 1225 2796 1.0
beta[1] 4.1 4.6e-01 1.7e+01 -0.34 1.1 15 1460 3332 1.00
beta[2] 3.4 5.2e-01 2.0e+01 -0.44 1.0 11 1502 3428 1.0
beta[3] 4.2 3.9e-01 1.5e+01 -0.60 0.88 17 1582 3610 1.00
Stretch Ensemble:
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -9.1 2.1e-02 5.3e-01 -10 -9.1 -8.3 649 802 1.0
accept_stat__ 0.57 1.8e-03 1.3e-01 0.35 0.58 0.79 5631 6952 1.00
scale__ 2.0 9.0e-16 9.0e-14 2.0 2.0 2.0 10000 12345 1.00
beta_raw[1] 0.61 1.8e-02 2.6e-01 0.19 0.61 1.0 204 252 1.0
beta_raw[2] 0.77 1.6e-02 2.5e-01 0.34 0.78 1.2 229 283 1.0
beta_raw[3] 0.63 2.1e-02 2.8e-01 0.15 0.64 1.1 177 218 1.0
tau 9.3 1.4e+00 3.9e+01 1.3 3.3 26 814 1005 1.0
beta[1] 5.7 7.6e-01 2.4e+01 0.40 2.1 16 959 1183 1.00
beta[2] 7.1 9.8e-01 3.0e+01 0.75 2.5 21 961 1186 1.00
beta[3] 5.9 8.0e-01 2.7e+01 0.34 2.1 16 1117 1379 1.00
Walk Ensemble:
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -9.1 4.6e-02 5.7e-01 -10 -9.1 -8.3 154 167 1.00
accept_stat__ 0.23 2.7e-03 1.2e-01 0.043 0.23 0.45 1971 2136 1.0
scale__ 2.0 9.0e-16 9.0e-14 2.0 2.0 2.0 10000 10839 1.00
beta_raw[1] 0.67 1.7e-02 2.5e-01 0.25 0.68 1.1 218 236 1.0
beta_raw[2] 0.73 1.4e-02 2.6e-01 0.30 0.73 1.2 360 390 1.0
beta_raw[3] 0.65 1.5e-02 2.8e-01 0.20 0.65 1.1 318 345 1.0
tau 8.0 1.6e+00 2.5e+01 1.2 3.2 31 233 252 1.0
beta[1] 5.9 1.3e+00 2.0e+01 0.46 2.2 21 251 272 1.0
beta[2] 5.6 9.5e-01 1.5e+01 0.64 2.3 19 249 270 1.0
beta[3] 5.8 1.3e+00 2.3e+01 0.38 2.1 21 295 320 1.0
You can see that it's better for small data but still not great. But you're also still getting divergences (when things are working there should be no divergences at all), which would indicate that the sampler is failing. At this point you have to force the step size to be smaller (either by hand or by increasing the target acceptance probability). Check out the experiments in the HMC for hierarchical models paper (http://arxiv.org/abs/1312.0906), where I had to push the step size down to ensure reasonable fits as I scanned through the pseudo-data size.
Ok, so I tried playing with adapt delta and with defaults HMC gives:
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -4.4e+00 3.3e-02 1.6e+00 -7.5 -4.1 -2.5 2209 5865 1.0
accept_stat__ 8.8e-01 4.5e-03 2.2e-01 0.33 0.97 1.0 2258 5994 1.0
stepsize__ 3.9e-01 1.7e-14 1.2e-14 0.39 0.39 0.39 0.50 1.3 1.00
treedepth__ 2.8e+00 8.1e-03 6.1e-01 2.0 3.0 4.0 5558 14754 1.00
n_leapfrog__ 6.6e+00 3.6e-02 3.0e+00 3.0 7.0 15 6984 18540 1.00
n_divergent__ 7.3e-03 1.2e-03 8.5e-02 0.00 0.00 0.00 5329 14145 1.00
beta_raw[1] -2.8e-01 1.2e-02 8.4e-01 -1.7 -0.28 1.1 4951 13143 1.00
beta_raw[2] 3.8e-01 1.3e-02 8.7e-01 -1.1 0.39 1.8 4554 12087 1.00
beta_raw[3] 5.9e-01 1.4e-02 9.7e-01 -1.0 0.62 2.1 4540 12052 1.00
tau 4.2e+00 1.1e+00 4.6e+01 0.076 0.92 11 1716 4556 1.0
beta[1] -1.7e+00 4.3e-01 2.2e+01 -5.6 -0.18 1.2 2704 7178 1.0
beta[2] 2.5e+00 6.4e-01 2.9e+01 -0.97 0.25 7.2 2057 5460 1.0
beta[3] 4.3e+00 1.3e+00 5.4e+01 -0.66 0.45 12 1784 4734 1.0
with adapt delta = .9 HMC gives:
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -4.4 3.0e-02 1.5e+00 -7.4 -4.1 -2.5 2669 6539 1.0
accept_stat__ 0.90 4.8e-03 1.9e-01 0.45 0.98 1.0 1553 3805 1.0
stepsize__ 0.33 5.5e-16 3.9e-16 0.33 0.33 0.33 0.50 1.2 1.00
treedepth__ 3.1 1.1e-02 6.8e-01 2.0 3.0 4.0 3772 9241 1.00
n_leapfrog__ 8.1 5.4e-02 3.9e+00 3.0 7.0 15 5288 12956 1.00
n_divergent__ 0.013 4.4e-03 1.2e-01 0.00 0.00 0.00 697 1708 1.0
beta_raw[1] -0.28 1.2e-02 8.3e-01 -1.6 -0.28 1.1 4383 10736 1.0
beta_raw[2] 0.36 1.4e-02 8.6e-01 -1.1 0.37 1.8 3865 9469 1.00
beta_raw[3] 0.59 1.7e-02 9.9e-01 -1.1 0.61 2.2 3496 8564 1.0
tau 11 6.2e+00 1.3e+02 0.077 0.94 14 414 1015 1.0
beta[1] 0.34 1.7e+00 4.7e+01 -5.6 -0.18 1.3 714 1749 1.0
beta[2] 2.4 6.0e-01 3.2e+01 -1.1 0.24 7.4 2881 7057 1.0
beta[3] 13 7.5e+00 1.6e+02 -0.68 0.45 15 472 1155 1.0
and with adapt delta=.99 HMC gives:
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -4.4e+00 3.0e-02 1.6e+00 -7.5 -4.1 -2.5 2675 3734 1.00
accept_stat__ 9.7e-01 2.0e-03 9.3e-02 0.89 1.00 1.00 2149 3000 1.0
stepsize__ 1.6e-01 1.3e-14 9.5e-15 0.16 0.16 0.16 0.50 0.70 1.00
treedepth__ 4.1e+00 8.4e-03 7.0e-01 3.0 4.0 5.0 6842 9552 1.00
n_leapfrog__ 1.8e+01 9.2e-02 8.1e+00 7.0 15 31 7785 10869 1.00
n_divergent__ 4.3e-03 1.2e-03 6.5e-02 0.00 0.00 0.00 2752 3842 1.00
beta_raw[1] -3.0e-01 1.3e-02 8.4e-01 -1.7 -0.31 1.1 3926 5481 1.0
beta_raw[2] 3.7e-01 1.3e-02 8.8e-01 -1.1 0.38 1.8 4415 6164 1.0
beta_raw[3] 5.7e-01 1.5e-02 9.7e-01 -1.1 0.59 2.1 4401 6144 1.0
tau 4.8e+00 9.7e-01 3.9e+01 0.072 0.93 11 1622 2265 1.0
beta[1] -2.5e+00 7.9e-01 3.5e+01 -6.3 -0.20 1.1 1967 2746 1.0
beta[2] 2.7e+00 6.5e-01 2.6e+01 -0.91 0.27 7.9 1589 2219 1.0
beta[3] 4.9e+00 1.2e+00 5.4e+01 -0.68 0.40 13 2151 3004 1.0
where the correct values are
beta_actual <-
c(1.25749972917763, 1.11573450571151, 1.41430662995209)
tau_actual <- 2
As a side note, is it better if I generate beta ~ normal(mu, sigma) or just keep what I have now (beta ~ normal(0, tau) )?
You can't expect to get the exact parameters back with just a couple of samples.
Just consider drawing two samples from normal(0,1). The MLE will just be their average, and that's unlikely to be 0. Similarly, the Bayesian posterior is unlikely to be centered on 0.
It's not until you get a whole lot of samples that you can expect to be near 0 in your MLE or posterior mean/median.
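A quick R illustration of this point (a toy sketch, not from the runs in this thread):
y2 <- rnorm(2)        # two draws from normal(0, 1)
mean(y2)              # the MLE for mu; typically well off 0 (its standard deviation is 1/sqrt(2))
mean(rnorm(100000))   # with a whole lot of draws the estimate sits close to 0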
What you're looking for is N% of the values being in the N% posterior intervals. And this is pretty close, with 3 out of 4 and one very close to being in the 95% interval.
For this project, you might need to stick to using more data. Even then, you can't expect to exactly recover the simulation (fake data) parameter values --- you'll just have to run HMC long enough to get a good answer you can believe in and then use that as a baseline.
P.S. There's also huge skew --- look at how far the medians are from the means.
For this project, you might need to stick to using more data. Even then, you can't expect to exactly recover the simulation (fake data) parameter values --- you'll just have to run HMC long enough to get a good answer you can believe in and then use that as a baseline.
I agree that it's fine to start at some reasonable amount of data (increasing the target acceptance rate helps, but the few-datum posteriors are just going to be too hard to sample from for a test like this), but I don't agree that we can't recover the simulated parameter values. If (mean - true) / MCSE doesn't converge to N(0, 1) then we have some serious problems.
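Spelled out, that check is just a per-parameter z-score; a sketch in R with placeholder posterior summaries alongside the simulated betas quoted earlier:
post_mean <- c(1.30, 0.90, 1.60)   # placeholder posterior means, not from the runs above
mcse      <- c(0.05, 0.04, 0.06)   # placeholder Monte Carlo standard errors
true_val  <- c(1.26, 1.12, 1.41)   # the simulated beta values from the runs above (rounded)
(post_mean - true_val) / mcse      # should look roughly standard normal if the check holds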
Anyways, can you get 4-5 data points per group to fit reasonably well? If so, just start there.
But the thing is, I'm running these for 10,000 warmup iterations and then 10,000 sampling iterations, so I thought that the results would have converged by then. Instead, with 1 data point per parameter, I was finding a lot of variation from run to run in the results. In any case, with 4 data points per parameter, the results are much more consistent from run to run and even across the samplers.
The correct values are
beta_actual <-
c(1.25749972917763, 1.11573450571151, 1.41430662995209)
tau_actual <- 2
HMC gives
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -8.6e+00 4.4e-02 1.8e+00 -1.2e+01 -8.2e+00 -6.4 1700 3787 1.00
accept_stat__ 7.9e-01 4.8e-03 2.7e-01 1.2e-01 9.2e-01 1.0 3136 6986 1.00
stepsize__ 4.7e-01 2.9e-15 2.1e-15 4.7e-01 4.7e-01 0.47 0.50 1.1 1.00
treedepth__ 2.5e+00 8.7e-03 6.4e-01 1.0e+00 3.0e+00 3.0 5342 11900 1.00
n_leapfrog__ 5.3e+00 3.3e-02 2.5e+00 1.0e+00 7.0e+00 7.0 5654 12595 1.00
n_divergent__ 6.0e-04 2.4e-04 2.4e-02 0.0e+00 0.0e+00 0.00 10000 22279 1.0
beta_raw[1] -5.4e-02 1.1e-02 6.7e-01 -1.2e+00 -1.7e-02 0.98 3432 7647 1.00
beta_raw[2] 3.3e-01 1.3e-02 7.6e-01 -8.3e-01 2.7e-01 1.7 3599 8018 1.00
beta_raw[3] 1.1e+00 1.4e-02 8.1e-01 -2.2e-01 1.1e+00 2.4 3546 7901 1.00
tau 1.3e+00 3.1e-02 1.2e+00 1.5e-01 1.0e+00 3.8 1562 3480 1.00
beta[1] 7.5e-02 2.1e-02 8.7e-01 -1.0e+00 -9.5e-03 1.4 1773 3950 1.00
beta[2] 2.5e-01 1.6e-02 9.0e-01 -1.2e+00 2.2e-01 1.7 3288 7324 1.00
beta[3] 1.6e+00 4.9e-02 2.0e+00 -6.4e-02 1.1e+00 5.0 1580 3521 1.00
Stretch Ensemble gives
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -1.2e+01 4.6e-02 6.1e-01 -1.3e+01 -1.2e+01 -12 178 203 1.0
accept_stat__ 5.8e-01 1.7e-03 1.3e-01 3.6e-01 5.8e-01 0.79 5802 6590 1.0
scale__ 2.0e+00 9.0e-16 9.0e-14 2.0e+00 2.0e+00 2.0 10000 11358 1.00
beta_raw[1] -3.5e-02 2.5e-02 2.3e-01 -4.2e-01 -3.0e-02 0.34 86 97 1.0
beta_raw[2] 3.6e-01 2.3e-02 2.4e-01 -1.5e-02 3.5e-01 0.76 110 125 1.0
beta_raw[3] 1.1e+00 1.7e-02 2.5e-01 7.2e-01 1.1e+00 1.5 212 241 1.00
tau 1.4e+00 3.5e-02 4.6e-01 7.8e-01 1.3e+00 2.2 167 190 1.0
beta[1] -3.3e-02 3.3e-02 3.0e-01 -5.2e-01 -3.9e-02 0.47 83 94 1.0
beta[2] 4.7e-01 3.0e-02 3.4e-01 -2.3e-02 4.5e-01 1.0 130 147 1.0
beta[3] 1.6e+00 4.7e-02 6.3e-01 7.4e-01 1.5e+00 2.7 186 211 1.0
Walk Ensemble gives
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -1.2e+01 5.1e-02 6.4e-01 -1.4e+01 -1.2e+01 -12 160 168 1.00
accept_stat__ 2.4e-01 2.6e-03 1.2e-01 4.8e-02 2.3e-01 0.46 2206 2317 1.0
scale__ 2.0e+00 9.0e-16 9.0e-14 2.0e+00 2.0e+00 2.0 10000 10500 1.00
beta_raw[1] -5.8e-02 1.2e-02 2.2e-01 -4.4e-01 -5.3e-02 0.30 348 365 1.0
beta_raw[2] 3.3e-01 1.4e-02 2.5e-01 -6.9e-02 3.2e-01 0.76 320 336 1.0
beta_raw[3] 1.1e+00 1.5e-02 2.6e-01 6.4e-01 1.1e+00 1.5 324 340 1.0
tau 1.4e+00 3.4e-02 4.8e-01 7.8e-01 1.3e+00 2.2 194 204 1.0
beta[1] -5.9e-02 1.5e-02 3.0e-01 -5.4e-01 -7.2e-02 0.45 381 401 1.00
beta[2] 4.4e-01 1.9e-02 3.6e-01 -1.0e-01 4.2e-01 1.0 371 389 1.0
beta[3] 1.5e+00 4.2e-02 6.5e-01 6.3e-01 1.4e+00 2.7 242 254 1.0
On Jul 23, 2014, at 3:51 AM, Michael Betancourt notifications@github.com wrote:
I agree that it's fine to start at some reasonable amount of data (increasing the target acceptance rate helps, but the few-datum posteriors are just going to be too hard to sample from for a test like this), but I don't agree that we can't recover the simulated parameter values. If (mean - true) / MCSE doesn't converge to N(0, 1) then we have some serious problems.
Isn't that only true as the number of data points approaches infinity? Maybe I'm misunderstanding.
Suppose you have a model with known scale:
y ~ normal(mu,1);
and no prior on mu. Now suppose you simulate a couple data points
y_sim ~ normal(0,1);
If y_sim = {-1, -2}, how is (mean - 0) / MCSE going to converge to 0? In fact, that quantity looks like it'll diverge, because mean << 0, and MCSE -> 0.
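A rough numerical version of this argument (my sketch; with a flat prior and known unit scale, the posterior for mu is normal(mean(y_sim), 1/sqrt(2))):
y_sim <- c(-1, -2)
draws <- rnorm(1e5, mean(y_sim), 1 / sqrt(2))   # stand-in for a well-mixing chain
post_mean <- mean(draws)                        # stays near -1.5 no matter how long we sample
mcse <- sd(draws) / sqrt(length(draws))         # keeps shrinking as the number of draws grows
(post_mean - 0) / mcse                          # so this blows up instead of settling near N(0, 1)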
This kind of behavior is exactly why geometric ergodicity is so important. Any MCMC estimate will converge to the true answer with an infinite number of iterations, but without geometric ergodicity we don't have any idea of how the estimate converges. In particular, it could wander for millions of iterations or thrash between two values and never really converge for any finite set of MCMC samples. Geometric ergodicity gives us the Monte Carlo Central Limit Theorem, which gives meaning to the Monte Carlo Standard Error.
Long story short, if you see divergences then you can't expect expectations to converge to the true values in any reasonable amount of time. Hopefully we'll have a paper on this out in a few months.
But that didn't answer my question.
The "true" answer for the posterior isn't the simulated parameters, but whatever's determined from the data. Now the true parameters should be in the 95% posterior intervals 95% of the time, but I think that's very different from what you're suggesting.
The divergences they're seeing are like 1 every 1000 iterations. Is that too many?
But that didn't answer my question.
The "true" answer for the posterior isn't the simulated parameters, but whatever's determined from the data. Now the true parameters should be in the 95% posterior intervals 95% of the time, but I think that's very different from what you're suggesting.
Ah, I see what you’re saying. There’s no guarantee that Bayesian probabilities will have proper Frequentist coverage, so you can’t rely on a test like that. Especially with strong priors.
The divergences they're seeing are like 1 every 1000 iterations. Is that too many?
It's tough — technically any divergences signal problems, but the resulting bias might be much smaller than the MCSE. The Rhat test discussed in the hierarchical paper might be the most rigorous test here.
Ok, so the graphics are uploaded here: https://drive.google.com/folderview?id=0B_zAlcJ3ybbwQXdhcnZiaE9YWEU&usp=sharing
I'll try running the hier_logistic model with the rest of the data files this weekend--I may end up reducing the number of runs or removing the data files with the most samples per data file if it takes too long.
Or you can cut down on the number of variations you do.
Having said that, the variation with data size is very interesting.
I'm most curious to see what happens when the number of parameters gets bigger and the ensembles need more particles.
Ok, so I ran the models this weekend and I was only able to get 3, 5, 9, 17 parameters (with 2^(1+1:6) data points per parameter) for all three samplers with 100 runs each. The plots are here.
I'm going to need to reconsider the scale of the number of parameters and number of data per parameter and number of runs. Any thoughts?
The link didn't work for me.
The only thought I have is that you could run on the cluster. I don't know how to do it, but I think Ben does. And Daniel might. We might also think of moving our intense distribution tests there if they are going to take two days up to third order.
You could also jump up more, as in multiplying by 4 or 8, and say trying 65 or 129 parameters next. I'm curious as to what's going to happen when the number of parameters increases beyond a handful (and footful).
That's odd. How about this one: https://drive.google.com/folderview?id=0B_zAlcJ3ybbwQXdhcnZiaE9YWEU&usp=drive_web ? The google drive settings are set so anyone with the link can access it.
I'll try running the larger models, though it might take a while.
You should play around with the memory comparisons — the zero offset is misleading. Certainly use some big unit, say n * 500 MB, or define some default process and normalize everything to that.
The N_eff/s for tau for the walkers is hugely skewed by large values which are almost definitely errors in the calculation (probably the chain getting stuck and giving negligible variances). At the very least filter out the tail events (or just define some max at, say, 1000) so we can see the good stuff.
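The capping could be as simple as this (a sketch; runs$tau_neff_per_sec is a made-up name for wherever the N_eff/s values live):
neff_per_sec <- runs$tau_neff_per_sec                  # hypothetical vector of N_eff/s values for tau
neff_capped  <- pmin(neff_per_sec, 1000)               # cap at 1000 so the bulk of the runs is visible
neff_trimmed <- neff_per_sec[neff_per_sec <= quantile(neff_per_sec, 0.99)]   # or drop the extreme tail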
The large k (parameters), large N (data) limit isn't particularly relevant to practice, which is typically large k, small N, so I think it's fair not to run tests out there; but if you do, consider running a centered-parameterization HMC too, which will perform much better.
That one worked. Thanks.
This is great. You can see the relevant issues already with what you have. As the number of parameters increases, you can see NUTS start to dominate. With few parameters, you can see the crossover where the relative speeds change based on the number of data points per parameter. (And as Michael says, it'd be nice to see the centered vs. non-centered parameterization comparisons, as well.)
This also shows how relatively bad HMC is at estimating the hierarchical variance. I don't know if there are other parameterization tricks to help with this, but it's problematic for us as is, I think.
Would it make sense to use ratios of n_eff/s or perhaps better, s/n_eff, rather than differences?
It'd also be great if you could get tau into the summary plots of parameter fits.
I'm also surprised all the RAM usage is so close. Are you running through R or some other system that involves saving the samples? I wouldn't expect them to be so close otherwise.
Also, what is warmup time measuring? Just the raw amount of time spent in the warmup phase? I think it would be nice for at least a couple of cases to get an idea of how much more quickly HMC converges. If you use 1000 warmup iterations for both, that's probably overkill for HMC, so comparing basic warmup times isn't so informative when the convergence happens early on during warmup.
Ok, I've got everything working on the clusters now, so hopefully the hierarchical logistic models will be done running in a few days.
For the hierarchical linear model, I'm finding that even with a decent number of data points per parameter, the stretch/walk ensembles and HMC are both converging to similar values, like
for HMC:
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -1.8e+02 2.3e-01 2.2e+00 -1.9e+02 -1.8e+02 -180 92 108 1.0
accept_stat__ 8.2e-01 2.7e-02 2.8e-01 8.8e-02 9.6e-01 1.0 110 129 1.00
stepsize__ 3.2e-01 1.1e-15 7.8e-16 3.2e-01 3.2e-01 0.32 0.50 0.59 1.00
treedepth__ 3.1e+00 5.8e-02 8.1e-01 2.0e+00 3.0e+00 4.0 197 230 1.0
n_leapfrog__ 8.7e+00 2.9e-01 4.8e+00 3.0e+00 7.0e+00 15 265 311 1.0
n_divergent__ 9.0e-03 3.3e-03 9.4e-02 0.0e+00 0.0e+00 0.00 830 972 1.0
beta_raw[1] 4.9e-01 3.4e-02 7.1e-01 -7.1e-01 4.8e-01 1.7 445 521 1.0
beta_raw[2] 5.0e-02 4.5e-02 8.1e-01 -1.4e+00 7.7e-02 1.4 325 381 1.00
beta_raw[3] -1.6e-01 3.2e-02 7.2e-01 -1.3e+00 -1.7e-01 1.1 525 614 1.0
tau 8.5e-02 1.1e-02 9.0e-02 5.2e-03 5.2e-02 0.30 67 79 1.0
sigma 9.7e-01 1.7e-03 3.5e-02 9.1e-01 9.6e-01 1.0 425 498 1.0
beta[1] 3.7e-02 2.5e-03 4.2e-02 -1.6e-02 3.1e-02 0.11 291 341 1.00
beta[2] 6.4e-03 2.6e-03 4.9e-02 -6.5e-02 2.8e-03 0.093 354 415 1.0
beta[3] -9.9e-03 2.3e-03 4.7e-02 -9.4e-02 -6.0e-03 0.064 429 503 1.0
and stretch
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -5.4e+02 6.2e-02 6.6e-01 -5.4e+02 -5.4e+02 -540 114 59 1.00
accept_stat__ 5.1e-01 1.6e-03 1.2e-01 3.1e-01 5.1e-01 0.71 5441 2804 1.0
scale__ 2.0e+00 9.0e-16 9.0e-14 2.0e+00 2.0e+00 2.0 10000 5154 1.00
beta_raw[1] 5.3e-01 2.3e-02 2.2e-01 1.8e-01 5.2e-01 0.89 93 48 1.00
beta_raw[2] 3.0e-02 2.4e-02 2.5e-01 -3.9e-01 2.9e-02 0.44 107 55 1.0
beta_raw[3] -9.1e-02 2.4e-02 2.5e-01 -4.8e-01 -1.0e-01 0.34 107 55 1.0
tau 8.6e-02 5.9e-03 4.4e-02 4.1e-02 7.5e-02 0.16 54 28 1.0
sigma 9.7e-01 8.3e-04 1.0e-02 9.5e-01 9.7e-01 0.99 149 77 1.0
beta[1] 4.5e-02 3.3e-03 2.9e-02 1.2e-02 3.9e-02 0.096 76 39 1.0
beta[2] 3.1e-03 2.3e-03 2.3e-02 -3.2e-02 2.1e-03 0.040 108 56 1.0
beta[3] -7.5e-03 2.3e-03 2.3e-02 -4.3e-02 -7.8e-03 0.027 101 52 1.0
where the correct values are
beta_actual <-
c(-2.54587437058592, -2.81109369156363, -2.24446230109798)
And I generated the data similarly to how I did for the logistic regression:
n.data.per.param <- 2^(1 + 1:6)
for (j in 1:8) {
P <- ncol(cholesky.corr[[j]])+1
tau_actual <- 2
beta_actual <- rnorm(P, 0, tau_actual)
sigma_actual <- 0.1
for (k in 1:length(n.data.per.param)) {
num.iter <- P * n.data.per.param[k]
x <- matrix(data=NA, ncol = P, nrow=num.iter)
for (i in 1:num.iter) {
temp.norm <- rnorm(P-1, mean=0, sd=1)
cor.rv <- cholesky.corr[[j]] %*% temp.norm
x[i,] <- c(1,cor.rv)
}
z.temp <- x %*% beta_actual
z <- rnorm(z.temp,sigma_actual)
N <- length(z)
n <- ncol(x)
list <- c("x","z", "N", "n","beta_actual","tau_actual","sigma_actual")
stan_rdump(list, file=paste("hier_linear_",j,"_",k,".data.R",sep=""))
}
}
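One thing worth double-checking in the snippet above: in R, rnorm(z.temp, sigma_actual) only uses z.temp for its length and then draws from normal(mean = sigma_actual, sd = 1), so z never depends on x %*% beta_actual, which may explain the fitted betas sitting near zero. A corrected line (my reading, not something stated in the thread) would be:
z <- rnorm(length(z.temp), mean = z.temp, sd = sigma_actual)
# or equivalently: z <- as.vector(x %*% beta_actual) + rnorm(num.iter, 0, sigma_actual)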
I take it the StdDev and 5% and 95% are for the ensemble means?
Why are the values of beta so close to 0 when the actual values are close to -2.5?
I take it the StdDev and 5% and 95% are for the ensemble means?
Yes.
Why are the values of beta so close to 0 when the actual values are close to -2.5?
I have no idea. This was with a decent amount of data points per parameter too--I think around 8.
I'm uploading the graphs here. It might take an hour or so for it to finish uploading.
The first tier of numbers for the folders represents the number of beta parameters (i.e. 1 + 2^number).
The second tier of numbers for the folders represents the number of data points per parameter (i.e. 2^(1+number))
Some of the ensemble runs still took too much wall time, so there are some graphs with only HMC data. This is in the respective param folders. For these runs, the ram, sampling_times, and warmup_times folders may still show ensemble data, but this is an error that I still need to fix. The summary folders are empty since I wasn't sure how best to combine the plots for all of the parameters into one -- I can't put them all on 1 page anymore since there are so many more parameters in some cases.
A few other notes: NUTS pretty much always converged. Stretch ensemble mostly converges, though it takes a lot more iterations and wall time than NUTS typically. Walk ensemble was unable to converge in a lot of cases (I used the same number of iterations as I did for stretch ensemble); this typically happened after the number of parameters increased to 9 or more.
Sweeeeeet.
This will soon be a published paper, I’m sure…
After working some sweet, sweet template magic the refactor compile time is on par with the develop compile time (both just under 12 seconds for a simple model with very little autodiff to compile). BUT we can do better.
Right now there's a nontrivial hit from having to pass a Writer through all of the samplers. But the Writer is used only once, when a Hamiltonian class tries to compute a gradient and spits out the "about to be rejected" error message on failure. In particular, the Hamiltonian class wraps the gradient call in a try/catch block, and if it catches an exception then it sets the potential to infinity (which will allow the sampler to gracefully terminate) and spits out the error message.
In order to avoid a Writer template we'd have to either pass the writer through each call (sample -> transition -> evolve -> components of the leapfrog update -> gradient) or not catch the exception internally (in which case the sampler won't be able to fail as gracefully). Both seem ugly and unhappy.
Anyone have any suggestions on how to sneak an error message out from deep in the code without having an internal reference to a Writer implementation? I dunno, like storing an error string and calling write_error_message after each transition or something?
One thing to do is wrap all the components that need to get plumbed into the depths into a struct so the passing isn't so onerous.
I take it it's not possible just have the Hamiltonian class throw an exception that's caught above?
One thing to do is wrap all the components that need to get plumbed into the depths into a struct so the passing isn't so onerous.
It's just one object that needs to be passed, but it has to be passed through a ton of functions since the log_prob call is deep inside all of the MCMC code relative to the transition method.
I take it it's not possible just have the Hamiltonian class throw an exception that's caught above?
The problem is that it would mess up the way the MCMC code currently fails gracefully. I think it would be possible to play hot potato with the exceptions, but it would require a nontrivial rewrite of the MCMC code.
Isn't this commentary on the wrong issue?
And for some reason, it looks like your character encoding is being messed up.
I don't understand the details, but shouldn't throwing an exception have the same behavior as setting the log prob to -inf?
Rolled into #362
I talked to Bob about comparing the ensemble samplers that are in https://github.com/stan-dev/stan/tree/feature/issue-362-ensemble-samplers to HMC but we're not sure what models to use for comparison. We want to use a selection of models that are fairly representative. Any suggestions?