stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

Compare Ensemble Samplers to HMC #774

Closed PeterLi2016 closed 7 years ago

PeterLi2016 commented 10 years ago

I talked to Bob about comparing the ensemble samplers that are in https://github.com/stan-dev/stan/tree/feature/issue-362-ensemble-samplers to HMC but we're not sure what models to use for comparison. We want to use a selection of models that are fairly representative. Any suggestions?

PeterLi2016 commented 10 years ago

For whatever subset of models we choose, we want to compare how HMC/NUTS performs compared to the Ensemble Samplers (stretch/walk). We want to vary the number of parameters, the number of samples per parameter, correlation among the parameters (not sure how to do this), and the scale of the parameters.
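For the correlation dimension, one standard recipe (the same idea shows up later in this thread as the cholesky.corr list in R) is to multiply i.i.d. standard normal draws by a Cholesky factor of a target correlation matrix. A minimal Python sketch, with the dimension and correlation value purely illustrative:

import numpy as np

rng = np.random.default_rng(1234)
P = 4      # number of predictors (illustrative)
rho = 0.5  # target pairwise correlation (illustrative)
corr = np.full((P, P), rho)
np.fill_diagonal(corr, 1.0)

L = np.linalg.cholesky(corr)               # corr = L @ L.T
x = rng.standard_normal((1000, P)) @ L.T   # each row ~ N(0, corr)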

For the number of parameters I was thinking of just using 2^n from n=1 to 10 (?) and then for number of samples per parameter, I'm not sure--maybe something along the lines of 2^n from n=0 to 7? I don't know how long that would take to run--generating the data files alone has taken quite a while and it's still not done (some of the larger data files are ~300 MB-600 MB).

I've started with a hierarchical logistic regression and I found that with fewer than 8 samples per parameter, both HMC and the ensemble samplers converge to the wrong values--in some cases some of the parameters converge to the right values, but often they do not.

Also, is anyone familiar with how to measure memory usage? I've been looking online and I'm finding it hard to follow a lot of the suggestions. In any case, we're going to be running everything through a Python script (thanks Mitzi!).
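One way to get a rough peak-memory number from the Python harness is to ask the OS for the maximum resident set size of finished child processes right after the sampler subprocess returns. A minimal sketch (the binary name and arguments are placeholders; note ru_maxrss is reported in kilobytes on Linux but bytes on macOS):

import resource
import subprocess

# Placeholder command; substitute the actual compiled model and arguments.
subprocess.run(["./hier_logistic", "sample", "data", "file=hier_logistic.data.R"],
               check=True)

# Peak resident set size over all completed child processes.
usage = resource.getrusage(resource.RUSAGE_CHILDREN)
print("peak RSS of sampler run:", usage.ru_maxrss)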

PeterLi2016 commented 10 years ago

I'm also confused about how we're going to measure time to converge.

For HMC/NUTS, I believe we are just using the warmup time (assuming it's converged by then), but what if it's converged before warmup is over?

Are we doing the same thing for the ensemble samplers--their warmup consists of just running the same algorithm as it does in sampling?

betanalpha commented 10 years ago

This is exactly the problem of which I was warning yesterday — without knowing when we’ve actually hit stationarity (not that there’s a well-defined cutoff anyways) we can’t rigorously define “warmup time”. My recommendation is to not worry about details and just record what time was used.


bob-carpenter commented 10 years ago

Exactly --- that's what we were talking about during the meeting and also what Peter and I were talking about yesterday after the meeting.

We'll just run long enough to converge, throw out the warmup altogether for both ensemble and HMC, and then just measure what happens after convergence.

If you have simple model examples that won't converge in NUTS, send them our way (or Michael's way).


PeterLi2016 commented 10 years ago

For instance, this model

data {
  int<lower=0> N;
  int<lower=0> n;
  matrix[N,n] x;
  int<lower=0,upper=1> z[N];
}
parameters {
  vector[n] beta;
  real<lower=0> tau;
}
model {
  tau ~ cauchy(0, 5);
  beta ~ normal(0, tau);
  z ~ bernoulli_logit(x*beta);
}

fails with 1 data point per parameter, i.e., with this data:

x <- 
structure(c(1, 1, 1, -0.0557144621399927, 1.02089488206537, -2.6239499104236,
-0.315010743320061, 0.611820113275082, -1.8583627822769),
.Dim = c(3, 3))
z <- 
c(0, 1, 0)
N <- 3
n <- 3

where correct values are

beta <- 
c(2.36082834041899, 0.878581822661834, 0.827042148589953)
tau <- 2

PeterLi2016 commented 10 years ago

It's only when I get at least 8 data points per parameter that this model gives the right answer.

betanalpha commented 10 years ago

With HMC or the ensemble method or both? If HMC, what is n_divergent for each run?

With only one data point you’re going to have really pathological posteriors and I wouldn’t be surprised if neither fits well using the centered parameterization.


PeterLi2016 commented 10 years ago

Both. Here is the output for HMC:

                 Mean     MCSE   StdDev        5%    50%    95%  N_Eff  N_Eff/s  R_hat
lp__             -6.5  5.6e-01  3.9e+00  -1.4e+01   -5.5   -2.4     47      186    1.0
accept_stat__    0.57  8.5e-02  4.0e-01   5.6e-13   0.70    1.0     22       87    1.0
stepsize__      0.043  5.8e-16  4.1e-16   4.3e-02  0.043  0.043   0.50      2.0   1.00
treedepth__       2.8  2.8e-01  1.6e+00   1.0e+00    3.0    6.0     33      131    1.0
n_leapfrog__       13  2.5e+00  4.2e+01   1.0e+00    5.0     35    299     1192    1.0
n_divergent__   0.015  7.7e-03  1.2e-01   0.0e+00   0.00   0.00    244      973    1.0
beta[1]          -4.4  1.2e+00  2.7e+01  -2.0e+01   -1.0    2.9    484     1930    1.0
beta[2]            10  2.3e+00  3.8e+01  -2.2e+00    2.4     47    279     1110    1.0
beta[3]            13  5.1e+00  8.5e+01  -4.6e+00    2.4     40    285     1136   1.00
tau                15  4.4e+00  5.9e+01   1.2e+00    3.9     48    180      718    1.0

and for stretch move ensemble:

                Mean     MCSE   StdDev     5%   50%   95%  N_Eff  N_Eff/s  R_hat
lp__             -12  2.0e-01  1.5e+00    -15   -12   -10     57       19    1.0
accept_stat__   0.47  5.8e-03  1.4e-01   0.23  0.47  0.71    611      208    1.0
scale__          2.0  1.1e-15  2.5e-13    2.0   2.0   2.0  50000    17053   1.00
beta[1]         -5.7  1.1e+00  8.3e+00    -21  -3.3  0.35     60       20    1.0
beta[2]           15  3.7e+00  2.8e+01    1.2   7.4    46     59       20    1.0
beta[3]           13  3.2e+00  3.2e+01  -0.48   6.6    38     97       33    1.0
tau               18  4.0e+00  3.4e+01    4.2    10    53     71       24    1.0

and for walk move ensemble:

                Mean     MCSE   StdDev        5%   50%   95%  N_Eff  N_Eff/s  R_hat
lp__             -12  1.5e-01  1.2e+00  -1.4e+01   -12   -10     57       17    1.0
accept_stat__   0.13  6.2e-03  1.0e-01   1.1e-03  0.12  0.32    263       78    1.0
scale__          2.0  1.1e-15  2.5e-13   2.0e+00   2.0   2.0  50000    14831   1.00
beta[1]         -3.6  6.7e-01  3.9e+00  -1.1e+01  -2.8   1.2     33      9.9    1.0
beta[2]          9.5  1.1e+00  7.8e+00   1.2e+00   7.4    25     47       14    1.0
beta[3]           10  1.3e+00  8.2e+00   1.3e+00   8.0    28     39       12    1.1
tau               13  1.2e+00  8.1e+00   4.3e+00    11    28     45       13    1.0

The correct values are the ones in the previous message.

betanalpha commented 10 years ago

HMC is so wonderful! It tells us when it fails!

Let’s start with tau, which has a HUGE tail — the 95% interval is over five standard deviations from the mean! Because there is very little data, we’re essentially sampling from the prior, which is not efficiently parameterized. This also manifests in the number of divergences which should all be at small tau. These divergences indicate that HMC cannot efficiently explore the model as is (which also means that the ensemble sampler shouldn’t be able to, either).

So, how do we fix it? One thing to do would be to tighten the prior a bit, say cauchy(0, 2.5) or even cauchy(0, 1) instead of cauchy(0, 5). But save that for later — what you really want to do is employ a non-centered parameterization,

data {
  int<lower=0> N;
  int<lower=0> n;
  matrix[N,n] x;
  int<lower=0,upper=1> z[N];
}
parameters {
  vector[n] beta_raw;
  real<lower=0> tau;
}
transformed parameters {
  vector[n] beta;
  beta <- tau * beta_raw;
}
model {
  tau ~ cauchy(0, 5);
  beta_raw ~ normal(0, 1);
  z ~ bernoulli_logit(x*beta);
}

This will break the strong correlations between tau and the betas and should dramatically improve sampling (at least for HMC).


PeterLi2016 commented 10 years ago

Ok, so with the new model, for HMC, I get

                 Mean     MCSE   StdDev     5%   50%   95%  N_Eff  N_Eff/s  R_hat
lp__             -1.6  7.2e-02  1.5e+00   -4.5  -1.3  0.25    452     5797    1.0
accept_stat__    0.82  1.0e-02  2.6e-01   0.21  0.95   1.0    637     8179   1.00
stepsize__       0.34  2.0e-15  1.4e-15   0.34  0.34  0.34   0.50      6.4   1.00
treedepth__       3.0  2.3e-02  7.2e-01    2.0   3.0   4.0    940    12069    1.0
n_leapfrog__      7.8  1.2e-01  3.9e+00    3.0   7.0    15   1138    14602   1.00
n_divergent__   0.038  9.8e-03  1.9e-01   0.00  0.00  0.00    383     4919   1.00
beta_raw[1]      0.94  2.8e-02  7.3e-01  -0.19  0.90   2.2    703     9021    1.0
beta_raw[2]      0.59  2.7e-02  7.3e-01  -0.54  0.54   1.9    725     9300    1.0
beta_raw[3]      0.16  2.7e-02  8.0e-01   -1.1  0.15   1.5    865    11101    1.0
tau                26  7.6e+00  1.3e+02   0.71   5.8    82    286     3667    1.0
beta[1]            26  7.0e+00  1.5e+02  -0.41   4.8    74    434     5573    1.0
beta[2]            16  5.2e+00  1.1e+02   -4.1   2.4    52    446     5721    1.0
beta[3]          -4.1  3.2e+00  6.9e+01    -24  0.48    21    464     5959    1.0

and for stretch ensemble:

                Mean     MCSE   StdDev     5%   50%   95%  N_Eff  N_Eff/s  R_hat
lp__            -7.2  6.2e-02  5.6e-01   -8.2  -7.1  -6.3     81      540   1.00
accept_stat__   0.58  3.8e-03  1.3e-01   0.36  0.58  0.80   1219     8088   1.00
scale__          2.0  3.0e-17  1.3e-15    2.0   2.0   2.0   2000    13271   1.00
beta_raw[1]     0.93  3.9e-02  2.2e-01   0.54  0.93   1.3     33      218    1.1
beta_raw[2]     0.54  6.7e-02  3.0e-01   0.10  0.51   1.1     20      135    1.1
beta_raw[3]     0.16  3.6e-02  2.6e-01  -0.26  0.15  0.63     53      349    1.1
tau               28  4.9e+00  6.6e+01    5.3    12    87    178     1179    1.0
beta[1]           27  5.2e+00  6.8e+01    4.0    12    85    170     1128    1.0
beta[2]           13  2.0e+00  2.9e+01    1.2   6.7    41    209     1384    1.0
beta[3]          2.9  1.1e+00  1.4e+01   -5.0   1.7    16    163     1082    1.0

and for walk move ensemble

                Mean     MCSE   StdDev     5%   50%   95%  N_Eff  N_Eff/s  R_hat
lp__            -7.1  9.2e-02  5.6e-01   -8.1  -7.1  -6.3     37      226    1.0
accept_stat__   0.25  6.4e-03  1.3e-01  0.050  0.24  0.48    414     2516   1.00
scale__          2.0  3.0e-17  1.3e-15    2.0   2.0   2.0   2000    12148   1.00
beta_raw[1]     0.92  2.4e-02  2.5e-01   0.52  0.91   1.3    109      661    1.0
beta_raw[2]     0.62  1.9e-02  2.2e-01   0.25  0.62  1.00    137      831    1.0
beta_raw[3]     0.19  3.5e-02  2.8e-01  -0.29  0.20  0.63     64      391    1.0
tau               29  5.8e+00  6.9e+01    5.1    12   102    141      859    1.0
beta[1]           27  5.3e+00  6.7e+01    3.4    11    91    159      965    1.0
beta[2]           17  3.3e+00  3.8e+01    2.2   7.3    54    138      838    1.0
beta[3]          7.6  2.4e+00  3.6e+01   -4.4   2.4    27    223     1353   1.00

PeterLi2016 commented 10 years ago

And even with 2 data points per parameter, I get bad results--but not as bad.

HMC:

                 Mean     MCSE   StdDev     5%   50%   95%  N_Eff  N_Eff/s  R_hat
lp__             -5.1  3.8e-02  1.6e+00   -8.3  -4.8  -3.2   1858     4239    1.0
accept_stat__    0.83  6.4e-03  2.6e-01   0.13  0.95   1.0   1659     3787    1.0
stepsize__       0.36  9.5e-15  6.7e-15   0.36  0.36  0.36   0.50      1.1   1.00
treedepth__       2.9  1.1e-02  7.2e-01    2.0   3.0   4.0   3998     9122    1.0
n_leapfrog__      6.9  5.0e-02  3.6e+00    3.0   7.0    15   5303    12101   1.00
n_divergent__   0.015  1.7e-03  1.2e-01   0.00  0.00  0.00   5494    12537    1.0
beta_raw[1]      0.65  1.2e-02  7.2e-01  -0.62  0.70   1.7   3665     8364    1.0
beta_raw[2]      0.73  1.3e-02  7.5e-01  -0.46  0.72   2.0   3232     7376    1.0
beta_raw[3]      0.66  1.3e-02  8.0e-01  -0.67  0.69   1.9   3555     8112    1.0
tau               4.6  4.2e-01  1.5e+01   0.15   1.5    17   1225     2796    1.0
beta[1]           4.1  4.6e-01  1.7e+01  -0.34   1.1    15   1460     3332   1.00
beta[2]           3.4  5.2e-01  2.0e+01  -0.44   1.0    11   1502     3428    1.0
beta[3]           4.2  3.9e-01  1.5e+01  -0.60  0.88    17   1582     3610   1.00

Stretch Ensemble:

                Mean     MCSE   StdDev    5%   50%   95%  N_Eff  N_Eff/s  R_hat
lp__            -9.1  2.1e-02  5.3e-01   -10  -9.1  -8.3    649      802    1.0
accept_stat__   0.57  1.8e-03  1.3e-01  0.35  0.58  0.79   5631     6952   1.00
scale__          2.0  9.0e-16  9.0e-14   2.0   2.0   2.0  10000    12345   1.00
beta_raw[1]     0.61  1.8e-02  2.6e-01  0.19  0.61   1.0    204      252    1.0
beta_raw[2]     0.77  1.6e-02  2.5e-01  0.34  0.78   1.2    229      283    1.0
beta_raw[3]     0.63  2.1e-02  2.8e-01  0.15  0.64   1.1    177      218    1.0
tau              9.3  1.4e+00  3.9e+01   1.3   3.3    26    814     1005    1.0
beta[1]          5.7  7.6e-01  2.4e+01  0.40   2.1    16    959     1183   1.00
beta[2]          7.1  9.8e-01  3.0e+01  0.75   2.5    21    961     1186   1.00
beta[3]          5.9  8.0e-01  2.7e+01  0.34   2.1    16   1117     1379   1.00

Walk Ensemble:

                Mean     MCSE   StdDev     5%   50%   95%  N_Eff  N_Eff/s  R_hat
lp__            -9.1  4.6e-02  5.7e-01    -10  -9.1  -8.3    154      167   1.00
accept_stat__   0.23  2.7e-03  1.2e-01  0.043  0.23  0.45   1971     2136    1.0
scale__          2.0  9.0e-16  9.0e-14    2.0   2.0   2.0  10000    10839   1.00
beta_raw[1]     0.67  1.7e-02  2.5e-01   0.25  0.68   1.1    218      236    1.0
beta_raw[2]     0.73  1.4e-02  2.6e-01   0.30  0.73   1.2    360      390    1.0
beta_raw[3]     0.65  1.5e-02  2.8e-01   0.20  0.65   1.1    318      345    1.0
tau              8.0  1.6e+00  2.5e+01    1.2   3.2    31    233      252    1.0
beta[1]          5.9  1.3e+00  2.0e+01   0.46   2.2    21    251      272    1.0
beta[2]          5.6  9.5e-01  1.5e+01   0.64   2.3    19    249      270    1.0
beta[3]          5.8  1.3e+00  2.3e+01   0.38   2.1    21    295      320    1.0

betanalpha commented 10 years ago

You can see that it’s better for small data but still not great. But you’re also still getting divergences (when the sampler is working there should be no divergences at all), which would indicate that the sampler is failing. At this point you have to force the step size to be smaller (either by hand or by increasing the target acceptance probability). Check out the experiments in the HMC for hierarchical models paper (http://arxiv.org/abs/1312.0906) where I had to push the step size down to ensure reasonable fits as I scanned through the pseudo data size.


PeterLi2016 commented 10 years ago

Ok, so I tried playing with adapt delta. With the defaults, HMC gives:

                    Mean     MCSE   StdDev     5%    50%   95%  N_Eff  N_Eff/s  R_hat
lp__            -4.4e+00  3.3e-02  1.6e+00   -7.5   -4.1  -2.5   2209     5865    1.0
accept_stat__    8.8e-01  4.5e-03  2.2e-01   0.33   0.97   1.0   2258     5994    1.0
stepsize__       3.9e-01  1.7e-14  1.2e-14   0.39   0.39  0.39   0.50      1.3   1.00
treedepth__      2.8e+00  8.1e-03  6.1e-01    2.0    3.0   4.0   5558    14754   1.00
n_leapfrog__     6.6e+00  3.6e-02  3.0e+00    3.0    7.0    15   6984    18540   1.00
n_divergent__    7.3e-03  1.2e-03  8.5e-02   0.00   0.00  0.00   5329    14145   1.00
beta_raw[1]     -2.8e-01  1.2e-02  8.4e-01   -1.7  -0.28   1.1   4951    13143   1.00
beta_raw[2]      3.8e-01  1.3e-02  8.7e-01   -1.1   0.39   1.8   4554    12087   1.00
beta_raw[3]      5.9e-01  1.4e-02  9.7e-01   -1.0   0.62   2.1   4540    12052   1.00
tau              4.2e+00  1.1e+00  4.6e+01  0.076   0.92    11   1716     4556    1.0
beta[1]         -1.7e+00  4.3e-01  2.2e+01   -5.6  -0.18   1.2   2704     7178    1.0
beta[2]          2.5e+00  6.4e-01  2.9e+01  -0.97   0.25   7.2   2057     5460    1.0
beta[3]          4.3e+00  1.3e+00  5.4e+01  -0.66   0.45    12   1784     4734    1.0

with adapt delta = .9 HMC gives:

                 Mean     MCSE   StdDev     5%    50%   95%  N_Eff  N_Eff/s  R_hat
lp__             -4.4  3.0e-02  1.5e+00   -7.4   -4.1  -2.5   2669     6539    1.0
accept_stat__    0.90  4.8e-03  1.9e-01   0.45   0.98   1.0   1553     3805    1.0
stepsize__       0.33  5.5e-16  3.9e-16   0.33   0.33  0.33   0.50      1.2   1.00
treedepth__       3.1  1.1e-02  6.8e-01    2.0    3.0   4.0   3772     9241   1.00
n_leapfrog__      8.1  5.4e-02  3.9e+00    3.0    7.0    15   5288    12956   1.00
n_divergent__   0.013  4.4e-03  1.2e-01   0.00   0.00  0.00    697     1708    1.0
beta_raw[1]     -0.28  1.2e-02  8.3e-01   -1.6  -0.28   1.1   4383    10736    1.0
beta_raw[2]      0.36  1.4e-02  8.6e-01   -1.1   0.37   1.8   3865     9469   1.00
beta_raw[3]      0.59  1.7e-02  9.9e-01   -1.1   0.61   2.2   3496     8564    1.0
tau                11  6.2e+00  1.3e+02  0.077   0.94    14    414     1015    1.0
beta[1]          0.34  1.7e+00  4.7e+01   -5.6  -0.18   1.3    714     1749    1.0
beta[2]           2.4  6.0e-01  3.2e+01   -1.1   0.24   7.4   2881     7057    1.0
beta[3]            13  7.5e+00  1.6e+02  -0.68   0.45    15    472     1155    1.0

and with adapt delta=.99 HMC gives:

                    Mean     MCSE   StdDev     5%    50%   95%  N_Eff  N_Eff/s  R_hat
lp__            -4.4e+00  3.0e-02  1.6e+00   -7.5   -4.1  -2.5   2675     3734   1.00
accept_stat__    9.7e-01  2.0e-03  9.3e-02   0.89   1.00  1.00   2149     3000    1.0
stepsize__       1.6e-01  1.3e-14  9.5e-15   0.16   0.16  0.16   0.50     0.70   1.00
treedepth__      4.1e+00  8.4e-03  7.0e-01    3.0    4.0   5.0   6842     9552   1.00
n_leapfrog__     1.8e+01  9.2e-02  8.1e+00    7.0     15    31   7785    10869   1.00
n_divergent__    4.3e-03  1.2e-03  6.5e-02   0.00   0.00  0.00   2752     3842   1.00
beta_raw[1]     -3.0e-01  1.3e-02  8.4e-01   -1.7  -0.31   1.1   3926     5481    1.0
beta_raw[2]      3.7e-01  1.3e-02  8.8e-01   -1.1   0.38   1.8   4415     6164    1.0
beta_raw[3]      5.7e-01  1.5e-02  9.7e-01   -1.1   0.59   2.1   4401     6144    1.0
tau              4.8e+00  9.7e-01  3.9e+01  0.072   0.93    11   1622     2265    1.0
beta[1]         -2.5e+00  7.9e-01  3.5e+01   -6.3  -0.20   1.1   1967     2746    1.0
beta[2]          2.7e+00  6.5e-01  2.6e+01  -0.91   0.27   7.9   1589     2219    1.0
beta[3]          4.9e+00  1.2e+00  5.4e+01  -0.68   0.40    13   2151     3004    1.0

where correct values are

beta_actual <- 
c(1.25749972917763, 1.11573450571151, 1.41430662995209)
tau_actual <- 2

As a side note, is it better if I generate beta ~ normal(mu, sigma) or just keep what I have now (beta ~ normal(0, tau))?

bob-carpenter commented 10 years ago

You can't expect to get the exact parameters back with just a couple of samples.

Just consider drawing two samples from normal(0,1). The MLE will just be their average, and that's unlikely to be 0. Similarly, the Bayesian posterior is unlikely to be centered on 0.

It's not until you get a whole lot of samples that you can expect to be near 0 in your MLE or posterior mean/median.

What you're looking for is N% of the values being in the N% posterior intervals. And this is pretty close, with 3 out of 4 and one very close to being in the 95% interval.

For this project, you might need to stick to using more data. Even then, you can't expect to exactly recover the simulation (fake data) parameter values --- you'll just have to run HMC long enough to get a good answer you can believe in and then use that as a baseline.

P.S. There's also huge skew --- look at how far the medians are from the means.
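The interval check described above is easy to script once per-run summaries are collected. A minimal sketch, assuming arrays of interval endpoints and the simulation-true values (names are hypothetical):

import numpy as np

def coverage(lower, upper, truth):
    """Fraction of simulation-true values falling inside their posterior
    intervals; for the 90% central intervals implied by the 5%/95%
    summary columns this should come out near 0.90 across many runs."""
    lower, upper, truth = map(np.asarray, (lower, upper, truth))
    return np.mean((lower <= truth) & (truth <= upper))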


betanalpha commented 10 years ago

> For this project, you might need to stick to using more data. Even then, you can't expect to exactly recover the simulation (fake data) parameter values --- you'll just have to run HMC long enough to get a good answer you can believe in and then use that as a baseline.

I agree that it’s fine to start at some reasonable amount of data (increasing the target acceptance rate helps, but the few-datum posteriors are just going to be too hard to sample from for a test like this) but I don’t agree that we can’t recover the simulated parameter values. If (mean - true) / MCSE doesn’t converge to N(0, 1) then we have some serious problems.

Anyways, can you get 4-5 data points per group to fit reasonably well? If so, just start there.
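The criterion stated above can be checked directly across repeated simulations. A minimal sketch, assuming the per-run posterior means and MCSEs have already been collected (names are hypothetical):

import numpy as np

def mcse_zscores(means, mcses, truth):
    """z = (posterior mean - truth) / MCSE; under the stated criterion
    these should look like draws from N(0, 1) across independent runs."""
    return (np.asarray(means) - truth) / np.asarray(mcses)

# Across many runs, the z-scores' mean should be near 0 and their sd near 1.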

PeterLi2016 commented 10 years ago

But the thing is, I'm running these for 10,000 warmup iterations and then 10,000 sampling iterations, so I thought that the results would have converged by then. Instead, with 1 data point per parameter, I was finding a lot of run-to-run variation in the results. In any case, with 4 data points per parameter, the results are much more consistent from run to run and even across the samplers.

The correct values are

beta_actual <- 
c(1.25749972917763, 1.11573450571151, 1.41430662995209)
tau_actual <- 2

HMC gives

                    Mean     MCSE   StdDev        5%       50%   95%  N_Eff  N_Eff/s  R_hat
lp__            -8.6e+00  4.4e-02  1.8e+00  -1.2e+01  -8.2e+00  -6.4   1700     3787   1.00
accept_stat__    7.9e-01  4.8e-03  2.7e-01   1.2e-01   9.2e-01   1.0   3136     6986   1.00
stepsize__       4.7e-01  2.9e-15  2.1e-15   4.7e-01   4.7e-01  0.47   0.50      1.1   1.00
treedepth__      2.5e+00  8.7e-03  6.4e-01   1.0e+00   3.0e+00   3.0   5342    11900   1.00
n_leapfrog__     5.3e+00  3.3e-02  2.5e+00   1.0e+00   7.0e+00   7.0   5654    12595   1.00
n_divergent__    6.0e-04  2.4e-04  2.4e-02   0.0e+00   0.0e+00  0.00  10000    22279    1.0
beta_raw[1]     -5.4e-02  1.1e-02  6.7e-01  -1.2e+00  -1.7e-02  0.98   3432     7647   1.00
beta_raw[2]      3.3e-01  1.3e-02  7.6e-01  -8.3e-01   2.7e-01   1.7   3599     8018   1.00
beta_raw[3]      1.1e+00  1.4e-02  8.1e-01  -2.2e-01   1.1e+00   2.4   3546     7901   1.00
tau              1.3e+00  3.1e-02  1.2e+00   1.5e-01   1.0e+00   3.8   1562     3480   1.00
beta[1]          7.5e-02  2.1e-02  8.7e-01  -1.0e+00  -9.5e-03   1.4   1773     3950   1.00
beta[2]          2.5e-01  1.6e-02  9.0e-01  -1.2e+00   2.2e-01   1.7   3288     7324   1.00
beta[3]          1.6e+00  4.9e-02  2.0e+00  -6.4e-02   1.1e+00   5.0   1580     3521   1.00

Stretch Ensemble gives

                    Mean     MCSE   StdDev        5%       50%   95%  N_Eff  N_Eff/s  R_hat
lp__            -1.2e+01  4.6e-02  6.1e-01  -1.3e+01  -1.2e+01   -12    178      203    1.0
accept_stat__    5.8e-01  1.7e-03  1.3e-01   3.6e-01   5.8e-01  0.79   5802     6590    1.0
scale__          2.0e+00  9.0e-16  9.0e-14   2.0e+00   2.0e+00   2.0  10000    11358   1.00
beta_raw[1]     -3.5e-02  2.5e-02  2.3e-01  -4.2e-01  -3.0e-02  0.34     86       97    1.0
beta_raw[2]      3.6e-01  2.3e-02  2.4e-01  -1.5e-02   3.5e-01  0.76    110      125    1.0
beta_raw[3]      1.1e+00  1.7e-02  2.5e-01   7.2e-01   1.1e+00   1.5    212      241   1.00
tau              1.4e+00  3.5e-02  4.6e-01   7.8e-01   1.3e+00   2.2    167      190    1.0
beta[1]         -3.3e-02  3.3e-02  3.0e-01  -5.2e-01  -3.9e-02  0.47     83       94    1.0
beta[2]          4.7e-01  3.0e-02  3.4e-01  -2.3e-02   4.5e-01   1.0    130      147    1.0
beta[3]          1.6e+00  4.7e-02  6.3e-01   7.4e-01   1.5e+00   2.7    186      211    1.0

Walk Ensemble gives

                    Mean     MCSE   StdDev        5%       50%   95%  N_Eff  N_Eff/s  R_hat
lp__            -1.2e+01  5.1e-02  6.4e-01  -1.4e+01  -1.2e+01   -12    160      168   1.00
accept_stat__    2.4e-01  2.6e-03  1.2e-01   4.8e-02   2.3e-01  0.46   2206     2317    1.0
scale__          2.0e+00  9.0e-16  9.0e-14   2.0e+00   2.0e+00   2.0  10000    10500   1.00
beta_raw[1]     -5.8e-02  1.2e-02  2.2e-01  -4.4e-01  -5.3e-02  0.30    348      365    1.0
beta_raw[2]      3.3e-01  1.4e-02  2.5e-01  -6.9e-02   3.2e-01  0.76    320      336    1.0
beta_raw[3]      1.1e+00  1.5e-02  2.6e-01   6.4e-01   1.1e+00   1.5    324      340    1.0
tau              1.4e+00  3.4e-02  4.8e-01   7.8e-01   1.3e+00   2.2    194      204    1.0
beta[1]         -5.9e-02  1.5e-02  3.0e-01  -5.4e-01  -7.2e-02  0.45    381      401   1.00
beta[2]          4.4e-01  1.9e-02  3.6e-01  -1.0e-01   4.2e-01   1.0    371      389    1.0
beta[3]          1.5e+00  4.2e-02  6.5e-01   6.3e-01   1.4e+00   2.7    242      254    1.0

bob-carpenter commented 10 years ago

> I agree that it’s fine to start at some reasonable amount of data (increasing the target acceptance rate helps, but the few-datum posteriors are just going to be too hard to sample from for a test like this) but I don’t agree that we can’t recover the simulated parameter values. If (mean - true) / MCSE doesn’t converge to N(0, 1) then we have some serious problems.

Isn't that only true as the number of data points approaches infinity? Maybe I'm misunderstanding.

Suppose you have a model with known scale:

y ~ normal(mu,1);

and no prior on mu. Now suppose you simulate a couple data points

y_sim ~ normal(0,1);

If y_sim = {-1, -2}, how is (mean - 0) / MCSE going to converge to 0? In fact, that quantity looks like it'll diverge, because mean << 0, and MCSE -> 0.

betanalpha commented 10 years ago

This kind of behavior is exactly why geometric ergodicity is so important. Any MCMC estimate will converge to the true answer with an infinite number of iterations, but without geometric ergodicity we don't have any idea of how the estimate converges. In particular, it could wander for millions of iterations or thrash between two values and never really converge for any finite set of MCMC samples. Geometric ergodicity gives us the Monte Carlo Central Limit Theorem, which gives meaning to the Monte Carlo Standard Error.

Long story short, if you see divergences then you can't expect expectations to converge to the true values in any reasonable amount of time. Hopefully we'll have a paper on this out in a few months.
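For reference, the standard statement behind this (not specific to this thread): for a geometrically ergodic chain and square-integrable f,

\hat{f}_N = \frac{1}{N} \sum_{n=1}^{N} f(\theta^{(n)})
  \;\xrightarrow{d}\; \mathcal{N}\!\left( \mathbb{E}_{\pi}[f],\; \mathrm{MCSE}^2 \right),
\qquad
\mathrm{MCSE} = \sqrt{ \frac{ \mathrm{Var}_{\pi}[f] }{ N_{\mathrm{eff}} } },

which is what justifies reporting an MCSE at all.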


bob-carpenter commented 10 years ago

But that didn't answer my question.

The "true" answer for the posterior isn't the simulated parameters, but whatever's determined from the data. Now the true parameters should be in the 95% posterior intervals 95% of the time, but I think that's very different from what you're suggesting.

The divergences they're seeing are like 1 every 1000 iterations. Is that too many?


betanalpha commented 10 years ago

> But that didn't answer my question.

> The "true" answer for the posterior isn't the simulated parameters, but whatever's determined from the data. Now the true parameters should be in the 95% posterior intervals 95% of the time, but I think that's very different from what you're suggesting.

Ah, I see what you’re saying. There’s no guarantee that Bayesian probabilities will have proper Frequentist coverage, so you can’t rely on a test like that. Especially with strong priors.

> The divergences they're seeing are like 1 every 1000 iterations. Is that too many?

It’s tough — technically any divergences signal problems, but the resulting bias might be much smaller than the MCSE. The Rhat test discussed in the hierarchical paper might be the most rigorous test here.

PeterLi2016 commented 10 years ago

Ok, so the graphics are uploaded here: https://drive.google.com/folderview?id=0B_zAlcJ3ybbwQXdhcnZiaE9YWEU&usp=sharing

I'll try running the hier_logistic model with the rest of the data files this weekend--I may end up reducing the number of runs or removing the data files with the largest number of samples per data file if it takes too long.

bob-carpenter commented 10 years ago

Or you can cut down on the number of variations you do.
Having said that, the variation with data size is very interesting.

I'm most curious to see what happens when the number of parameters gets bigger and the ensembles need more particles.


PeterLi2016 commented 10 years ago

Ok, so I ran the models this weekend and I was only able to get 3, 5, 9, 17 parameters (with 2^(1+1:6) data points per parameter) for all three samplers with 100 runs each. The plots are here.

I'm going to need to reconsider the scale of the number of parameters and number of data per parameter and number of runs. Any thoughts?

bob-carpenter commented 10 years ago

The link didn't work for me.

The only thought I have is that you could run on the cluster. I don't know how to do it, but I think Ben does. And Daniel might. We might also think of moving our intense distribution tests there if they are going to take two days up to third order.

You could also jump up more, as in multiplying by 4 or 8, say trying 65 or 129 parameters next. I'm curious as to what's going to happen when the number of parameters increases beyond a handful (and footful).


PeterLi2016 commented 10 years ago

That's odd. How about this one: https://drive.google.com/folderview?id=0B_zAlcJ3ybbwQXdhcnZiaE9YWEU&usp=drive_web ? The Google Drive settings are set so anyone with the link can access it.

I'll try running the larger models, though it might take a while.


betanalpha commented 10 years ago

You should play around with the memory comparisons — the zero offset is misleading. Certainly use some big unit, say n * 500 MB, or define some default process and normalize everything to that.

The N_eff/s for tau for the walkers is hugely skewed by large values which are almost definitely errors in the calculation (probably the chain getting stuck and giving negligible variances). At the very least filter out the tail events (or just define some max at, say, 1000) so we can see the good stuff.

The large k (parameters), large N (data) limit isn’t particularly relevant to practice, which is typically large k, small N, so I think it’s fair not to run tests out there; but if you do, consider running a centered-parameterization HMC, too, which will perform much better.


bob-carpenter commented 10 years ago

That one worked. Thanks.


bob-carpenter commented 10 years ago

This is great. You can see the relevant issues already with what you have. As the number of parameters increases, you can see NUTS start to dominate. With few parameters, you can see the crossover where the relative speeds change based on the number of data points per parameter. (And as Michael says, it'd be nice to see the centered vs. non-centered parameterization comparisons, as well.)

This also shows how relatively bad HMC is at estimating the hierarchical variance. I don't know if there are other parameterization tricks to help with this, but it's problematic for us as is, I think.

Would it make sense to use ratios of n_eff/s or perhaps better, s/n_eff, rather than differences?

It'd also be great if you could get tau into the summary plots of parameter fits.

I'm also surprised all the RAM usage is so close. Are you running through R or some other system that involves saving the samples? I wouldn't expect them to be so close otherwise.

Also, what is warmup time measuring? Just the raw amount of time spent in the warmup phase? I think it would be nice for at least a couple of cases to get an idea of how much more quickly HMC converges. If you use 1000 warmup iterations for both, that's probably overkill for HMC, so comparing basic warmup times isn't so informative when the convergence happens early on during warmup.
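The ratio comparison suggested above is a one-liner once the summary numbers are in hand; a minimal sketch (names are hypothetical):

def sec_per_eff_sample(elapsed_s, n_eff):
    """Seconds per effective sample (s/n_eff); comparing samplers by the
    ratio of these costs is scale-free, unlike a difference."""
    return elapsed_s / n_eff

# e.g. cost ratio of NUTS vs. walk ensemble for one parameter:
# sec_per_eff_sample(t_nuts, neff_nuts) / sec_per_eff_sample(t_walk, neff_walk)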


PeterLi2016 commented 10 years ago

Ok, I've got everything working on the clusters now, so hopefully the hierarchical logistic models will be done running in a few days.

For the hierarchical linear model, I'm finding that, even for a reasonable number of data points per parameter, stretch/walk ensemble and HMC are both converging to similar values, like

for HMC:

                    Mean     MCSE   StdDev        5%       50%    95%  N_Eff  N_Eff/s  R_hat
lp__            -1.8e+02  2.3e-01  2.2e+00  -1.9e+02  -1.8e+02   -180     92      108    1.0
accept_stat__    8.2e-01  2.7e-02  2.8e-01   8.8e-02   9.6e-01    1.0    110      129   1.00
stepsize__       3.2e-01  1.1e-15  7.8e-16   3.2e-01   3.2e-01   0.32   0.50     0.59   1.00
treedepth__      3.1e+00  5.8e-02  8.1e-01   2.0e+00   3.0e+00    4.0    197      230    1.0
n_leapfrog__     8.7e+00  2.9e-01  4.8e+00   3.0e+00   7.0e+00     15    265      311    1.0
n_divergent__    9.0e-03  3.3e-03  9.4e-02   0.0e+00   0.0e+00   0.00    830      972    1.0
beta_raw[1]      4.9e-01  3.4e-02  7.1e-01  -7.1e-01   4.8e-01    1.7    445      521    1.0
beta_raw[2]      5.0e-02  4.5e-02  8.1e-01  -1.4e+00   7.7e-02    1.4    325      381   1.00
beta_raw[3]     -1.6e-01  3.2e-02  7.2e-01  -1.3e+00  -1.7e-01    1.1    525      614    1.0
tau              8.5e-02  1.1e-02  9.0e-02   5.2e-03   5.2e-02   0.30     67       79    1.0
sigma            9.7e-01  1.7e-03  3.5e-02   9.1e-01   9.6e-01    1.0    425      498    1.0
beta[1]          3.7e-02  2.5e-03  4.2e-02  -1.6e-02   3.1e-02   0.11    291      341   1.00
beta[2]          6.4e-03  2.6e-03  4.9e-02  -6.5e-02   2.8e-03  0.093    354      415    1.0
beta[3]         -9.9e-03  2.3e-03  4.7e-02  -9.4e-02  -6.0e-03  0.064    429      503    1.0

and stretch

                    Mean     MCSE   StdDev        5%       50%    95%  N_Eff  N_Eff/s  R_hat
lp__            -5.4e+02  6.2e-02  6.6e-01  -5.4e+02  -5.4e+02   -540    114       59   1.00
accept_stat__    5.1e-01  1.6e-03  1.2e-01   3.1e-01   5.1e-01   0.71   5441     2804    1.0
scale__          2.0e+00  9.0e-16  9.0e-14   2.0e+00   2.0e+00    2.0  10000     5154   1.00
beta_raw[1]      5.3e-01  2.3e-02  2.2e-01   1.8e-01   5.2e-01   0.89     93       48   1.00
beta_raw[2]      3.0e-02  2.4e-02  2.5e-01  -3.9e-01   2.9e-02   0.44    107       55    1.0
beta_raw[3]     -9.1e-02  2.4e-02  2.5e-01  -4.8e-01  -1.0e-01   0.34    107       55    1.0
tau              8.6e-02  5.9e-03  4.4e-02   4.1e-02   7.5e-02   0.16     54       28    1.0
sigma            9.7e-01  8.3e-04  1.0e-02   9.5e-01   9.7e-01   0.99    149       77    1.0
beta[1]          4.5e-02  3.3e-03  2.9e-02   1.2e-02   3.9e-02  0.096     76       39    1.0
beta[2]          3.1e-03  2.3e-03  2.3e-02  -3.2e-02   2.1e-03  0.040    108       56    1.0
beta[3]         -7.5e-03  2.3e-03  2.3e-02  -4.3e-02  -7.8e-03  0.027    101       52    1.0

where the correct values are

beta_actual <- 
c(-2.54587437058592, -2.81109369156363, -2.24446230109798)

And I generated the data similarly to how I did for logistic regression:


library(rstan)  # for stan_rdump

n.data.per.param <- 2^(1 + 1:6)
for (j in 1:8) {
    # cholesky.corr: list of Cholesky factors of the target correlation
    # matrices (defined elsewhere)
    P <- ncol(cholesky.corr[[j]]) + 1
    tau_actual <- 2
    beta_actual <- rnorm(P, 0, tau_actual)
    sigma_actual <- 0.1

    for (k in 1:length(n.data.per.param)) {
        num.iter <- P * n.data.per.param[k]
        x <- matrix(data=NA, ncol=P, nrow=num.iter)

        # correlated predictors: standard normal draws rotated by the
        # Cholesky factor, plus an intercept column
        for (i in 1:num.iter) {
            temp.norm <- rnorm(P-1, mean=0, sd=1)
            cor.rv <- cholesky.corr[[j]] %*% temp.norm
            x[i,] <- c(1, cor.rv)
        }

        z.temp <- x %*% beta_actual
        # NB: rnorm(n, mean, sd) -- as written this draws length(z.temp)
        # values with mean sigma_actual and sd 1, i.e. noise around 0.1
        # rather than around x %*% beta_actual
        z <- rnorm(z.temp, sigma_actual)

        N <- length(z)
        n <- ncol(x)

        list <- c("x", "z", "N", "n", "beta_actual", "tau_actual", "sigma_actual")
        stan_rdump(list, file=paste("hier_linear_", j, "_", k, ".data.R", sep=""))
    }
}

bob-carpenter commented 10 years ago


I take it the StdDev and 5% and 95% are for the ensemble means?

Why are the values of beta so close to 0 when the actual values are close to -2.5?

PeterLi2016 commented 10 years ago

> I take it the StdDev and 5% and 95% are for the ensemble means?

Yes.

> Why are the values of beta so close to 0 when the actual values are close to -2.5?

I have no idea. This was with a decent amount of data points per parameter too--I think around 8.


PeterLi2016 commented 10 years ago

I'm uploading the graphs here. It might take an hour or so for it to finish uploading.

The first tier of numbers for the folders represents the number of beta parameters (i.e. 1 + 2^number).

The second tier of numbers for the folders represents the number of data points per parameter (i.e. 2^(1+number))

Some of the ensemble runs still took too much wall time, so some graphs show only HMC data; these are in the respective param folders. For those runs, the ram, sampling_times, and warmup_times folders may still show ensemble data, but that is an error I still need to fix. The summary folders are empty since I wasn't sure how best to combine the plots for all of the parameters into one -- I can't fit them all on one page anymore now that there are so many more parameters in some cases.

A few other notes: NUTS pretty much always converged. The stretch ensemble mostly converged, though it typically took far more iterations and wall time than NUTS. The walk ensemble failed to converge in a lot of cases (with the same number of iterations as the stretch ensemble); this typically happened once the number of parameters reached 9 or more.

betanalpha commented 10 years ago

Sweeeeeet.


andrewgelman commented 10 years ago

This will soon be a published paper, I'm sure…

betanalpha commented 9 years ago

After working some sweet, sweet template magic, the refactor compile time is on par with the develop compile time (both just under 12 seconds for a simple model with very little autodiff to compile). BUT we can do better.

Right now there's a nontrivial hit from having to pass a Writer through all of the samplers. But the Writer is used only once, when a Hamiltonian class tries to compute a gradient and spits out the "about to be rejected" error message on failure. In particular, the Hamiltonian class wraps the gradient call in a try/catch block, and if it catches an exception then it sets the potential to infinity (which will allow the sampler to gracefully terminate) and spits out the error message.

In order to avoid a Writer template we'd have to either pass the writer through each call (sample -> transition -> evolve -> components of the leapfrog update -> gradient) or not catch the exception internally (in which case the sampler won't be able to fail as gracefully). Both seem ugly and unhappy.

Anyone have any suggestions on how to sneak an error message out from deep in the code without having an internal reference to a Writer implementation? I dunno, like storing an error string and calling write_error_message after each transition or something?
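
Something like this, maybe (a rough sketch of the buffered-error idea -- the class layout and names like error_msg_ are made up here, not what's in the code):

#include <exception>
#include <limits>
#include <stdexcept>
#include <string>

class base_hamiltonian {
 public:
  double potential() {
    try {
      return minus_log_prob();  // may throw on a failed gradient/log prob
    } catch (const std::exception& e) {
      // Buffer the message instead of writing it through a Writer here.
      error_msg_ = std::string("about to be rejected: ") + e.what();
      // An infinite potential lets the sampler reject the proposal gracefully.
      return std::numeric_limits<double>::infinity();
    }
  }

  // The service layer would call this once after each transition, so no
  // Writer type has to be threaded through the sampler internals at all.
  template <typename Writer>
  void write_error_message(Writer& writer) {
    if (!error_msg_.empty()) {
      writer(error_msg_);
      error_msg_.clear();
    }
  }

 private:
  // Stand-in for the real log prob / gradient call.
  double minus_log_prob() { throw std::runtime_error("non-finite gradient"); }
  std::string error_msg_;
};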

bob-carpenter commented 9 years ago

One thing to do is wrap all the components that need to get plumbed into the depths into a struct so the passing isn't so onerous.
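
Something like this, say (a rough sketch with made-up names, not actual Stan types):

#include <iostream>
#include <ostream>

// Bundle everything that has to be plumbed into the depths into one struct,
// so only a single reference gets passed down the call chain.
struct sampler_io {
  std::ostream& info;   // informational messages
  std::ostream& error;  // error messages
  // ... interrupt callbacks, diagnostic writers, etc.
};

template <typename Sampler>
void transition(Sampler& sampler, sampler_io& io) {
  // evolve(sampler, io); leapfrog(..., io); and so on -- each layer
  // forwards the one struct instead of several separate parameters.
  io.info << "transition complete" << std::endl;
}

// e.g., at the top level: sampler_io io{std::cout, std::cerr};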

I take it it's not possible to just have the Hamiltonian class throw an exception that's caught above?


betanalpha commented 9 years ago

One thing to do is wrap all the components that need to get plumbed into the depths into a struct so the passing isn't so onerous.

It's just one object that needs to be passed, but it has to go through a ton of functions, since the log_prob call is deep inside all of the MCMC code relative to the transition method.

I take it it's not possible just have the Hamiltonian class throw an exception that's caught above?

The problem is that this would mess up the current way the MCMC code fails gracefully. I think it would be possible to play hot potato with the exceptions, but it would require a nontrivial rewrite of the MCMC code.
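
To be concrete about the graceful failure: with the potential set to infinity, the transition still runs to completion and the proposal simply gets rejected at the accept step, roughly like this (a simplified sketch, not the actual code):

#include <cmath>

// h = potential + kinetic energy of a state; h_proposed == +inf forces
// rejection, because exp(h_current - h_proposed) == exp(-inf) == 0.
double accept_prob(double h_current, double h_proposed) {
  double log_alpha = h_current - h_proposed;
  return log_alpha > 0 ? 1.0 : std::exp(log_alpha);
}

An exception thrown all the way out would instead unwind the transition before there is any sample to return.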

bob-carpenter commented 9 years ago

Isn't this commentary on the wrong issue?

And for some reason, it looks like your character encoding is being messed up.

I don't understand the details, but shouldn't throwing an exception have the same behavior as setting the log prob to -inf?


bob-carpenter commented 7 years ago

Rolled into #362