stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

Tweak NUTS algorithm with correct behavior at each subtree #326

Closed syclik closed 10 years ago

syclik commented 10 years ago

[originally pull request #303; closed because the base branch was deleted]

I'm grabbing the relevant comment from the closed pull request.

@betanalpha wrote:

so... what I want to do is: 1) make sure the fix is correct 2) try to prevent this from breaking in the future

I understand, but we don't really have a benchmark for "correct". I've just been comparing to Matt's implementation assuming that it was correct.

To be specific I would add a bunch of verbose outputs and look at the trees as they were built between the two versions of the code. Hacking the RNGs to be in phase and the like wasn't exactly a scalable solution.

If you're convinced this fix is correct, can you spell out:

  • model you're using to test
  • the step size you're using (and if you can write out the full command line parameters you're using)
  • what you're doing with the samples afterwards. Is the proportion of (epsilon * 2^treedepth) that's between 1/4 pi and 3/4 pi for a single chain how you're looking at it?

For an iid gaussian model with unit metric and a fixed, small step size (say 1e-3 or so), the tree should terminate after traversing L = pi / epsilon states. Accounting for the fact that the algorithm won't build a full branch if a subtree fails the NUTS criterion gives the quoted range. Making the step size smaller and tracking the actual number of states in the slice instead of just estimating the number of states based on the tree depth would yield a tighter range of possible values.

This holds for any transition at any point in the iid gaussian case (modulo some step size considerations, but it's easy to seed this case well anyways), so warmup isn't an issue. Run one chain for a while and look at the distribution, or look at the distribution across multiple chains.

If we decide on the correctness of an implementation then we can measure the distribution of integration times and use that to test any future changes.
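For concreteness, a minimal sketch (not from the thread) of the check being described: take the recorded tree depths from a fixed-step-size run on an iid gaussian model and compute the fraction of transitions whose approximate integration time epsilon * 2^treedepth lands in the quoted (pi/4, 3 pi/4) window. The column and file names in the usage comment are hypothetical, not part of the code base.

    import numpy as np

    def integration_time_check(treedepth, epsilon):
        # Fraction of transitions whose approximate integration time
        # epsilon * 2^treedepth falls in (pi/4, 3*pi/4).
        # treedepth: per-iteration tree depths from the sampler output;
        # epsilon: the fixed step size used for the run.
        integration_time = epsilon * 2.0 ** np.asarray(treedepth, dtype=float)
        in_range = (integration_time > np.pi / 4) & (integration_time < 3 * np.pi / 4)
        return float(in_range.mean())

    # Hypothetical usage with a CmdStan-style CSV and epsilon = 1e-3:
    # depths = np.genfromtxt("output.csv", delimiter=",", names=True)["treedepth__"]
    # print(integration_time_check(depths, 1e-3))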

betanalpha commented 10 years ago

So what's the next step? I can write a test but it'll really just be for show given the points raised below.

Also, given that there are significant improvements in the tails of hierarchical models, I have motivation for getting this into a tagged release sooner rather than later so I can quote the version in an upcoming paper.


matthewdhoffman commented 10 years ago

If we have a trusted reference implementation, then can't we just test if the series of samples generated from some model are identical given identical seeds?


betanalpha commented 10 years ago

The issue is (a) what models to test and (b) how to define a "test" given the stochastic nature of the estimates. Not to mention interaction with adaptation and the like.

In particular, I never would have noticed the small bugs in 2.0 without the red herring last weekend and I never would have noticed the nontrivial difference they can have on sampling if I wasn't running many tests on a big model with known ground truth and looking at particularly sensitive estimates. Even then, the difference was pretty small.

I'm in support of setting up a full suite of sampler tests where we decide upon a reasonable means of running ensembles of tests to get around much of the stochasticity and then carefully choose a suite of known pathological models and exactly which estimates to test and what thresholds to use in those tests. But we're not close to that at the moment.

The same amount of testing that went into NUTS 2.0 has been repeated on this fix; in fact, it's been more rigorous given the known issues with the pathological model mentioned above. I just want to patch the existing algorithm with the hot fix as soon as possible given that the current implementation is known to be slightly biased.


syclik commented 10 years ago

I think there are two levels of testing that we need to do. We need unit tests in there as well as higher level tests that check on the stochastic behavior of these algorithms.

On Thu, Oct 24, 2013 at 8:28 PM, Matt Hoffman notifications@github.com wrote:

If we have a trusted reference implementation, then can't we just test if the series of samples generated from some model are identical given identical seeds?

I'm reading this to mean that we should be unit testing. That's running these things with known inputs and comparing them to known outputs. If necessary, we can introduce a mock framework for the random number generator in our test, but I think that's overkill. Setting a seed will be sufficient, I think. The key, here, is that we're not testing stochastic behavior. We want to verify that the code does the right thing and we're going to define the "right thing" by a reference implementation. Yes, it's circular, but what this does is prevent unintentional changes to the underlying implementation.
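As an illustration only (the function and file names below are made up, not part of the code base), a seed-pinned comparison against a stored reference run might look like the following: run the sampler with the model, data, seed, and step size fixed and adaptation off, then compare the draws to the reference draws.

    import numpy as np

    def check_against_reference(draws, reference, tol=1e-12):
        # Compare a matrix of sampler draws (iterations x parameters) to a
        # stored reference run produced with the same model, data, seed, and
        # fixed step size (adaptation off). Any difference beyond floating-point
        # noise flags an unintended change to the implementation.
        draws = np.asarray(draws, dtype=float)
        reference = np.asarray(reference, dtype=float)
        if draws.shape != reference.shape:
            return False
        return bool(np.all(np.abs(draws - reference) <= tol))

    # Hypothetical usage with two runs of the same configuration (paths are
    # placeholders):
    # new = np.loadtxt("new_run.csv", delimiter=",", skiprows=1)
    # ref = np.loadtxt("reference_run.csv", delimiter=",", skiprows=1)
    # assert check_against_reference(new, ref)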

On Fri, Oct 25, 2013 at 3:55 AM, Michael Betancourt < notifications@github.com> wrote:

The issue is (a) what models to test and (b) how to define a "test" given the stochastic nature of the estimates. Not to mention interaction with adaptation and the like.

These are the higher level tests. These are also necessary, but these tests are a lot harder to define.

In particular, I never would have noticed the small bugs in 2.0 without the red herring last weekend and I never would have noticed the nontrivial difference they can have on sampling if I wasn't running many tests on a big model with known ground truth and looking at particularly sensitive estimates. Even then, the difference was pretty small.

So there are two things here: if you've actually fixed a bug, we should be able to write a unit test that demonstrates it. It might be hard to break it down to the unit test level, but we should. The deterministic calculations should be right and tested.

With the stochastic test, this goes back to Matt's point about reference implementation and a particular behavior. If you can specify one big model, one seed, one set of ground truths, we can generate the samples exactly. We should be able to determine for that one model with that one seed that we see the better behavior. Of course, this doesn't give us the confidence that it does the right thing everywhere, but that's where your next paragraph comes in.

I'm in support of setting up a full suite of sampler tests where we decide upon a reasonable means of running ensembles of tests to get around much of the stochasticity and then carefully choose a suite of known pathological models and exactly which estimates to test and what thresholds to use in those tests. But we're not close to that at the moment.

I'm also in support of this. Let's start with one. It doesn't even have to be pathological for it to be a valid test.

One thing you mentioned is the iid gaussian and epsilon * 2^treedepth being between 1/4 pi and 3/4 pi. Under what conditions? I wasn't able to verify that behavior using NUTS with the default configuration. So if you were able to see it, you've got to spell it out so I can recreate it.

The same amount of testing that went into NUTS 2.0 has been repeated on this fix; in fact, it's been more rigorous given the known issues with the pathological model mentioned above.

I believe you when you say you've tested NUTS for v2.0.0 and for hotfix/fix_nuts. I also believe that you've tested it more rigorously.

What I want is to capture what you've done to convince yourself that it's tested and correct. I want to get that into a repeatable test and into our code base. As of right now, you're our test. If I made any change to any part of the implementation, the most reliable way to know whether or not it breaks anything is to wait for you to give me the thumbs up or thumbs down. Let's just pull that into the test suite somehow.

I just want to patch the existing algorithm with the hot fix as soon as possible given that the current implementation is known to be slightly biased.

Ok. Can you give me a model (or more if you've got them), the full configuration (including random seed), and how to determine the bias? If you've got that, I will write a test that checks that every time we run the full test suite.


With all of that said, I'm guilty of creating some tests that are useless, that we ignore all the time. I need to fix the model tests, which we actually should be able to do now. It looks like we're sampling well enough with Bob's fix to multi-normal. The models that used to not converge now pretty much converge within the given number of iterations. This is good. I'll get on that too.


betanalpha commented 10 years ago

so very long; didn't read:

Taking the hotfix as the correct behavior we can build an "alarm if anything changes" test (as opposed to an "alarm if anything is wrong" test) using a constant seed and adaptation turned off. Technically it still couples all of the HMC related code so it wouldn't fully isolate the NUTS algorithm, but I think it's good enough for Daniel's goals.

We can also add a crude test of statistical validity, but it's gonna be super crude and personally I would rather build up a proper statistical testing framework.

Comments inline.

I'm reading this to mean that we should be unit testing. That's running these things with known inputs and comparing them to known outputs. If necessary, we can introduce a mock framework for the random number generator in our test, but I think that's overkill. Setting a seed will be sufficient, I think. The key, here, is that we're not testing stochastic behavior. We want to verify that the code does the right thing and we're going to define the "right thing" by a reference implementation. Yes, it's circular, but what this does is prevent unintentional changes to the underlying implementation.

I think we're all agreed -- the problem is that we don't have that valid reference, at least not without going all the way back to v1.3.0. Yes, this means that we really should have put in tests before v2.0.

The only point I'm trying to make is that we should fix the current known issue with the NUTS implementation and then build some proper tests, rather than throw a test up that will pass tautologically.

So there are two things here: if you've actually fixed a bug, we should be able to write a unit test that demonstrates it. It might be hard to break it down to the unit test level, but we should. The deterministic calculations should be right and tested.

But here's the issue -- NUTS in 2.0 isn't exactly the same as it is in 1.3. The refactor to enable NUTS for RHMC uses an ever so slightly different termination criterion, the difference between the two being O(epsilon^{2}). Statistically this makes no real difference, but it can change the exact construction of a tree enough to prevent a deterministic test. Once we settle on the correct behavior for the new implementation then we can use it as a baseline for a deterministic tree construction test, but right now we have no baseline.

Of course there's also the issue of slightly different RNG behavior and the like that makes a deterministic comparison of a full chain difficult (recall the fact that the Boost RNGs burn through a sample whenever an interface gets constructed, so it's almost impossible to compare v1.3 to v2.0 at face value).

With the stochastic test, this goes back to Matt's point about reference implementation and a particular behavior. If you can specify one big model, one seed, one set of ground truths, we can generate the samples exactly. We should be able to determine for that one model with that one seed that we see the better behavior. Of course, this doesn't give us the confidence that it does the right thing everywhere, but that's where your next paragraph comes in.

I think this is more in line with the stan_csv_reader tests you wrote -- they don't test for correct behavior, just any change in behavior. Fundamentally I dislike their purpose, but in practice I do see how they serve to alert to any unintended changes.

Given a proper reference behavior I am not against implementing some of these.

I'm also in support of this. Let's start with one. It doesn't even have to be pathological for it to be a valid test.

One thing you mentioned is the iid gaussian and epsilon * 2^treedepth being between 1/4 pi and 3/4 pi. Under what conditions? I wasn't able to verify that behavior using NUTS with the default configuration. So if you were able to see it, you've got to spell it out so I can recreate it.

Right -- you have to drop the step size to 0.01 or 0.001 and force the same initial p and q. I did it by hacking the code; we would have to refactor the v2.0 code a bit to automate this, but it's not impossible.

But note also that the results weren't EXACTLY the same because of the aforementioned update to the algorithm details. I tested the two by comparing the trees subtree-by-subtree and being able to judge whether any differences were due to the expected differences or not.

Ok. Can you give me a model (or more if you've got them), the full configuration (including random seed), and how to determine the bias? If you've got that, I will write a test that checks that every time we run the full test suite.

So I was using a 100+1 dimensional funnel and looking at the 5% quantile of the latent parameter v. There are two issues in making this a proper test: first, you can't run the funnel out of the box (instead you have to set the step size to a small value by hand), and second, I have no idea what the Monte Carlo sampling distributions for the quantiles are, so I have no idea how we would build a proper test, even when you know the right values.

In other words, right now we can test that the 5% quantile is at -5 and the 95% is at 5, but the actual numerical threshold is up in the air. Something like 0.1 would have caught the bias here.
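To make that concrete, here is a hedged sketch of the quantile check, assuming the usual funnel parameterization with v ~ normal(0, 3), so the exact 5% / 95% quantiles are roughly -4.93 and +4.93 (the "-5 and 5" above, rounded). The 0.1 tolerance is the rough threshold floated here, not a calibrated one.

    import numpy as np

    # 95th percentile of a standard normal; for v ~ normal(0, 3) the 5% / 95%
    # quantiles are -/+ 3 * 1.6448536 ~= -/+ 4.93.
    Q95_STD_NORMAL = 1.6448536
    V_Q05, V_Q95 = -3.0 * Q95_STD_NORMAL, 3.0 * Q95_STD_NORMAL

    def funnel_quantile_check(v_draws, tol=0.1):
        # Compare the empirical 5% and 95% quantiles of the latent parameter v
        # against their analytic values; tol is the ad hoc threshold discussed
        # above, pending a proper sampling distribution for these estimates.
        q05, q95 = np.percentile(np.asarray(v_draws, dtype=float), [5.0, 95.0])
        return abs(q05 - V_Q05) <= tol and abs(q95 - V_Q95) <= tol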

matthewdhoffman commented 10 years ago

I'm reading this to mean that we should be unit testing. That's running these things with known inputs and comparing them to known outputs. If necessary, we can introduce a mock framework for the random number generator in our test, but I think that's overkill. Setting a seed will be sufficient, I think. The key, here, is that we're not testing stochastic behavior. We want to verify that the code does the right thing and we're going to define the "right thing" by a reference implementation. Yes, it's circular, but what this does is prevent unintentional changes to the underlying implementation.

Exactly. I agree that we need a framework for testing new sampling algorithms for correctness, but I don't think that should be run as a unit test. A sampling algorithm should not be included in Stan until we have tested that it converges to the target distribution. But once we've verified that once, why should we bother testing the stochastic behavior of the algorithm? The point of a unit test is to verify that the code isn't broken—if we define "broken" to mean "does not generate the samples the reference implementation did when given the same model, data, tunable parameters, and seed" then we don't need to worry about doing Monte Carlo verification, which is inherently expensive and problematic.

To summarize, what seems to me to make the most sense is to:

  1. Have a framework/process for verifying correctness of new sampling algorithms that are candidates for inclusion in Stan and
  2. Have unit tests that ensure that implementations of already verified sampling algorithms have not been broken. I'm defining a "new" sampling algorithm pretty liberally to mean any new algorithm or implementation of an existing algorithm that for whatever reason can't pass an existing unit test based on samples being exactly identical. For example, if you permute the order of calls to the RNG, that qualifies as a "new" algorithm, and requires the more stringent verification process.

As far as #1 goes, I'm in favor of making this as simple as possible. My proposal would be to

  1. Run proposed algorithm for a long time (maybe 1M samples?) on a model with known statistics—say an overlapping two-component mixture of 2D Gaussians* (I'm against using a single Gaussian, since we want to make sure we handle asymmetrical distributions correctly).
  2. Compute first and second central moments of the sample.
  3. Estimate ESS and standard deviations of those statistics, compute MCMC standard errors.
  4. Check that our estimates from (2) are consistent with the known statistics up to 2 or 3 standard errors. (A rough sketch of these checks appears after the model below.)

My feeling is that that's going to be more robust than trying to look at things like distributions of integration times on a Gaussian etc., and that only a really pathologically weird bug is going to get the first and second moments on a mixture of Gaussians right but get something else wrong.

Matt

transformed data {
  vector[2] alpha;
  matrix[2, 2] Sigma1;
  matrix[2, 2] Sigma2;
  vector[2] mu1;
  vector[2] mu2;
  real scale;

  alpha[1] <- 0.5;
  alpha[2] <- 0.5;

  scale <- square(10.0);

  Sigma1[1, 1] <- scale;
  Sigma1[2, 2] <- scale;
  Sigma1[1, 2] <- scale * 0.5;
  Sigma1[2, 1] <- Sigma1[1, 2];
  mu1[1] <- -scale;
  mu1[2] <- -scale;

  Sigma2[1, 1] <- square(0.5) * scale;
  Sigma2[2, 2] <- scale;
  Sigma2[1, 2] <- 0;
  Sigma2[2, 1] <- 0;
  mu2[1] <- scale;
  mu2[2] <- scale;
}
parameters {
  vector[2] y;
}
model {
  lp__ <- lp__ + log_sum_exp(log(alpha[1]) + multi_normal_log(y, mu1, Sigma1),
                             log(alpha[2]) + multi_normal_log(y, mu2, Sigma2));
}
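A rough sketch of steps 2-4 above (not Matt's code; the batch-means estimator is a stand-in for whatever ESS/MCSE machinery we settle on): estimate the Monte Carlo standard error of each moment and check agreement with the known values within a few standard errors. For the mixture above, the true mean and covariance follow analytically from alpha, mu1, mu2, Sigma1, and Sigma2.

    import numpy as np

    def batch_means_mcse(x, n_batches=50):
        # Monte Carlo standard error of the mean of a single chain via batch
        # means -- a crude stand-in for a proper ESS-based estimate (step 3).
        x = np.asarray(x, dtype=float)
        n = x.size - x.size % n_batches      # drop the remainder
        batch_means = x[:n].reshape(n_batches, -1).mean(axis=1)
        return np.std(batch_means, ddof=1) / np.sqrt(n_batches)

    def moment_check(draws, true_mean, true_var, z=3.0):
        # Steps 2 and 4: compare the first and second central moments of the
        # draws for one parameter to their known values, within z Monte Carlo
        # standard errors.
        draws = np.asarray(draws, dtype=float)
        mean_ok = abs(draws.mean() - true_mean) <= z * batch_means_mcse(draws)
        centered_sq = (draws - draws.mean()) ** 2
        var_ok = abs(centered_sq.mean() - true_var) <= z * batch_means_mcse(centered_sq)
        return mean_ok and var_ok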

betanalpha commented 10 years ago

To summarize, what seems to me to make the most sense is to:

  1. Have a framework/process for verifying correctness of new sampling algorithms that are candidates for inclusion in Stan and
  2. Have unit tests that ensure that implementations of already verified sampling algorithms have not been broken. I'm defining a "new" sampling algorithm pretty liberally to mean any new algorithm or implementation of an existing algorithm that for whatever reason can't pass an existing unit test based on samples being exactly identical. For example, if you permute the order of calls to the RNG, that qualifies as a "new" algorithm, and requires the more stringent verification process.

I think the only difficulty here is in determining when a change is either (1) or (2). To be safe we'll probably end up running the stringent tests in (1) with every pull request and they'll end up being practically identical to unit tests. Although I would be in support of a naming distinction -- unit_tests and estimator_tests or the like.

As far as #1 goes, I'm in favor of making this as simple as possible. My proposal would be to

  1. Run proposed algorithm for a long time (maybe 1M samples?) on a model with known statistics—say an overlapping two-component mixture of 2D Gaussians* (I'm against using a single Gaussian, since we want to make sure we handle asymmetrical distributions correctly).
  2. Compute first and second central moments of the sample.
  3. Estimate ESS and standard deviations of those statistics, compute MCMC standard errors.
  4. Check that our estimates from (2) are consistent with the known statistics up to 2 or 3 standard errors.

Yup. The only issue is the false positives in (4) since we run lots of tests. To be more robust I think we'd want to run multiple chains and try to validate the sampling distribution of (\hat{x} - x_true) / MCSE via the ratio of values inside and outside of some sigma band. In any case, I think we're all on the same general page for estimation tests.
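A sketch of that ensemble check, under the assumption that each independent chain supplies its own estimate and MCSE (function and argument names are illustrative):

    import numpy as np

    def z_score_coverage(estimates, mcses, true_value, band=2.0):
        # Fraction of standardized errors z = (estimate - true_value) / MCSE
        # that fall inside +/- band across independent chains. If the
        # estimators and their MCSEs are behaving, this should be close to
        # the normal coverage of the band (about 0.954 for band = 2).
        z = (np.asarray(estimates, dtype=float) - true_value) / np.asarray(mcses, dtype=float)
        return float(np.mean(np.abs(z) <= band))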

Also, we'd want to include known pathological distributions like the funnel and the banana.

My feeling is that that's going to be more robust than trying to look at things like distributions of integration times on a Gaussian etc.,

Side note -- this was meant as a test of the validity of the NUTS implementation, since in the Gaussian case we know exactly when the tree should be terminating, not the validity of the estimates. So really more of a proper unit test, modulo the random inits.

matthewdhoffman commented 10 years ago

Yup. The only issue is the false positives in (4) since we run lots of tests. To be more robust I think we'd want to run multiple chains and try to validate the sampling distribution of (\hat{x} - x_true) / MCSE via the ratio of values inside and outside of some sigma band. In any case, I think we're all on the same general page for estimation tests.

Will we really be running lots of tests? It seems like we should only need to run the randomized tests once per new sampling/adaptation algorithm.

But I nonetheless think that that proposal makes sense. If I understand correctly, the idea is to replicate the experiment a number of times and make sure that the range of averages we're getting is consistent with the CLT? The only downside is that it's more expensive.

Also, we'd want to include known pathological distributions like the funnel and the banana.

I think we should separate sampler correctness from sampler robustness. The funnel and banana are hard, which makes them poor choices for evaluating whether a sampler converges to the correct distribution. We should have a separate test suite that evaluates sampler efficiency on various kinds of problems, including both synthetic examples like the funnel and banana and real-world models (with and without efficient reparameterizations that eliminate funnel-like behavior if appropriate).

Matt

syclik commented 10 years ago

On Fri, Oct 25, 2013 at 2:41 PM, Matt Hoffman notifications@github.com wrote:

I think we should separate sampler correctness from sampler robustness. The funnel and banana are hard, which makes them poor choices for evaluating whether a sampler converges to the correct distribution. We should have a separate test suite that evaluates sampler efficiency on various kinds of problems, including both synthetic examples like the funnel and banana and real-world models (with and without efficient reparameterizations that eliminate funnel-like behavior if appropriate).

True.

Maybe we should have a test suite sitting outside the Stan repo that does the efficiency and robustness tests as a standalone project. Although it might be nice to have it inside the repo itself, it does seem like it's conceptually a project that's designed to evaluate MCMC algorithms (in Stan).

betanalpha commented 10 years ago

Will we really be running lots of tests? It seems like we should only need to run the randomized tests once per new sampling/adaptation algorithm.

Or per change/tweak in each algorithm. I imagine those will be more than enough (especially counting the multiple models we might test each time) to cause false positives to be common.

But I nonetheless think that that proposal makes sense. If I understand correctly, the idea is to replicate the experiment a number of times and make sure that the range of averages we're getting is consistent with the CLT? The only downside is that it's more expensive.

Right.

I think we should separate sampler correctness from sampler robustness. The funnel and banana are hard, which makes them poor choices for evaluating whether a sampler converges to the correct distribution. We should have a separate test suite that evaluates sampler efficiency on various kinds of problems, including both synthetic examples like the funnel and banana and real-world models (with and without efficient reparameterizations that eliminate funnel-like behavior if appropriate).

I'm not so much concerned with efficiency as bias. HMC has a particularly nasty habit of diverging when the curvature gets too large, causing biased samples. It's particularly ubiquitous in hierarchical models (we're writing some papers on it now) and it's exactly those kinds of models that really stress the validity of a sampler.

betanalpha commented 10 years ago

Maybe we should have a test suite sitting outside the Stan repo that does the efficiency and robustness tests as a standalone project. Although it might be nice to have it inside the repo itself, it does seem like it's conceptually a project that's designed to evaluate MCMC algorithms (in Stan).

Why not inside it? Like a parallel test suite that Jenkins can run if requested?

The strategy might be general, but any implementation would depend on the interface to the samplers which wouldn't be general at all. Plus we have to be careful about inappropriate defaults, which would require substantial tuning.

bob-carpenter commented 10 years ago

On 10/25/13 3:40 PM, Michael Betancourt wrote:

Will we really be running lots of tests? It seems like we should only need to run the randomized tests once per new sampling/adaptation algorithm.

The problem is that any change to something like auto-diff or one of the distributions is going to percolate through to the model tests.

It'd be nice if we had clean, independent unit tests for all the subparts, but we're not there yet. So we're being sloppy and hoping some high-level functional tests catch the problems.

We're prioritizing getting the multivariate tests in place. I added more for the multi_normal for 2.0.1, but they're not nearly as complete as the univariate tests.

One upside is that the fix to multi_normal really sped up model testing --- from 6 hours down to 3 hours.

I'm not so much concerned with efficiency as bias. HMC has a particularly nasty habit of diverging when the curvature gets too large, causing biased samples. It's particularly ubiquitous in hierarchical models (we're writing some papers on it now) and it's exactly those kinds of models that really stress the validity of a sampler.

But isn't the HMC sampler guaranteed to be right asymptotically?

The problem I see is how long we have to run, so it seems like this is more of a test of speed of convergence and convergence diagnostics.

For the funnel, we do know the "posterior" analytically, so we know what the variation should be on the hierarchical scale, so we can test that.

bob-carpenter commented 10 years ago

We also have to worry if we add samplers like Metropolis, which we may not be able to get to converge in our lifetimes on some of the models we'd like to test. I think this is the same issue conceptually as HMC not getting into the tails. But it's not a bug per se, so much as a not-so-great sampling algorithm.

I'd very much like to separate bugs in implementation from slow mixing! The latter we can't really fix if we want to roll out, say, a random walk Metropolis algorithm as part of Stan.


matthewdhoffman commented 10 years ago

I think we should separate sampler correctness from sampler robustness. The funnel and banana are hard, which makes them poor choices for evaluating whether a sampler converges to the correct distribution. We should have a separate test suite that evaluates sampler efficiency on various kinds of problems, including both synthetic examples like the funnel and banana and real-world models (with and without efficient reparameterizations that eliminate funnel-like behavior if appropriate).

I'm not so much concerned with efficiency as bias. HMC has a particularly nasty habit of diverging when the curvature gets too large, causing biased samples. It's particularly ubiquitous in hierarchical models (we're writing some papers on it now) and it's exactly those kinds of models that really stress the validity of a sampler.

In what sense? HMC always has the correct stationary distribution, yes? If you ran it for "long enough" (read: trillions of trillions of millenia, or maybe longer) it would eventually converge and give you unbiased estimates. By correctness I just mean having the correct stationary distribution, not converging to it in a reasonable amount of time for difficult models. I still think we want to separate our diagnoses of extremely low efficiency (even as low as 1e-100 ESS/year) from diagnoses of incorrect stationary distributions (0 ESS/year).

Matt

betanalpha commented 10 years ago

But isn't the HMC sampler guaranteed to be right asymptotically?

Not if the step size is too big!

The problem I see is how long we have to run, so it seems like this is more of a test of speed of convergence and convergence diagnostics.

For the funnel, we do know the "posterior" analytically, so we know what the variation should be on the hierarchical scale, so we can test that.

Right, we just have to know the sampling distribution of whatever estimate we're considering.

betanalpha commented 10 years ago

We also have to worry if we add samplers like Metropolis, which we may not be able to get to converge in our lifetimes on some of the models we'd like to test.

You're talking to the dude running some Metropolis/Gibbs samplers right now pushing 2 effective samples per hour. How did anyone do science before HMC?!?!

betanalpha commented 10 years ago

In what sense? HMC always has the correct stationary distribution, yes? If you ran it for "long enough" (read: trillions of trillions of millenia, or maybe longer) it would eventually converge and give you unbiased estimates. By correctness I just mean having the correct stationary distribution, not converging to it in a reasonable amount of time for difficult models. I still think we want to separate our diagnoses of extremely low efficiency (even as low as 1e-100 ESS/year) from diagnoses of incorrect stationary distributions (0 ESS/year).

HMC in theory always gets the right stationary distribution, but once you use a numerical integrator things can get complicated (with the inevitable argument between "wrong" and "right after (\infty - 1) iterations"). Really it all depends on exactly how the integrator diverges.

For example, run a 100+1 funnel and adapt to an optimal acceptance probability of 0.65 (the default). You can run for any practical amount of time and you'll never get deep enough into the funnel.

The issue we want to diagnose is when increasing the target acceptance probability doesn't fix the bias. This is what I found with the current NUTS algorithm in develop, which confirmed small, almost imperceptible errors.

betanalpha commented 10 years ago

But seriously, what needs to be done before this branch can be considered for a merge?

bob-carpenter commented 10 years ago

On 10/25/13 4:09 PM, Michael Betancourt wrote:

We also have to worry if we add samplers like Metropolis, which we may not be able to get to converge in our lifetimes on some of the models we'd like to test.

You're talking to the dude running some Metropolis/Gibbs samplers right now pushing 2 effective samples per hour. How did anyone do science before HMC?!?!

I believe point estimation was very popular.

Gibbs isn't that much slower than HMC in a lot of models.