probmods / webppl

Probabilistic programming for the web
http://webppl.org

Implement HMC #81

Closed ngoodman closed 8 years ago

ngoodman commented 9 years ago

Following the ideas in Nonstandard Interpretations of Probabilistic Programs for Efficient Inference we can implement Hamiltonian Monte Carlo (HMC) using automatic differentiation (AD).

HMC is useful in models with many continuous parameters. This seems especially helpful for Bayesian data analysis, where we are often dealing with parameters per subject and condition (i.e. thousands of continuous parameters).

I believe @iffsid already has AD working using sweet.js, so it would perhaps be straightforward to do HMC?

The Stan project has some good tricks for making HMC work in practice. Worth looking at if we start using HMC a lot.

With HMC implemented, evaluate combinations such as HMC rejuvenation in SMC.
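
For reference, what AD needs to provide is the gradient of the trace score with respect to the continuous choices; the leapfrog update at the core of HMC is then only a few lines. A rough sketch in plain JavaScript (grad here is a stand-in for the AD-computed gradient of the log score, not a webppl API):

// Sketch of a single leapfrog step. q is an array of continuous choices,
// p the corresponding momenta, and grad(q) the gradient of the log score at q.
var leapfrogStep = function(q, p, stepSize, grad) {
  var g = grad(q);
  var pHalf = p.map(function(pi, i) { return pi + 0.5 * stepSize * g[i]; });       // half-step for momentum
  var qNew = q.map(function(qi, i) { return qi + stepSize * pHalf[i]; });          // full step for position
  var gNew = grad(qNew);
  var pNew = pHalf.map(function(pi, i) { return pi + 0.5 * stepSize * gNew[i]; }); // half-step for momentum
  return { position: qNew, momentum: pNew };
};
// A proposal runs several such steps and is accepted with probability
// min(1, exp(newScore - oldScore - 0.5 * (|pNew|^2 - |p|^2))).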

iffsid commented 8 years ago

The ad branch (currently at b24e282ba05ef60a0e2a60037b9b412ebe3cb363) has both HMC and HMC rejuvenation in SMC (HSMC) implemented. HMC can be run with a repetition-set of kernels (e.g. ['mh', 'leapfrog']), or just as ['mh'], or just as ['leapfrog']. The same applies to the rejuvenation in SMC.

A simple example for HSMC

var model = function() {
  var mu = gaussian(0, 2);
  factor(gaussianERP.score([mu, 1], 5.5));
  factor(gaussianERP.score([mu, 1], 6.5));
  return mu;
};

var hsmcOpts = {numParticles: 1000,
                kernelOpts: {stepSize: 0.1,
                             steps: 5,
                             iterations: 20,
                             kernels: ['leapfrog'],
                             aggregator: 'score'},
                aggregator: 'score'};

var erp = HSMC(model, hsmcOpts);
console.log("logZ", erp.normalizationConstant)
ad.untapify(expectation(erp))
// => 5.3 (approximately)

The codebase is fairly clean, with a decent amount of refactoring of the trace and particle data structures that helps keep working with stateful ad sane. Some work along the lines of #86 also went in to simplify the interface and improve usability. The aggregator option decides whether the histogram update happens on 'count' or 'score' -- the latter is for use when the support of the returned distribution is real-valued.
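
In case the 'count' vs 'score' distinction is unclear, here's an illustrative sketch of the two kinds of histogram update (names are made up for illustration; this isn't the actual implementation):

// Illustrative only. 'count' increments by one per occurrence of a return
// value; 'score' adds the exponentiated score, which is the sensible choice
// when the support is continuous and each return value typically occurs once.
var updateHist = function(hist, value, score, aggregator) {
  var k = JSON.stringify(value);
  var increment = (aggregator === 'count') ? 1 : Math.exp(score);
  hist[k] = (hist[k] || 0) + increment;
  return hist;
};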

I'm not closing the issue yet because the test framework needs work to untapify results where necessary so that they can be compared against the expected values. Currently most things remain tapes unless explicitly untapified, which is a requirement when nesting models with ad.

null-a commented 8 years ago

Is this in a state where I can think about trying to merge it, or is there other work to do first?

iffsid commented 8 years ago

So, in order to have HMC we'd need to have support for a few things

Separate from this, in terms of generality, it would be nice if the score/weight updating code in the inference algorithms were abstracted out in some sense: that code would otherwise need to be modified to use the overloaded operators, since it potentially handles score tapes rather than plain scores. The same applies to the histogram construction.

Primarily this would be good because we wouldn't need to litter the inference methods with calls to ad.add or ad.untapify, and wouldn't need to require ad.js within those files either.

To see a simple example of the change in inference code to support AD, take a look at Enumerate.js in the ad branch.
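
To make the suggestion concrete, the idea (as I read it) is that inference code would route score arithmetic through a small helper rather than using + directly, so the same code works whether it is handed plain numbers or tapes. A hypothetical sketch, not code from the ad branch:

// Hypothetical sketch. ad.add behaves like + on plain numbers but builds a
// tape node when either argument is a tape, so code written against this
// helper never needs to know which case it is handling.
var ad = require('./ad');  // module name assumed

var accumulateScore = function(currentScore, increment) {
  return ad.add(currentScore, increment);
};

// e.g. inside an enumeration step:
// state.score = accumulateScore(state.score, erp.score(params, value));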

null-a commented 8 years ago

it would be nice if the score/weight updating code in the inference algorithms were abstracted out in some sense.

To see a simple example of the change in inference code to support AD, take a look at Enumerate.js in the ad branch.

I thought I understood this, but the more I've thought about it, the less sure of exactly what needs to happen I've become. Here's what I've been thinking:

I originally thought that the score updating code (such as that which you pointed out in Enumerate) needed to be changed because the score functions are transformed, leading to inference algorithms always having to deal with tapes. But I don't think that's the case, is it? If the ad operators (ad.add etc.) are passed numbers they return numbers, so as long as the inference algorithm doesn't create any tapes (and isn't handed any) then it can work unmodified with the transformed score functions.

One case where the score updating code (e.g. in Enumerate) does need to use the ad operators is nested inference. If you have something like this:

HMC(function() {
  var x = gaussian(0, 1)
  Enumerate(function() {
    var y = flip()
    factor(x)
  })
})

... then Enumerate is going to be passed a tape (through factor) and so had better deal with that, as your branch does.

Of course, there are other ways tapes could be passed to nested inference algorithms, e.g. as an ERP parameter through sample:

HMC(function() {
  var p = uniform(0, 1)
  Enumerate(function() {
    flip(p)
  })
})

... or even as an option to an inference algorithm. I don't think Enumerate in the ad branch handles this, right?

Am I right in thinking it's necessary to propagate gradient information all the way through e.g. Enumerate?

For example, it seems this:

HMC(function() {
  var x = uniform(0, 1);
  var y = flip(x);
  factor(y ? 0 : -1);
  return x;
})

... should be equivalent to this:

var makeCoin = function(p) {
  return Enumerate(function() {
    return flip(p);
  });
};

HMC(function() {
  var x = uniform(0, 1);
  var coin = makeCoin(x);
  var y = sample(coin);
  factor(y ? 0 : -1);
  return x;
})

... but I don't think they are at present. In the first, the computed gradient (w.r.t. x) correctly depends on x, but in the second the computed gradient (w.r.t. x) is always zero. Maybe this is what you meant by "Also, the histogram construction", which is perhaps where propagation stops at present?

So with all that in mind it seems to me that:

  1. If we don't support nesting inference (within HMC), algorithms like Enumerate can remain largely unchanged.
  2. If we do support nesting inference (which I assume we will) then it's hard to see how to do so in a clean way without applying the ad transform to all of the inference code.

Does that sound about right or am I missing something?

ngoodman commented 8 years ago

that all sounds right to me. (@iffsid do you agree?)

i think it would be fine to first implement the version that doesn't handle nested inference inside HMC and open an issue for extending to the nested case.

to fully extend to the nested case i think you're right that all of the inference code needs to correctly compute with tapes....

-N


iffsid commented 8 years ago

@ngoodman @null-a Yes I believe this is correct.

In order to handle the non-nested case, you shouldn't need to do much. The only changes, as far as I can see, are to handle proposing to multiple ERPs, keeping track of reuse for each of them, and allowing traces to be deep-copied.

null-a commented 8 years ago

HMC merge todo:

null-a commented 8 years ago

If anyone wants to try this out in its current state I think you'll need to:

To use HMC in a program you need to do something like:

MCMC(model, { samples: 100, kernel: [HMCKernel, MHDiscrete] })

I've updated my linear regression demo which shows HMC working in the browser. (Inference sometimes fails when initialized to a low probability state in this demo. You might need to refresh the page and try again to get it running.)
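
For anyone who wants a self-contained starting point, here's a hypothetical version of that call using the toy Gaussian model from earlier in this thread (the kernel option follows the snippet above; treat it as a sketch rather than a tested invocation):

// Toy model copied from the HSMC example above; kernel option as in the
// MCMC(...) snippet. Names like HMCKernel/MHDiscrete follow that snippet.
var model = function() {
  var mu = gaussian(0, 2);
  factor(gaussianERP.score([mu, 1], 5.5));
  factor(gaussianERP.score([mu, 1], 6.5));
  return mu;
};

var erp = MCMC(model, { samples: 100, kernel: [HMCKernel, MHDiscrete] });
console.log(expectation(erp));  // should come out near 5.3, as in the HSMC example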

stuhlmueller commented 8 years ago

Neat! It could also be interesting to visualize samples for an example where it's easier to see that MH and HMC differ, such as an elliptical 2D Gaussian (as on page 17 of this paper: http://www.cs.utoronto.ca/~radford/ftp/ham-mcmc.pdf).

ngoodman commented 8 years ago

fun demo!! red means rejected?

-N


null-a commented 8 years ago

red means rejected?

Sorry - yes.

null-a commented 8 years ago

There are a few things I'm unsure about:

Structural continuous variables. It's not obvious to me that there's a way to handle variables popping into/out of existence during HMC. Is there? Or do we assume no structure change and leave it to MH?

Constraints. I added explicit handling of constraints (imposed by the support of a variable) because I wasn't convinced that just rejecting the proposal when the score becomes -Infinity was correct. The way I'm doing this works in simple cases but runs into problems when the constraints themselves depend on other variables, e.g. uniform(0, uniform(0, 1)). Has anyone thought about this before?

Setting step size per-ERP. I think we'd need to do this (or something similar) in order to do inference in CP4.1 (if we want to include the noise variable in inference). I guess one possible way to do it would be to extend sample, something like: sample(gammaERP, [1, 1], { stepSize: 0.01 }). Is that something worth doing now, and if so is this a good approach, or shall we tackle it later?

dritchie commented 8 years ago

I faced the same questions back when I was writing Quicksand (http://dritchie.github.io/quicksand/). Here's what I remember:

Structural continuous variables: Yeah, it's definitely important to know when a continuous variable can affect control flow--just assuming they won't is not a good solution, I think. In Quicksand, I just had the user mark whether a random choice was structural or not (I think the default was that all choices were structural unless otherwise specified--being conservative this way makes the default always correct, if not the most efficient).

Constraints: The standard way to handle continuous variables with bounded domains seems to be to use transformations that move the variable into an unbounded space (i.e. for a variable lower-bounded at zero, HMC can compute proposals on an unbounded variable, but this variable gets passed through an exp transform before the program sees it). You can look at Section 55, "Transformations of Constrained Variables", in the Stan Manual (http://mc-stan.org/documentation/) to get a better idea. The Quicksand source code also uses this technique (though it's a bit messy).
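
A minimal sketch of that transform for a variable lower-bounded at zero, with the usual change-of-variables correction (nothing here is webppl API):

// Sketch: HMC proposes on the unconstrained y = log(x); the program sees
// x = exp(y). The change of variables adds log|dx/dy| = y to the log density,
// which is what keeps the target distribution unchanged.
var toUnconstrained = function(x) { return Math.log(x); };
var toConstrained = function(y) { return Math.exp(y); };

// The density HMC should differentiate, given the original log density on x:
var logDensityY = function(logDensityX, y) {
  return logDensityX(toConstrained(y)) + y;  // + y is the log-Jacobian term
};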

Step sizes: YMMV, but I've always preferred automatically adapting the step size, so that users don't have to set it manually. Section 3.2 of the NUTS paper (http://arxiv.org/pdf/1111.4246v1.pdf) provides one algorithm for doing this, and that's the one that I used in Quicksand. They present it as a single adaptive step size for all variables, but there's no reason it couldn't be done per-variable as well.
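
For reference, a rough sketch of that dual-averaging scheme as I read Section 3.2 of the NUTS paper (the constants follow the paper; the wrapper and variable names are made up, so treat this as an outline rather than a drop-in implementation):

// Sketch of dual-averaging step-size adaptation. 'accept' is the observed
// acceptance probability for the current iteration; 'delta' is the target
// acceptance rate (around 0.65 in the paper).
var makeStepSizeAdapter = function(initialStepSize, delta) {
  var mu = Math.log(10 * initialStepSize);
  var gamma = 0.05, t0 = 10, kappa = 0.75;
  var hBar = 0, logEpsBar = 0, m = 0;
  return function(accept) {
    m += 1;
    hBar = (1 - 1 / (m + t0)) * hBar + (delta - accept) / (m + t0);
    var logEps = mu - (Math.sqrt(m) / gamma) * hBar;
    var w = Math.pow(m, -kappa);
    logEpsBar = w * logEps + (1 - w) * logEpsBar;
    // use exp(logEps) while adapting; exp(logEpsBar) once adaptation stops
    return { current: Math.exp(logEps), smoothed: Math.exp(logEpsBar) };
  };
};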

ngoodman commented 8 years ago

good questions.

Structural continuous variables: while i think it might be well-defined even when variables change over the course of leapfrog.... it is really a research question. (basically does the symplectic integrator still conserve volume on a flag manifold.) for us, just noticing when a new continuous variable pops into existence (via the sample co-routine for hmc presumably?) and rejecting seems ok.
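
A sketch of what that check might look like (the names and the trace bookkeeping are made up; the real hook would presumably live in the HMC sample coroutine):

// Hypothetical sketch: record the addresses of the continuous choices before
// simulating leapfrog, and reject the whole proposal if the set has changed,
// i.e. if a continuous variable appeared or disappeared.
var structureChanged = function(addressesBefore, addressesAfter) {
  if (addressesBefore.length !== addressesAfter.length) { return true; }
  return addressesAfter.some(function(a) {
    return addressesBefore.indexOf(a) === -1;
  });
};

// In the kernel: if (structureChanged(before, after)) { /* reject the proposal */ }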

Constraints: what daniel said. also ok if we don't handle this optimally for now.

Step sizes: adaptation via pilot runs seems standard and reasonable. not sure if this works per erp? setting step sizes by hand is kind of fiddly.

overall, i would say that we don't need to get all of these issues ironed out initially. as we are working on applications that need them taken care of, we can go clean them up. (for me moving on to getting variational stuff going is a bit more urgent.)

-N


stuhlmueller commented 8 years ago

Structural continuous variables: I'm not sure that just rejecting proposals that lead to structure change is correct, but if it is, we should do it. If we don't know that it is, we should throw an error. The latter would be equivalent to assuming (in the first HMC pull request) that HMC is only applied to models where all continuous variables are non-structural, which I think is a fine assumption initially. Then, in the next version, you could build on Daniel's proposal and add a way for users to mark which variables not to include in HMC steps.

Constraints: Daniel's approach seems best. A minor comment on your current implementation: if newVal == lower or newVal == upper, the adjustment block doesn't change newVal, but it will still fail the assertion below.

Step sizes: I'd go with automatic per-variable adaptation in the burn-in phase, too. This doesn't need to be part of the first HMC PR either.

null-a commented 8 years ago

Thanks everyone.

for me moving on to getting variational stuff going is a bit more urgent

Cool. I'll create a PR at the beginning of next week. We can use that to figure out if we're happy with what we have so far or if it needs more work before it's useful.