tensorflow / probability

Probabilistic reasoning and statistical analysis in TensorFlow
https://www.tensorflow.org/probability/
Apache License 2.0

feature request: gradients of expected values #37

Open jacksonloper opened 6 years ago

jacksonloper commented 6 years ago

Algorithmic construction of surrogates to estimate gradients of expected values has always seemed like a natural feature for tensorflow. I think we tried it a few years back but it never got off the ground. Maybe the time is now. Possibly even using modern surrogates such as dice, that accomodate higher order derivatives. There is also some rumbling about this in the edward community (cf. this issue), but I thought I would mention it here to see what the tensorflow probability community thought.

If you're not familiar with the so-called "stochastic computational graph" (SCG) scene, the bottom line is this:

Say we want to estimate the gradient of the expected value of a random variable with respect to some parameters. If we can use the reparametrization trick then it turns out to be really easy -- but in many cases that trick doesn't apply. In particular, consider the following case:
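A typical case where it doesn't apply, sketched below with assumed names (the original example isn't reproduced here), is a discrete latent variable whose distribution depends on the parameters:

```python
import tensorflow as tf
import tensorflow_probability as tfp

theta = tf.Variable(0.0)  # parameter we want gradients with respect to
z = tfp.distributions.Bernoulli(logits=theta, dtype=tf.float32).sample()
loss = (z - 0.7) ** 2

# E[loss] depends on theta through the distribution of z, but the sampled z is
# discrete: there is no differentiable path from theta to loss, so the
# reparametrization trick doesn't apply and the pathwise gradient is zero.
```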

However, at least as of 2016 we now know how to write a general function surrogate(loss) that crawls the graph and automatically produces a tensor loss_surrogate such that gradients of loss_surrogate, taken with respect to any parameters, are unbiased estimators of the corresponding gradients of the expected loss.

To work, the algorithm basically just needs to be able to compute the pmf or pdf of any op that is stochastic in a way that depends on its input. In most cases we can write any complicated random stuff as compositions of simple distributions for which we know the likelihood, so this is no problem. The algorithm can then define a loss_surrogate tensor which will let you get estimators of the gradients of expected values. Note you don't have to know ahead of time what you might want to take the gradient with respect to.
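As a minimal sketch of the idea (just an illustration for the assumed Bernoulli example above, not the proposed graph-crawling implementation), the score-function surrogate for that case looks like:

```python
import tensorflow as tf
import tensorflow_probability as tfp

theta = tf.Variable(0.0)
dist = tfp.distributions.Bernoulli(logits=theta, dtype=tf.float32)
z = dist.sample()
loss = (z - 0.7) ** 2

# The surrogate's value is not the loss itself, but its gradient w.r.t. theta
# is loss * d/d(theta) log p(z | theta) (the score-function / REINFORCE term),
# whose expectation equals d/d(theta) E[loss].
loss_surrogate = loss + tf.stop_gradient(loss) * dist.log_prob(z)
```

A graph-crawling surrogate(loss) would do essentially this automatically: find every stochastic op between the parameters and the loss and add its log-density, weighted by the stop-gradiented downstream cost.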

It would be super nice to implement this surrogate function for TF. I think it would actually be fairly straightforward to implement, but we would definitely need community support to keep it maintained. We would need to handle corner cases such as random ops whose density can't be written down. Moreover, any time someone invents a new way of drawing randomness, we would need to think about how to make sure it plays nicely with whatever surrogate(loss) function we might cook up.

davmre commented 6 years ago

Hi Jackson! This is a great request. Others might have more to say here, but I think the current state of things is that walking the TF graph is strongly discouraged, and code that tries to do so using unpublished APIs is subject to breakage without notice. The brittleness of graph-walking in Edward was a primary motivation for the development of Edward2, which uses its own tracing mechanism to avoid directly walking the TF graph.

Subject to this restriction, you might find that the expectation utility (https://github.com/tensorflow/probability/blob/master/tensorflow_probability/python/monte_carlo.py#L29) does some of what you're asking for: given an explicit source of randomness, it returns a Monte Carlo expectation with unbiased stochastic gradient, using the reparametrization or score-function estimators as appropriate. You can use this to effectively construct stochastic computation graphs, albeit in perhaps a slightly lower-level way than you're thinking of. We'd certainly be excited about designs to make this sort of functionality more cleanly accessible.
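For instance, a rough usage sketch against the linked monte_carlo.py (keyword names such as use_reparametrization are taken from that file and may differ in other releases):

```python
import tensorflow as tf
import tensorflow_probability as tfp

logits = tf.Variable([0.0])
dist = tfp.distributions.Bernoulli(logits=logits, dtype=tf.float32)
samples = dist.sample(1000)  # the explicit source of randomness

# Monte Carlo estimate of E[f(z)]; because Bernoulli samples aren't
# reparameterizable, passing log_prob and use_reparametrization=False makes
# the gradients w.r.t. `logits` use the score-function estimator.
expected_loss = tfp.monte_carlo.expectation(
    f=lambda z: (z - 0.7) ** 2,
    samples=samples,
    log_prob=dist.log_prob,
    use_reparametrization=False)
```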

jacksonloper commented 6 years ago

Is there a white paper somewhere outlining the scope of the API Edward2 is planning to encompass? Or is the main idea just to rewrite Edward in a less brittle way?

davmre commented 6 years ago

@dustinvtran want to take this?