tomlepaine / paper-notes

List of papers to read, and notes after I read them.

Gradient Estimation Using Stochastic Computation Graphs #40

Open tomlepaine opened 8 years ago

tomlepaine commented 8 years ago

https://arxiv.org/abs/1506.05254

tomlepaine commented 8 years ago

A rich class of problems arising throughout machine learning requires optimizing loss functions that involve an expectation over random variables. Two broad categories of these problems are (1) likelihood maximization in probabilistic models with latent variables [17, 18], and (2) policy gradients in reinforcement learning [5, 23, 26]. Combining ideas from those two perennial topics, recent models of attention [15] and memory [29] have used networks that involve a combination of stochastic and deterministic operations.

    [17] R. M. Neal. Learning stochastic feedforward networks. Department of Computer Science, University of Toronto, 1990.
    [18] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models, pages 355–368. Springer, 1998.
    [15] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pages 2204–2212, 2014.
    [29] W. Zaremba and I. Sutskever. Reinforcement learning neural Turing machines. arXiv preprint arXiv:1505.00521, 2015.
tomlepaine commented 8 years ago

6 Related Work

As discussed in Section 2, the score function and pathwise derivative estimators have been used in a variety of different fields, under different names. See [3] for a review of gradient estimation, mostly from the simulation optimization literature. Glasserman's textbook provides an extensive treatment of various gradient estimators and Monte Carlo estimators in general. Griewank and Walther's textbook [8] is a comprehensive reference on computation graphs and automatic differentiation (of deterministic programs). The notation and nomenclature we use are inspired by Bayes nets and influence diagrams [19]. (In fact, a stochastic computation graph is a type of Bayes network, where the deterministic nodes correspond to degenerate probability distributions.)

The topic of gradient estimation has drawn significant recent interest in machine learning. Gradients for networks with stochastic units were investigated in Bengio et al. [2], though they are concerned with differentiating through individual units and layers, not with how to deal with arbitrarily structured models and loss functions. Kingma and Welling [11] consider a similar framework, although only with continuous latent variables, and point out that reparameterization can be used to convert hierarchical Bayesian models into neural networks, which can then be trained by backpropagation. The score function method is used to perform variational inference in general models (in the context of probabilistic programming) in Wingate and Weber [27], and similarly in Ranganath et al. [20]; both papers mostly focus on mean-field approximations without amortized inference. It is used to train generative models using neural networks with discrete stochastic units in Mnih and Gregor [14] and Gregor et al. [7]; both amortize inference by using an inference network.

Generative models with continuous-valued latent variables are trained (again using an inference network) with the reparameterization method by Rezende, Mohamed, and Wierstra [21] and by Kingma and Welling [10]. Rezende et al. also provide a detailed discussion of reparameterization, including a discussion comparing the variance of the SF and PD estimators. Bengio, Léonard, and Courville [2] have recently written a paper about gradient estimation in neural networks with stochastic units or non-differentiable activation functions, including Monte Carlo estimators and heuristic approximations. The notion that policy gradients can be computed in multiple ways was pointed out in early work on policy gradients by Williams [26]. However, all of this prior work deals with specific structures of the stochastic computation graph and does not address the general case.

    [2] Y. Bengio, N. Léonard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
    [11] D. P. Kingma and M. Welling. Efficient gradient-based inference through transformations between Bayes nets and neural nets. arXiv preprint arXiv:1402.0480, 2014.
    [14] A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.
    [7] K. Gregor, I. Danihelka, A. Mnih, C. Blundell, and D. Wierstra. Deep autoregressive networks. arXiv preprint arXiv:1310.8499, 2013.
    [21] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
    [10] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
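The variance gap between the score function (SF) and pathwise derivative (PD) estimators mentioned above is easy to see on a toy problem. A minimal numpy sketch (the Gaussian sampling distribution, the cost f(x) = x², and the sample size are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: estimate d/dmu of E_{x ~ N(mu, 1)}[x^2].
# Analytically E[x^2] = mu^2 + 1, so the true gradient is 2 * mu.
mu = 1.5
n = 200_000
x = mu + rng.standard_normal(n)  # samples from N(mu, 1)
cost = x ** 2

# Score function (SF / REINFORCE) estimator:
#   grad ≈ mean(f(x) * d/dmu log p(x; mu)),
#   and for N(mu, 1) the score is d/dmu log p(x; mu) = x - mu.
sf_grad = np.mean(cost * (x - mu))

# Pathwise derivative (PD / reparameterization) estimator:
#   writing x = mu + eps with eps ~ N(0, 1) gives grad ≈ mean(f'(x)) = mean(2 * x).
pd_grad = np.mean(2 * x)

true_grad = 2 * mu  # = 3.0
```

Both estimates converge to 3.0, but on this problem the SF estimate has a per-sample variance several times larger than the PD estimate, which is the kind of comparison Rezende et al. [21] discuss.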
tomlepaine commented 8 years ago

The contributions of this work are as follows:

  • We introduce a formalism of stochastic computation graphs, and in this general setting, we derive unbiased estimators for the gradient of the expected loss.
  • We show how this estimator can be computed as the gradient of a certain differentiable function (which we call the surrogate loss); hence, it can be computed efficiently using the backpropagation algorithm. This observation enables a practitioner to write an efficient implementation using automatic differentiation software.
  • We describe variance reduction techniques that can be applied to the setting of stochastic computation graphs, generalizing prior work from reinforcement learning and variational inference.
  • We briefly describe how to generalize some other optimization techniques to this setting: majorization-minimization algorithms, by constructing an expression that bounds the loss function; and quasi-Newton / Hessian-free methods [13], by computing estimates of Hessian-vector products.

The main practical result of this article is that to compute the gradient estimator, one just needs to make a simple modification to the backpropagation algorithm, where extra gradient signals are introduced at the stochastic nodes. Equivalently, the resulting algorithm is just the backpropagation algorithm, applied to the surrogate loss function, which has extra terms introduced at the stochastic nodes. The modified backpropagation algorithm is presented in Section 5.
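The surrogate-loss idea can be checked numerically on a one-node graph: build a differentiable function whose extra terms at the stochastic node are the log-probability weighted by the (held-fixed) sampled cost, and its gradient matches the score function estimator. A sketch under assumed toy choices (Gaussian sampling distribution, cost x², finite differences standing in for backpropagation):

```python
import numpy as np

rng = np.random.default_rng(1)

mu = 1.5
n = 100_000
x = mu + rng.standard_normal(n)  # fixed draws from N(mu, 1)
cost = x ** 2                    # sampled per-path cost, treated as constant below

def surrogate(m):
    # log N(x; m, 1) up to an additive constant, weighted by the detached
    # cost; this is the extra term introduced at the stochastic node.
    return np.mean(-0.5 * (x - m) ** 2 * cost)

# Differentiate the surrogate at mu via central finite differences
# (in practice an autodiff framework would do this by backpropagation).
h = 1e-4
grad_surrogate = (surrogate(mu + h) - surrogate(mu - h)) / (2 * h)

# Score function estimator computed directly on the same samples.
sf_grad = np.mean((x - mu) * cost)
```

The two quantities agree to numerical precision, since d/dm of -0.5 * (x - m)² * cost, evaluated at m = mu, is exactly (x - mu) * cost.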

  • Derives an unbiased estimator that can be expressed as the gradient of a surrogate loss.
  • Describes variance reduction techniques.
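One of the variance reduction techniques the paper generalizes is baseline subtraction from reinforcement learning. A hedged numpy sketch on the same kind of toy Gaussian problem (all numbers here are illustrative assumptions): subtracting a constant baseline from the cost leaves the score function estimator unbiased, because the score has zero mean, while shrinking its variance.

```python
import numpy as np

rng = np.random.default_rng(2)

mu = 1.5
n = 50_000
x = mu + rng.standard_normal(n)  # samples from N(mu, 1)
cost = x ** 2
score = x - mu                   # d/dmu log N(x; mu, 1)

# Plain score function estimator.
plain = cost * score

# Baseline-subtracted estimator: since E[score] = 0, subtracting any
# constant b leaves the estimator unbiased. (Here b is estimated from the
# same batch for brevity; in practice it is fit separately so the
# estimator stays strictly unbiased.)
b = cost.mean()
centered = (cost - b) * score

grad_plain, grad_baseline = plain.mean(), centered.mean()
var_plain, var_baseline = plain.var(), centered.var()
```

Both estimates converge to the true gradient 2 * mu = 3.0, while the baseline roughly halves the per-sample variance in this setting.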
tomlepaine commented 8 years ago

Nomenclature

Methods for estimating gradients of expectations have been proposed independently in several different fields, each using its own terminology.