probml / pml2-book

Probabilistic Machine Learning: Advanced Topics

Exchangeability between derivatives and integrals in 6.3.3 SGD for optimizing the parameters of a distribution #259

Closed maremita closed 11 months ago

maremita commented 1 year ago

[Version: 2023-04-01]

In Section 6.3.3 (SGD for optimizing the parameters of a distribution, page 267), we want to compute the gradient of a stochastic objective with respect to the parameters of a distribution $q_{\boldsymbol{\theta}}(\mathbf{z})$. The objective function is written as an expectation wrt $q_{\boldsymbol{\theta}}(\mathbf{z})$. The gradient is given by:

\begin{aligned}
\nabla_{\boldsymbol{\theta}} \mathbb{E}_{q_{\boldsymbol{\theta}}(\mathbf{z})}\left[\tilde{\mathcal{L}}(\boldsymbol{\theta}, \mathbf{z}) \right] &= \nabla_{\boldsymbol{\theta}} \int \tilde{\mathcal{L}}(\boldsymbol{\theta}, \mathbf{z})\, q_{\boldsymbol{\theta}}(\mathbf{z}) \,d\mathbf{z} &&\text{(6.53)}\\
&= \int \nabla_{\boldsymbol{\theta}} \left[ \tilde{\mathcal{L}}(\boldsymbol{\theta}, \mathbf{z})\, q_{\boldsymbol{\theta}}(\mathbf{z}) \right] d\mathbf{z} &&\text{(not shown in the book)}\\
&= \int \left[ \nabla_{\boldsymbol{\theta}} \tilde{\mathcal{L}}(\boldsymbol{\theta}, \mathbf{z}) \right] q_{\boldsymbol{\theta}}(\mathbf{z}) \,d\mathbf{z} + \int \tilde{\mathcal{L}}(\boldsymbol{\theta}, \mathbf{z}) \left[ \nabla_{\boldsymbol{\theta}} q_{\boldsymbol{\theta}}(\mathbf{z}) \right] d\mathbf{z} &&\text{(6.54)}
\end{aligned}

In the intermediate step (not shown in the book), we exchange the gradient and the integral; we then apply the product rule to $\nabla_{\boldsymbol{\theta}} \left[ \tilde{\mathcal{L}}(\boldsymbol{\theta}, \mathbf{z})\, q_{\boldsymbol{\theta}}(\mathbf{z}) \right]$ to obtain equation 6.54.
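To make the interchange concrete, here is a minimal numerical sketch (not from the book; the toy objective and distribution are my own choices). It takes $q_{\theta}(z) = \mathcal{N}(z \mid \theta, 1)$ and $\tilde{\mathcal{L}}(z) = z^2$, so $\tilde{\mathcal{L}}$ has no direct dependence on $\theta$ and the first integral in 6.54 vanishes; the second integral is turned into an expectation using $\nabla_{\theta} q_{\theta}(z) = q_{\theta}(z) \nabla_{\theta} \log q_{\theta}(z)$, estimated by Monte Carlo, and compared with the exact gradient $2\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5        # parameter of q_theta(z) = N(theta, 1)
n = 200_000        # number of Monte Carlo samples

# Toy objective: L(z) = z**2, so E_q[L] = theta**2 + 1 and the exact gradient is 2 * theta.
z = rng.normal(loc=theta, scale=1.0, size=n)

# Second integral of 6.54 written as an expectation, using grad_theta q = q * grad_theta log q,
# with grad_theta log N(z; theta, 1) = (z - theta).
grad_est = np.mean(z**2 * (z - theta))

print(grad_est, 2 * theta)   # the Monte Carlo estimate should be close to 3.0
```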

To justify exchanging the derivative and the integral, the BBVI paper invokes the dominated convergence theorem, while some forum answers justify it via the Leibniz integral rule (is that related to the dominated convergence theorem?).
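For reference (this statement is not in the book; it is the standard measure-theoretic form of the Leibniz rule, whose proof goes through the dominated convergence theorem): if $h(\boldsymbol{\theta}, \mathbf{z})$ is differentiable in $\boldsymbol{\theta}$ for almost every $\mathbf{z}$, and there exists an integrable $g(\mathbf{z})$ such that $\lVert \nabla_{\boldsymbol{\theta}} h(\boldsymbol{\theta}, \mathbf{z}) \rVert \leq g(\mathbf{z})$ for all $\boldsymbol{\theta}$ in a neighborhood of the point of interest, then

\nabla_{\boldsymbol{\theta}} \int h(\boldsymbol{\theta}, \mathbf{z}) \, d\mathbf{z} = \int \nabla_{\boldsymbol{\theta}} h(\boldsymbol{\theta}, \mathbf{z}) \, d\mathbf{z}

Taking $h(\boldsymbol{\theta}, \mathbf{z}) = \tilde{\mathcal{L}}(\boldsymbol{\theta}, \mathbf{z})\, q_{\boldsymbol{\theta}}(\mathbf{z})$ gives exactly the step from 6.53 to the unnumbered line above.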

It seems that this exchangeability is not trivial. If so, it would be great if the book explained this intermediate step.

Thank you.

Best, Amine Remita

P.S. For those with an older version of the book, equations 6.53 and 6.54 correspond to 6.91 and 6.92, respectively, in Section 6.5.2 (Optimizing parameters of a distribution, page 273).

maremita commented 1 year ago

[Update]

In Monte Carlo Gradient Estimation in Machine Learning (Mohamed et al., 2020), the authors discuss the validity of interchanging integration and differentiation when deriving the score function gradient estimator (REINFORCE):

4.3.1 Unbiasedness

When the interchange between differentiation and integration in (13a) is valid, we will obtain an unbiased estimator of the gradient (L’Ecuyer, 1995). Intuitively, since differentiation is a process of limits, the validity of the interchange will relate to the conditions for which it is possible to exchange limits and integrals, in such cases most often relying on the use of the dominated convergence theorem or the Leibniz integral rule (Flanders, 1973; Grimmett and Stirzaker, 2001). The interchange will be valid if the following conditions are satisfied:

  • The measure $p(x; \theta)$ is continuously differentiable in its parameters $\theta$.
  • The product $f(x)p(x; \theta)$ is both integrable and differentiable for all parameters $\theta$.
  • There exists an integrable function $g(x)$ such that $\sup_{\theta} \lVert f(x)\nabla_{\theta}p(x; \theta)\rVert_1 \leq g(x) \;\forall x$.

These assumptions usually hold in machine learning applications, since the probability distributions that appear most often in machine learning applications are smooth functions of their parameters. L’Ecuyer (1995) provides an in-depth discussion on the validity of interchanging integration and differentiation, and also develops additional tools to check if they are satisfied.
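The conditions above are not vacuous. As a sanity check, here is a small sketch (my own, not from the paper) of a standard counterexample where the first condition fails because the support of $p(x; \theta)$ depends on $\theta$: with $p(x; \theta) = \mathrm{Uniform}(0, \theta)$ and $f(x) = x$, we have $\mathbb{E}[f] = \theta/2$, so the true gradient is $1/2$, but blindly interchanging the derivative and the integral (i.e., using the score-function form) yields $-1/2$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0
n = 500_000

# p(x; theta) = Uniform(0, theta), f(x) = x, so E[f] = theta / 2 and the true gradient is 0.5.
x = rng.uniform(0.0, theta, size=n)

# Naive score-function estimator, pretending the interchange were valid:
# grad_theta log p(x; theta) = -1/theta on the support (0, theta).
score_est = np.mean(x * (-1.0 / theta))

print(score_est)   # converges to -0.5, not +0.5: the estimator is biased here
```

The bias is exactly the boundary term that differentiation under the integral sign ignores when the support moves with $\theta$.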

murphyk commented 11 months ago

Thanks for raising this. I will add a comment about this assumption, and cite that paper.