Implicit reparameterization gradients

dustinvtran commented 6 years ago

The idea is simple to implement and well-scoped as part of TF Distributions. I personally like the idea of even having it be the default for Gamma, Beta, and others. The gradient implementations may be tricky. Maybe @mfigurnov has an implementation we can build on? (if open source or google-internal)

axch commented 6 years ago

Wow, that's a really nice trick -- I wish I had thought of it! Re: gradient implementations: The paper discusses deriving them from the CDF using forward-mode AD. That (or reverse-mode) could be a good place to start on CPU, especially if TF is willing to incorporate a C++ AD system (the Stan math library, perhaps?) as a dependency. Is CUDA close enough to C++ that the same AD system could just work?

The bad news is that any AD approach will likely have to accept some error, because differentiating a numeric approximation is worse than numerically approximating a (correct) derivative. The good news is that for basic distributions like Gamma, unit-testing the derivative code is pretty easy: just test against an expensive, high-precision numerical method (like Richardson extrapolation in 64-bit or 128-bit floats).

Hm. I wonder whether it would be possible to construct a better (and still fast) approximation of the true derivative by using the expensive-but-accurate computation as a reference. For instance, can an optimal set of Chebyshev polynomials up to, say, 5th order be found empirically? One could even use the AD as a starting point, and find Chebyshev polynomials to cancel off the difference between the AD results and the accurate results.

davmre commented 6 years ago

I imagine that getting a full-fledged C++ AD system into TF would be quite a heavy lift, but since this approach isn't AD'ing user code, just a small number of well-understood computations (e.g., the gamma CDF), couldn't one just do the AD manually? That is, insofar as forward-mode AD is 'just' running your computation on dual numbers x+eps instead of x, for any given algorithm you should be able to just explicitly do this, grounding out the real and infinitesimal parts in your code.

axch commented 6 years ago

Yes. The question in my mind is whether the computations in question really are few in number and well-understood; but one could certainly start down that path and see how much use one covered before getting tired.

srvasude commented 6 years ago

Right, I think we can override the gradient computations with this explicitly.

The idea is that (talking with @jvdillon) is that distribution can have an override gradient method for samples. That way a user can switch with different ways to override the gradient (and for non reparameterizable distributions, we would error out explicitly.

Re: How would you use a Chebyshev approx here? Are you saying for univariate distributions, use the Inverse CDF transform (support is [0, 1]) and then do some Chebyshev approximation to that (rather than doing implicit differentiation with the CDF)? At some point you'd have to compute the Inverse CDF of your random draw right (which can be expensive)?

axch commented 6 years ago

No, I was just musing about how to get better precision for the derivative of the CDF than you would get by ADing the implementation. Namely, if you have a slow but accurate way to compute some analytically obnoxious function (in this case, f(x) = d/dx CDF_foo(x)|_x), and you have a fast but somewhat inaccurate way to compute it (in this case, applying AD to CDF_foo), is it possible to empirically select coefficients to construct a Chebyshev approximation of the difference, so as to ship

def f(x):
  return AD(CDF_foo)(x) + <some polynomial>(x)

The idea would be to pick the polynomial to minimize the maximum (empirically determined) error between this (which is presumably still fast) and the slow but accurate thing. It's not clear that this idea is workable, and it certainly isn't necessary to get started, but it seemed cute.

To clarify: the slow and accurate thing I have in mind is numerically computing d/dx CDF_foo(x)|_x by Richardson extrapolation in high precision arithmetic (either 64-bit, or, if that's not good enough, busting out a suitable very high precision library).

mfigurnov commented 6 years ago

Thanks for the interest in the work! Hopefully, this will be incorporated into TFP soon :) Let me comment on some of the points. Tl;dr take a look at the appendix, it answers some of your questions.

Regarding an auto-diff system: it's an overkill to bring one just for this feature. The CDF for Gamma is computed using basic arithmetics and Gamma function (see appendix B). So manual forward-mode differentiation is somewhat annoying, but very much doable. This is what we did, basically (except for small tweaks described in appendix B).

Regarding computing the derivative directly. There is a closed-form expression in appendix C for that. However, it involves a hypergeometric function that is tricky to compute. We use it to obtain the ground-truth values.

Regarding the error of derivative: the error of the derivative is close to the error of CDF, so there doesn't seem to be an error accumulation going on.

srvasude commented 6 years ago

@axch: Understood, thanks! Yeah that's a fairly interesting approach (although as you said, it is not clear to me that this completely workable).

mfigurnov commented 6 years ago

Also, take a look at the numbers in table 2. For Gamma, the average derivative dz/dalpha is one (see appendix A), so you cannot really expect the error to go below 1e-7 for float32 and 1e-16 for float64. So the correction term is an interesting idea, but might be unnecessary for practical purposes.

yorkerlin commented 6 years ago

@mfigurnov Could you comment on the difference between the method in Eq. 27 at page 18 of https://arxiv.org/pdf/1206.6679.pdf and your method? Any comments on sparse Gamma distributions when \alpha<1e-6? (see https://arxiv.org/pdf/1509.01631.pdf)

yorkerlin commented 6 years ago

@mfigurnov Any comments on the Generalized inverse Gaussian (GIG) distribution? It seems it is non-trivial to compute the CDF for GIG.

mfigurnov commented 6 years ago

@yorkerlin

Could you comment on the difference between the method in Eq. 27 at page 18 of https://arxiv.org/pdf/1206.6679.pdf and your method?

Thanks for the interest in our work. The difference compared to Salimans&Knowles is that we (i) propose using automatic differentiation to compute the derivative of the CDF and experimentally evaluate the effectiveness of this approach; (ii) present a more general form of this implicit gradient (using standardization functions instead of just the CDF, and also provide an expression for the multivariate case); (iii) highlight the connection between the implicit derivative and the standard reparameterization trick. We also recently became aware that a related approach was developed by Hoffman&Blei "Structured Stochastic Variational Inference". Additionally, we found a 1988 paper by Suri&Zazanis on perturbation analysis https://www.jstor.org/stable/2632185 that proposes both the explicit reparameterization and the eqn. (27) of Salimans&Knowles in Section 4, so the idea has been known for a lot longer than we thought :) We are currently updating the related work section to include and discuss these works.

Any comments on sparse Gamma distributions when \alpha<1e-6? (see https://arxiv.org/pdf/1509.01631.pdf)

The error of our method is actually better for extremely low \alpha than for larger ones. So we do not use any additional approximations.

Any comments on the Generalized inverse Gaussian (GIG) distribution? It seems it is non-trivial to compute the CDF for GIG.

There is a method for approximation of the CDF of this distribution: http://epub.wu.ac.at/1548/ . Since it uses a polynomial approximation, it should be straightforward to differentiate. I am not sure how practical this is, though, since the method seems to have a slow set-up. (Non-generalized) inverse Gaussian Distribution has tractable CDF and tractable derivative of the CDF, so you could definitely use implicit gradients for that distribution.

yorkerlin commented 6 years ago

@mfigurnov Thanks for your message. Nice work :)

About the multivariate case It seems that it is non-trivial to compute the gradient w.r.t. the covariance matrix for a finite mixture of multivariate Gaussians with full correlation if the multivariate CDF transform is used. It is relatively easy to compute the gradient w.r.t. mixing weights in this case. Do you have any experimental result on this?

fritzo commented 6 years ago

In PyTorch we followed an approach similar to what @axch suggests by constructing a minimax numerical approximation to the gradient directly (rather than combining multiple approximations via AD). Here's our write-up https://arxiv.org/abs/1806.01851 and derivations for Gamma and Beta/Dirichlet, in case it helps. IIRC our Gamma doesn't work well for alpha<1e-3 due to loss of precision in the sampler. cc @martinjankowiak

yorkerlin commented 6 years ago

@fritzo nice work! For multivariate Gaussian, it will be great if you can compare your method against the following method? Let q(x)=N(x|\mu,V), where C is the (lower triangular) Choleksy factor of the co-variance matrix V. (C*C^T=V) The gradient w.r.t. C can be computed as below dC = tril(dV*(2*C)), where dV=\nabla_{V} E_{q(x)} {f(x)} can be computed using Eq. 19 at Opper&Archambeau http://www0.cs.ucl.ac.uk/staff/c.archambeau/publ/neco_mo09_web.pdf

For Gaussian model, we use the mentioned method in our paper (see Figure 2 of https://arxiv.org/pdf/1511.00146.pdf) For Gamma model, we use the method proposed by Knowles in another paper. (see Figure 1 of https://arxiv.org/pdf/1703.04265.pdf) I think your proposed method could improve the performance.

yorkerlin commented 6 years ago

@fritzo The mentioned method uses the second order information of the target function f(x). It may not be fair to compare against your method. Note: the REINFORCE method uses the zero order information of f(x) while the re-parametrization method uses the first order information of f(x).

mfigurnov commented 6 years ago

@yorkerlin We don't have results for mixtures of full-covariance multivariate Normals, unfortunately. One way to obtain gradients w.r.t. the covariance matrix could be ancestral sampling: you sample the point from one of the mixture components, and backpropagate to just this component's mean/covariance. I am fairly sure that this estimator is going to be unbiased, but obviously higher variance than if you properly accounted for the posterior mixture probabilities.

martinjankowiak commented 6 years ago

@yorkerlin no, we didn't compare to that (second-order) approach. for one thing because computing hessians in high dimensions can be prohibitively expensive. of course, you can approximate the hessian, in which a natural thing to do is follow the approach taken in https://arxiv.org/abs/1705.07880. cc @fritzo

yorkerlin commented 6 years ago

@martinjankowiak For high dimensional problems, diagonal Gaussian distribution could be used. It is relatively easy to compute the diagonal elements of the Hessian matrix.

mfigurnov commented 6 years ago

Gamma, Beta, Dirichlet, Student's t and inverse Gamma distributions are now fully reparameterized, see the following commits: TensorFlow (https://github.com/tensorflow/tensorflow/commit/b894f6844a0df2b17d5f85f9c5a52f3747a3bec8), TFP (https://github.com/tensorflow/probability/commit/633e4f208c997c0eb8e4bfda8e03db83a06f972d)

The numerical method for the derivative of Gamma samples is incorporated into Eigen: https://bitbucket.org/eigen/eigen/pull-requests/403/derivative-of-the-incomplete-gamma/

Thanks to the TF team for their help! :)

dustinvtran commented 6 years ago

Thanks @mfigurnov! Fantastic work. Closing this issue.

jvdillon commented 6 years ago

What about truncated_normal?

dustinvtran commented 6 years ago

There are many that can be interpreted as being part of this broad issue. It's probably worth having the many other extensions in separate issues to measure progress. Thoughts?

srvasude commented 6 years ago

I'm agreed with Dustin on this. I read this bug as using the trick for some fairly common distributions (Gamma, StudentT, Beta, etc., i.e. what the paper addresses). Closing this bug, and will open new bugs on a case-to-case basis.

yorkerlin commented 5 years ago

You may look at our poser for the ICML workshop on Stein's method, where we weaken the smoothness assumption of the Implicit reparameterization gradients. In our poster, we only focus on the exponential family. However, the idea can be readily extended to general continuous univariate distribution. For multivariate case, you can use Theorem 4 in our poster.

TL;DR: The target function can be locally Lipchitz instead of continuously differentiable. For example, Relu is not continuously differentiable.

yorkerlin commented 5 years ago

@jvdillon For truncated_normal, if the truncated interval is fixed, you can still use that trick but you have to ensure that the boundary condition holds (see Lemma 3 in our poster).

brianwa84 commented 4 years ago

Move the sample inside the tape?

Brian Patton | Software Engineer | bjp@google.com

On Wed, Aug 12, 2020 at 6:20 AM Mossi8 notifications@github.com wrote:

Hi ,

Backpropagation is not working for gamma distribution. Has there been any changes?

concentration = tf.constant(3.0) rate = tf.constant(2.0) dist = tfp.distributions.Gamma(concentration, rate) samples = dist.sample(5) # Shape [5] with tf.GradientTape() as tape: loss = tf.reduce_mean(tf.square(samples)) # Arbitrary loss function Unbiased stochastic gradients of the loss function

grads = tape.gradient(loss, [concentration, rate])

print(grads)

[None, None]

Thanks!!!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/tensorflow/probability/issues/51#issuecomment-672787061, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFJFSI5MEXF26QSJMF6XNZLSAJUG3ANCNFSM4FBLLERQ .

Mossi8 commented 4 years ago

Move the sample inside the tape? Brian Patton | Software Engineer | bjp@google.com … On Wed, Aug 12, 2020 at 6:20 AM Mossi8 @.***> wrote: Hi , Backpropagation is not working for gamma distribution. Has there been any changes? concentration = tf.constant(3.0) rate = tf.constant(2.0) dist = tfp.distributions.Gamma(concentration, rate) samples = dist.sample(5) # Shape [5] with tf.GradientTape() as tape: loss = tf.reduce_mean(tf.square(samples)) # Arbitrary loss function Unbiased stochastic gradients of the loss function grads = tape.gradient(loss, [concentration, rate]) print(grads) [None, None] Thanks!!! — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#51 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFJFSI5MEXF26QSJMF6XNZLSAJUG3ANCNFSM4FBLLERQ .

Hi, It still give None.

This comes from a model I built which samples from a gamma distribution: tfp.distributions.Gamma(concentration=mean,rate=logvar,name='Gamma').sample()

But, I get this error

LookupError: No gradient defined for operation 'gradients/Gamma_1/sample/IdentityN_grad/RandomGammaGrad' (op type: RandomGammaGrad)

However if I sample from a Normal it works just fine.

Thank you!!!

brianwa84 commented 4 years ago

It looks like you are taking a second-order gradient there. That's not supported. First order appears to work fine though, maybe you're missing tape.watch? https://colab.research.google.com/gist/brianwa84/46e9c880245d2fe1905cc782a68be79c/gamma.ipynb

jvdillon commented 4 years ago

Could we fire an assertion if gradient is called through custom grad, eg, by adding a customgrad to the customgrad which throws an exception?

On Thu, Aug 13, 2020 at 6:57 AM Brian Patton notifications@github.com wrote:

It looks like you are taking a second-order gradient there. That's not supported. First order appears to work fine though, maybe you're missing tape.watch?

https://colab.research.google.com/gist/brianwa84/46e9c880245d2fe1905cc782a68be79c/gamma.ipynb

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tensorflow/probability/issues/51#issuecomment-673494767, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIVTNWP65DBBSL3QEKUMIDSAPWNTANCNFSM4FBLLERQ .

brianwa84 commented 4 years ago

The assertion we would fire would look almost identical to the one here, i.e. see the behavior in MixtureSameFamily to disable second order grads. It raises a LookupError. Sure, we could give a better explanation, but I don't think the explanation here is too bad.

Brian Patton | Software Engineer | bjp@google.com

On Mon, Aug 17, 2020 at 4:02 PM Joshua V. Dillon notifications@github.com wrote:

Could we fire an assertion if gradient is called through custom grad, eg, by adding a customgrad to the customgrad which throws an exception?

On Thu, Aug 13, 2020 at 6:57 AM Brian Patton notifications@github.com wrote:

It looks like you are taking a second-order gradient there. That's not supported. First order appears to work fine though, maybe you're missing tape.watch?

https://colab.research.google.com/gist/brianwa84/46e9c880245d2fe1905cc782a68be79c/gamma.ipynb

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub < https://github.com/tensorflow/probability/issues/51#issuecomment-673494767 , or unsubscribe < https://github.com/notifications/unsubscribe-auth/AAIVTNWP65DBBSL3QEKUMIDSAPWNTANCNFSM4FBLLERQ

.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/tensorflow/probability/issues/51#issuecomment-675083547, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFJFSI6R4Z7UZCAB27UVPZDSBGED5ANCNFSM4FBLLERQ .

Mossi8 commented 4 years ago

https://colab.research.google.com/gist/brianwa84/46e9c880245d2fe1905cc782a68be79c/gamma.ipynb

Hi, it is solved now. You were right, I had a persistent=True somewhere, and it was taking the second second order derivatives with gamma reparameterization. And, I was not familiar with the LookupError.

Thank you so much!

danleonte commented 3 years ago

Hi

Is there a way to use the current implementation of the reparametrization trick in

tfp.distributions.Gamma(concentration=mean,rate=logvar,name='Gamma').sample()

for a truncated gamma distribution? Would throwing away samples that are too big work (even if is wasteful)?

brianwa84 commented 3 years ago

Gamma recently got a differentiable quantile function, so you could probably use the approach outlined in #1135.

On Wed, Feb 17, 2021 at 1:42 PM Dan Leonte notifications@github.com wrote:

Hi

Is there a way to use the current implementation of the reparametrization trick in

tfp.distributions.Gamma(concentration=mean,rate=logvar,name='Gamma').sample()

for a truncated gamma distribution? I guess throwing away samples that are too big is one way, but it is rather wasteful.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/tensorflow/probability/issues/51#issuecomment-780766469, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFJFSIZHENDUOYIOM2OCIZTS7QEZHANCNFSM4FBLLERQ .

tensorflow / probability

Implicit reparameterization gradients #51