ritheshkumar95 / energy_based_generative_models

PyTorch code accompanying our paper on Maximum Entropy Generators for Energy-Based Models

why not use KL divergence to estimate mutual information #2

Open bojone opened 5 years ago

bojone commented 5 years ago

As we know, I(X,Z) = KL(p(x,z) || p(x)p(z)). So why do you estimate mutual information with a JSD objective rather than KL maximization?

f-GAN also gives us KL(p(x) || q(x)) = max_T E_{x~p(x)}[T(x)] - E_{x~q(x)}[e^{T(x)-1}]. I think this is a more natural and more reasonable choice than JSD, isn't it?
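For concreteness, the f-GAN KL lower bound I'm referring to would look roughly like this in PyTorch (a minimal sketch; `fgan_kl_lower_bound`, `mi_kl_estimate`, and the `critic` interface are hypothetical names, not code from this repo):

```python
import torch

def fgan_kl_lower_bound(T_p, T_q):
    # f-GAN variational lower bound on KL(p || q):
    #   E_{x~p}[T(x)] - E_{x~q}[exp(T(x) - 1)]
    # T_p, T_q: critic outputs T(x) on samples drawn from p and q.
    return T_p.mean() - torch.exp(T_q - 1).mean()

def mi_kl_estimate(critic, x, z):
    # Estimate I(X, Z) = KL(p(x,z) || p(x)p(z)) with the bound above.
    joint = critic(x, z)                                  # pairs from p(x, z)
    shuffled = critic(x, z[torch.randperm(z.size(0))])    # approx. p(x)p(z)
    return fgan_kl_lower_bound(joint, shuffled)
```

Shuffling `z` within the batch is the usual way to approximate samples from the product of marginals p(x)p(z).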

ritheshkumar95 commented 5 years ago

Of the different divergences you can use within the f-GAN formulation, JSD worked better because it's bounded. KL is unbounded and should not be used to perform any MI maximization (see the MINE and DeepInfoMax papers).
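For reference, the bounded JSD-style objective (as used in DeepInfoMax) can be sketched as follows; `jsd_mi_lower_bound` and the `critic` interface are illustrative names, not taken from this repository:

```python
import torch
import torch.nn.functional as F

def jsd_mi_lower_bound(critic, x, z):
    # JSD-style MI objective (DeepInfoMax-style). Both terms are passed
    # through softplus, so the objective stays bounded even when the critic
    # outputs become large.
    joint = critic(x, z)                                  # pairs from p(x, z)
    shuffled = critic(x, z[torch.randperm(z.size(0))])    # approx. p(x)p(z)
    return (-F.softplus(-joint)).mean() - F.softplus(shuffled).mean()
```

Because both terms go through a softplus, this objective cannot blow up the way the exponential term in the KL bound can.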

bojone commented 5 years ago

I know what MINE and DeepInfoMax do in their own papers. The problem is: if your identity H(X) = I(X,Z) holds, then I(X,Z) is bounded, and so are KL(p(x,z) || p(x)p(z)) and max_T E_{(x,z)~p(x,z)}[T(x,z)] - E_{(x,z)~p(x)p(z)}[e^{T(x,z)-1}].

Actually, I have tried it and it also works (no NaN loss). So the implication "KL is unbounded ==> it should not be used to perform any MI maximization" is not absolutely right.

KL follows directly from the derivation, while JSD is only a bounded analogue of KL. Therefore, if both work and JSD really works better, we have to explain why.

ritheshkumar95 commented 5 years ago

If you look at the appendix of the MINE paper, it says that maximizing MI between the input and output of the GAN generator works only if they perform adaptive gradient clipping. They too noticed that it explodes, because the second, log-sum-exp term in their formulation blows up early in training. (We discussed this with the MINE authors as well.)
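To make the instability concrete, here is a rough sketch of the Donsker-Varadhan bound used in MINE (names and the `critic` interface are assumptions, not this repo's code); the second term is the log-sum-exp that can explode when the critic scores grow large early in training:

```python
import math
import torch

def mine_dv_bound(critic, x, z):
    # Donsker-Varadhan bound used in MINE:
    #   E_{p(x,z)}[T(x,z)] - log E_{p(x)p(z)}[exp(T(x,z))]
    # The critic is assumed to return one scalar score per pair, shape (N,).
    joint = critic(x, z)
    shuffled = critic(x, z[torch.randperm(z.size(0))])
    # log E[exp(T)] computed as a log-sum-exp; if the critic scores grow
    # large early in training, exp(T) overflows, hence the need for tricks
    # such as adaptive gradient clipping or a bias-corrected moving average.
    log_mean_exp = torch.logsumexp(shuffled, dim=0) - math.log(shuffled.size(0))
    return joint.mean() - log_mean_exp
```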

So we chose to use JSD simply because it was more stable in practice. Alternatively, you can use the NCE version of MI maximization as well.
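A minimal sketch of the NCE (InfoNCE) alternative mentioned here, assuming a critic that produces an (N, N) score matrix with positives on the diagonal (the function name and interface are illustrative, not from this repo):

```python
import math
import torch
import torch.nn.functional as F

def infonce_lower_bound(scores):
    # InfoNCE ("NCE") MI lower bound. `scores` is an (N, N) matrix with
    # scores[i, j] = T(x_i, z_j); the diagonal holds the positive (joint) pairs.
    n = scores.size(0)
    targets = torch.arange(n, device=scores.device)
    # cross_entropy implements -E_i[log softmax_j(scores[i, :])[i]]
    nll = F.cross_entropy(scores, targets)
    return math.log(n) - nll
```

The InfoNCE bound is capped at log N, which is what makes it stable, though it also means the estimate saturates when the true MI exceeds log N.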

bojone commented 5 years ago

Let us compare your new paper with DeepInfoMax.

In DeepInfoMax, the main goal is to extract good features by maximizing MI. And we know MI = KL(p(x,z) || p(x)p(z)) is just a divergence between p(x,z) and p(x)p(z), so we can replace KL with JSD, because JSD is simply another divergence.

But in this paper, your goal is to estimate an energy function. The energy function has a precise, quantitative definition, and your derivation says the generator loss is -I(X,Z) + E(X), which is also a precise, quantitative expression.

Now the issue is that JSD is not an approximation of KL; in other words, we have no relation like KL = JSD + (small correction). So if we want to replace KL with JSD and combine it with E(X), we need more than that to explain why it works. Numerical stability is merely an advantage, not a justification.

ritheshkumar95 commented 5 years ago

Well, the JSD, or even the MI, doesn't come out of nowhere. We want to minimize KL(P_G || P_E), since we're training a generator to approximate the energy function for efficient sampling.

The entropy of P_G and the energy term E_{x~P_G}[E(x)] arise from expanding that formula. In our case, MI = entropy, since the transformation is deterministic. Like I've mentioned several times, using JSD instead of KL is purely for ease of training; there isn't any obvious derivation from the theory.
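For reference, a sketch of the expansion being referred to, writing the energy model as P_E(x) = e^{-E(x)} / Z:

```latex
\mathrm{KL}(P_G \,\|\, P_E)
  = \mathbb{E}_{x \sim P_G}[\log P_G(x)] - \mathbb{E}_{x \sim P_G}[\log P_E(x)]
  = -H(P_G) + \mathbb{E}_{x \sim P_G}[E(x)] + \log Z
```

Since log Z does not depend on the generator, the generator objective reduces to E_{x~P_G}[E(x)] - H(P_G); and because x = G(z) is deterministic, H(X | Z) = 0, so H(P_G) = I(X;Z), which recovers the -I(X,Z) + E(X) form mentioned above.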

1032864600 commented 4 years ago

Hi, recently I have wanted to use MINE to estimate the KL between a Gaussian distribution and a Laplace distribution. I understand how to prepare the inputs to estimate KL(P(X,Y) || P(X)P(Y)), but I don't know how to prepare them to estimate KL(P(X) || P(Y)). I chose a Gaussian and a Laplace distribution and just drew N samples from each; do I need to shuffle their sample values?