rdevon / DIM

Deep InfoMax (DIM), or "Learning Deep Representations by Mutual Information Estimation and Maximization"

Samples from "Training a generator by matching to a prior implicitly"? #3

Closed bojone closed 6 years ago

bojone commented 6 years ago

I want to see some random samples from your new generative models, especially ones trained on the CelebA or LSUN datasets. The random samples trained on Tiny ImageNet don't seem very good, actually~

rdevon commented 6 years ago

So for unconditional Tiny ImageNet, I haven't seen anyone do better so far.

Funny, the CelebA and LSUN samples seem to have been left out of the pdf; we'll update it at the next opportunity. https://imgur.com/a/4qTcghG

bojone commented 6 years ago

Wonderful! Is there any released code for it?

In my opinion, its success seems incredible. It seems to construct the generator G via z = E(G(z)). Can you discuss this new generative approach in more detail?

rdevon commented 6 years ago

Since you're asking, I'll add this code soon.

So the main motivation behind the generative model was that good discriminators tend to produce score histograms that resemble 1-d Gaussians with some overlap. In some sense this is expected, as we want the discriminator-defined features (and hence the scores) of the real and fake distributions to overlap in order to ensure meaningful gradients for the generator. While this hypothesis is validated by our experiments, if I were to do it again, I'd try to make a more direct connection to noise contrastive estimation (Gutmann 2010).
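
To give a concrete picture, here's a heavily simplified sketch of the kind of setup being discussed (illustrative only: the score targets, losses, and architectures below are stand-ins I'm using for exposition, not the exact ones from the paper or this repo). The encoder pushes real and fake scores toward two overlapping Gaussian targets, and the generator is trained purely through those scores, so it matches the target/prior implicitly:

```python
# Illustrative sketch only: a simplified stand-in for the setup described above,
# not the exact objective, targets, or architecture from the paper or this repo.
import torch
import torch.nn as nn

latent_dim, img_ch = 64, 3

E = nn.Sequential(                          # encoder / discriminator: image -> 1-d score
    nn.Conv2d(img_ch, 32, 4, 2, 1), nn.ReLU(),
    nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
    nn.Flatten(), nn.Linear(64 * 8 * 8, 1),
)
G = nn.Sequential(                          # generator: z -> 32x32 image
    nn.Linear(latent_dim, 128 * 4 * 4), nn.ReLU(),
    nn.Unflatten(1, (128, 4, 4)),
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
    nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
    nn.ConvTranspose2d(32, img_ch, 4, 2, 1), nn.Tanh(),
)
opt_E = torch.optim.Adam(E.parameters(), lr=1e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)

def score_nll(s, mu):
    # Gaussian NLL of the scores under N(mu, 1), up to an additive constant.
    return 0.5 * (s - mu) ** 2

# Stand-in data: replace with a real CelebA / LSUN / Tiny ImageNet loader.
data_loader = [torch.randn(16, img_ch, 32, 32).clamp(-1, 1) for _ in range(10)]

for x_real in data_loader:
    z = torch.randn(x_real.size(0), latent_dim)
    x_fake = G(z)

    # Encoder: push real scores toward N(+1, 1) and fake scores toward N(-1, 1).
    # The two targets overlap, which gives the score-histogram overlap described above.
    loss_E = score_nll(E(x_real), +1.0).mean() + score_nll(E(x_fake.detach()), -1.0).mean()
    opt_E.zero_grad(); loss_E.backward(); opt_E.step()

    # Generator: trained only through the encoder's scores; it tries to make
    # E(G(z)) match the "real" target, i.e. it matches the prior implicitly.
    loss_G = score_nll(E(x_fake), +1.0).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```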

bojone commented 6 years ago

Let x be one sample (an image), and let d be the size of x.

What if the encoder only uses part of x? Say z = E(x) = E(x[:d/2]), i.e. the weights on x[d/2:] are all zero.

Because the dimension of z is much smaller than that of x, this is possible. In other words, we want to encode x into N(0, 1), but the encoder only encodes x[:d/2] into N(0, 1).

In this case, E(x) does not depend on the whole of x, so if the generator G is constructed via z = E(G(z)), then G(z) does not need to generate a complete sample (image).

This is what confuses me.
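
To make the concern concrete, here is a toy example of the kind of degenerate encoder I mean (purely hypothetical, just for illustration):

```python
# Hypothetical toy encoder illustrating the concern: it literally ignores the
# bottom half of the image, so E(x) depends only on x[..., :d//2, :].
import torch
import torch.nn as nn

class HalfEncoder(nn.Module):
    def __init__(self, img_ch=3, img_size=32, latent_dim=64):
        super().__init__()
        self.half = img_size // 2
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(img_ch * self.half * img_size, latent_dim),
        )

    def forward(self, x):
        top = x[:, :, : self.half, :]   # only the top-half rows are used;
        return self.net(top)            # weights on the bottom half are effectively zero

E = HalfEncoder()
x = torch.randn(8, 3, 32, 32)
x_other_bottom = torch.cat([x[:, :, :16, :], torch.randn(8, 3, 16, 32)], dim=2)
# Changing the bottom half changes nothing about the code z = E(x):
assert torch.allclose(E(x), E(x_other_bottom))
```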

bojone commented 6 years ago

By the way, the JS divergence estimation in f-GAN is actually equivalent to noise contrastive estimation.
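
Concretely, what I mean is that the discriminator objective

$$\max_D \; \mathbb{E}_{x \sim p}[\log D(x)] + \mathbb{E}_{x \sim q}[\log(1 - D(x))]$$

is exactly the NCE logistic-classification objective of Gutmann & Hyvärinen (2010), with the generator distribution $q$ playing the role of the noise distribution (assuming equal numbers of data and noise samples), and at the optimal $D^*(x) = p(x) / (p(x) + q(x))$ its value is $2\,\mathrm{JSD}(p \,\|\, q) - \log 4$, which is, up to constants, the JSD estimate used in f-GAN.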

rdevon commented 6 years ago

Re JSD and NCE: yes! You are quite correct, and we have actually already added this insight to the most recent version of the paper.

As for your example, what incentive does the encoder / discriminator have to zero out the weights corresponding to half of the image? In this case, the encoder is just trying to act like a conditional generator with two Gaussian targets, and it's not trained adversarially w.r.t. the image generator. The encoder objective is closer to entropy maximization, but in a way such that it must use similar features across the two values of the conditioning variable (real / fake).

I agree that the encoder doesn't need to behave well w.r.t. the input space or any structure of the input in order to fit the two Gaussians. In theory a single pixel could suffice. So the fact that this works might be the result of the inductive bias in the encoder and the bi-modal objective. I think it might be worth looking a little more closely at how the generator might encourage the discriminator to learn to encode meaningful structure of the input in this case, and I think NCE might provide a nice framework for this analysis (e.g., thinking of the generator as an adaptive negative sampling distribution).
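
To spell out what I mean by an adaptive negative sampling distribution, here is the usual discriminator update written in NCE form (a generic sketch, nothing specific to this repo's code):

```python
# Generic sketch: the standard GAN discriminator loss written as NCE, where the
# generator supplies the negatives. In classical NCE the noise distribution is
# fixed; in a GAN it adapts as the generator trains. Nothing repo-specific here.
import torch
import torch.nn.functional as F

def nce_discriminator_loss(D, x_data, x_negatives):
    # Binary classification: data samples are positives, generator samples are
    # negatives drawn from an adaptive "noise" distribution.
    logits_pos = D(x_data)
    logits_neg = D(x_negatives)
    return (F.binary_cross_entropy_with_logits(logits_pos, torch.ones_like(logits_pos))
            + F.binary_cross_entropy_with_logits(logits_neg, torch.zeros_like(logits_neg)))
```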

On that note, re the connection of JSD and NCE, they should have called GAN deep adaptive noise contrastive estimation (DANCE).

bojone commented 6 years ago

I think I need more time to understand it.

I am looking forward to seeing your further description of it. Maybe you could incorporate it into a probabilistic framework.

Anyway, I think it is a good and promising idea.