mattjj / svae

code for Structured Variational Autoencoders

KL(q(x)||P(x|theta)) make things worse? #4

Closed Nat-D closed 8 years ago

Nat-D commented 8 years ago

Hi Matt,

I'm not sure if this is the right place to ask questions about your paper, but anyway: I implemented a version of a latent Gaussian mixture model (normal-gamma prior for the Gaussians). I found that the gradient from the KL loss term, E_q[KL(q(x) || p(x|theta))], makes the results worse. It makes the latent space look like a single Gaussian, and the generator network can't learn to reconstruct at all. I've managed to make it work sometimes, but only without the KL loss. What I'm wondering is: is this the behaviour you saw in your experiments?

Sub-question: I'm using a learning rate of around 0.1-0.2 for updating the global parameters with the natural gradient, and 0.001 for the neural network recogniser and generator. Do you optimise the two parts separately or together? The theory in the paper seems to suggest the same learning rate, but I'm not sure. Sorry, I should have read through your code to answer these questions, but I'm a bit clueless at reading code in general.

my code if anyone interested https://github.com/Nat-D/SVAE-Torch

Thanks a lot in advance, Nat

mattjj commented 8 years ago

Thanks for your interest, and for trying to replicate stuff! This seems like a good place to ask questions.

The step sizes can be different, and I think it makes sense for the natural gradient step sizes to be bigger because they're corrected for a kind of curvature in the parameterization. That's why this code has two different step sizes for the two coordinate blocks. Having the step sizes be different for different coordinates is exactly what things like AdaGrad and Adam try to learn. In general, it's common to optimize using "gradient-related" updates, meaning update steps that have positive inner product with the gradient (or equivalently update steps that can be written as some positive definite matrix multiplying the gradient).
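Roughly, one optimization step with the two step sizes looks something like this (just a sketch with made-up function names, not this repo's actual API; the parameters are treated as flat arrays and the gradient functions are assumed to return ascent directions on the ELBO):

```python
def svae_step(global_params, net_params, natgrad_fn, grad_fn,
              nat_step=1.0, net_step=1e-3):
    # Natural-gradient block: the natural gradient is already corrected for
    # the curvature of the exponential-family parameterization, so a larger
    # step size is usually safe.
    global_params = global_params + nat_step * natgrad_fn(global_params, net_params)
    # Neural-net block: ordinary gradient and a small step size (plain SGD
    # here for brevity; Adam is what I actually use on the network weights).
    net_params = net_params + net_step * grad_fn(global_params, net_params)
    return global_params, net_params
```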

No, that's not the behavior I've seen with the GMM. When you say the gradient from the prior term (i.e. the KL term) made things worse, do you mean it made the objective value worse, or it just made your results worse (e.g. generative samples or reconstruction samples)?

If the objective gets worse, then there must be a bug. If the results just look worse somehow but the objective is going in the correct direction, then maybe the prior is too strong. I did all of the experiments including the prior term (i.e. the KL term).

There is a bug in this python implementation: the natural gradient computed here only includes the first of two terms for the natural gradient, i.e. it's missing the second term in Eq. (4) of the paper. I found my experiments (including the GMM) worked fine without that term (i.e. the latent variable model still fit well, since this term doesn't affect the generator network or recognition network gradients), but if your objective value is getting worse then it could be the culprit. It's easy to compute using autograd but I haven't had a chance to fix this implementation yet. (I have some other things to clean up here, too, but I had a busy summer of job-hunting and just started a new job this week.)

Does your code work if you set the latent variable model to be a standard Gaussian, i.e. p(x) = N(x | 0, I) ? That should recover a standard variational autoencoder, and would check whether there are any bugs in training the generator/recognition networks.
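For reference, with p(x) = N(x | 0, I) and a diagonal-Gaussian recognition network, the E_q[KL(q(x) || p(x))] term has the usual closed form from the standard VAE. A quick sketch (illustrative names, not this repo's API):

```python
import numpy as np

def kl_to_standard_normal(mu, log_sigma):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions.
    # mu and log_sigma are (batch, latent_dim) outputs of the recognition net.
    sigma2 = np.exp(2.0 * log_sigma)
    return 0.5 * np.sum(sigma2 + mu**2 - 1.0 - 2.0 * log_sigma, axis=1)
```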

Nat-D commented 8 years ago

Hi Matt,

Thank you for your reply. I think you are right about the prior being too strong. After setting the prior parameter for the precision of the mean to be much smaller, the reconstructions now look okay. However, the GMM seems to be conservative and doesn't cluster the data into groups as I was expecting. I think I will need to find the right set of prior parameters to make it work.

[screenshot]

Figure 1: Results with the KL loss. Left: latent space with the GMM; right: the data (black) and the reconstructions.

[screenshot]

Figure 2: Results without the KL loss. The latent space looks more "unstructured", but the clustering seems okay.

Do you set a static prior, or are you optimising the prior parameters as well? It seems these parameters are crucial for learning. I might try to optimise them somehow.

Thanks again, Nat

mattjj commented 8 years ago

I'm fixing the prior, meaning I have some fixed normal-inverse-Wishart parameters (corresponding to fixed normal-Gamma parameters) and Dirichlet concentration parameters. I'm also using resnets for the encoder and decoder and initializing so that the transformation starts out very close to the identity. Finally, I'm initializing the clusters with diversity in their means and with reasonable scales. I think you're right that the prior's hyperparameters (and the initialization) can be important.
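Here's roughly what I mean by a near-identity initialization (an illustrative numpy sketch, not the actual code in this repo): the residual function's last layer starts at zero, so y = x + f(x) begins as the identity map (this plain residual form assumes input and output widths match).

```python
import numpy as np

def init_near_identity_resnet(sizes, scale=1e-2):
    # MLP residual function f with small random weights and a zero final
    # layer, so that x + f(x) starts out (essentially) as the identity map.
    params = []
    for i, (m, n) in enumerate(zip(sizes[:-1], sizes[1:])):
        W = np.zeros((m, n)) if i == len(sizes) - 2 else scale * np.random.randn(m, n)
        params.append((W, np.zeros(n)))
    return params

def resnet_apply(params, x):
    h = x
    for W, b in params[:-1]:
        h = np.tanh(np.dot(h, W) + b)
    W, b = params[-1]
    return x + np.dot(h, W) + b   # ~ identity at initialization
```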

Clustering in the latent space should work much better than in your Fig 1, so I think there's still a problem.

The GMM example in experiments/gmm_svae_synth.py is working in the fixing-things branch. Here's what the initialization looks like, with the data space on the left and latent space on the right:

[image]

You can see that I'm giving the model 15 mixture components but a sparsifying Dirichlet prior. Giving the model more components than necessary helps a lot, since it only needs a few clusters to be initialized decently and then it can kill off the others.
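Concretely, the setup is along these lines (illustrative values, not the exact numbers in gmm_svae_synth.py): the Dirichlet concentration is set well below 1, so the posterior prefers sparse mixture weights and can shut off the components it doesn't need.

```python
import numpy as np

num_components = 15                      # more clusters than the data needs
alpha = 0.05 * np.ones(num_components)   # concentration << 1 => sparsifying

# Weights drawn from this prior concentrate their mass on a few components:
print(np.round(np.random.dirichlet(alpha), 3))
```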

Here's after about 50 updates (50 data points per minibatch, 1 Monte Carlo sample, step sizes 10 and 1e-2, 100 data points per cluster in the dataset):

[image]

And here's after about 250:

[image]

And after about 550:

[image]

It'll slowly keep straightening things out in the latent space; setting the step sizes and minibatch sizes better can probably make that happen faster.

What optimizer are you using? I'm using Adam on the neural net components of the parameter vector.

mattjj commented 8 years ago

Keep in mind those "level set" plots I'm showing don't actually represent the density in the data space very well (i.e. on the left side). The hex plots we used in the paper are much better for that, since they actually track differential volume rather than just a single line, but these plots are much faster to generate.

mattjj commented 8 years ago

I'm refreshing my memory about some of this code. I made the same plots using experiments/gmm_svae_synth_plot_pdfs.py just for better visualization. Here's after about 750 iters in the latent space:

[image]

and data space:

[image]

I'd like to figure out why this example is working better than in your torch code. I don't think removing the KL term is the right solution, though maybe it could provide a good initialization (i.e. fit a vanilla autoencoder first before adding the variational Bayesian part in).

Nat-D commented 8 years ago

I see. The choice of a resnet with identity initialisation is really smart. I think it makes your training dynamics really stable: all the algorithm needs to do is slowly fine-tune the network to match the prior, while the GMM learns against a fairly stable potential.

I will try the resnet and that initialisation, and hopefully my results will be a bit more similar. I used a normal MLP with the default Torch weight initialisation for the results in my previous post, and the training dynamics are quite unstable: it 'blows up' from time to time before it settles.

I don't know much about optimisation algorithms, but I used Adam for the neural networks and SGD for updating the global variational parameters, which seems to work better than using Adam for the global variational parameters.

Nat-D commented 8 years ago

[screenshot]

Works like magic! Thanks a lot, Matt.

Nat-D commented 8 years ago

I have another question, though; should I open a new GitHub issue? It's about how to deal with a latent space of lower dimension, since a resnet block can only handle inputs and outputs of the same size.

duvenaud commented 8 years ago

To deal with different-sized latent and data spaces, we made the "res net" be:

y = f(x) + Ax

where A is a matrix. So the model is just a neural net plus linear regression. The idea is that even if the linear part is initialized randomly, it'll still learn much quicker than the neural net part.
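A quick numpy sketch of that form (illustrative, not the repo's exact code): f is a small MLP whose last layer starts at zero, and A is a randomly initialized linear map, so y = f(x) + Ax starts out close to a plain linear map even when the latent and data dimensions differ.

```python
import numpy as np

def init_linear_resnet(in_dim, out_dim, hidden=50, scale=1e-2):
    W1 = scale * np.random.randn(in_dim, hidden)   # small random first layer
    b1 = np.zeros(hidden)
    W2 = np.zeros((hidden, out_dim))               # zero last layer: f(x) = 0 at init
    b2 = np.zeros(out_dim)
    A  = scale * np.random.randn(in_dim, out_dim)  # the linear-regression part
    return W1, b1, W2, b2, A

def linear_resnet_apply(params, x):
    W1, b1, W2, b2, A = params
    return np.dot(np.tanh(np.dot(x, W1) + b1), W2) + b2 + np.dot(x, A)  # f(x) + A x
```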

Nat-D commented 8 years ago

Ah, neat! Thanks, guys. Awesome work, by the way.

mattjj commented 8 years ago

Thanks, @Nat-D! We're really excited that you dug in and wrote your own implementation.