Open pierremac opened 6 years ago
Hi!
Thanks for the questions.
Regarding "implicit encoder" --- this was written for some experiments we did with stochastic encoders. However, eventually all our attempts to train WAE with stochastic encoders ended up with deterministic encoders, i.e. the encoders preferred to reduce their variance to zero. This was partially reported in one of our follow-up papers (Paul Rubenstein, Ilya Tolstikhin, "On the latent space of Wasserstein autoencoders"). How to train stochastic encoders with WAE while preserving the noise structure remains an interesting open problem.
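To make the collapse concrete, here is a minimal sketch (PyTorch-style pseudocode, not the actual TensorFlow code of this repo) of a Gaussian stochastic encoder. "Collapsing to a deterministic encoder" simply means that training pushes the log-variance head towards very negative values, so the sampling noise effectively vanishes:

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Stochastic encoder q(z|x) = N(mu(x), diag(sigma(x)^2))."""

    def __init__(self, x_dim, z_dim, h_dim=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.log_var = nn.Linear(h_dim, z_dim)  # learned per-input noise level

    def forward(self, x):
        h = self.body(x)
        mu, log_var = self.mu(h), self.log_var(h)
        # Reparameterized sample: z = mu + sigma * eps, eps ~ N(0, I).
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return z, mu, log_var

# What we observed: during WAE training log_var gets driven towards -inf,
# so z ~= mu(x) and the encoder becomes effectively deterministic.
```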
Regarding your first question, indeed, we lose any sort of guarantee by relaxing the equality constraint with the penalty. There is a paper called "Sinkhorn autoencoders" which shows that, roughly, using a simple triangle inequality you can prove that the WAE objective with a Wasserstein divergence in the latent space provides an upper bound on the original transport distance in the input space. I don't know of any other results on that topic. It would be interesting to find out more about it!
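Sketching the argument from memory (so please check the Sinkhorn autoencoders paper for the exact statement and constants): with the decoder G, P_G = G_\# P_Z, the aggregated posterior Q_Z, and G assumed gamma-Lipschitz, using the p-Wasserstein distance W_p in both the input and the latent space:

```latex
\begin{align*}
W_p(P_X, P_G)
  &\le W_p(P_X, G_\# Q_Z) + W_p(G_\# Q_Z, G_\# P_Z)
     && \text{(triangle inequality)} \\
  &\le \Big( \mathbb{E}_{P_X} \mathbb{E}_{Q(Z|X)} \|X - G(Z)\|^p \Big)^{1/p}
       + \gamma \, W_p(Q_Z, P_Z)
     && \text{($G$ is $\gamma$-Lipschitz)}
\end{align*}
```

So the relaxed objective --- reconstruction cost plus a Wasserstein penalty on the latent codes --- upper-bounds the transport distance in the input space.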
Best wishes, Ilya
Thank you for the very fast reply, Ilya! It looks like I had missed the bit about the implicit encoder in the "On the latent space of WAE" paper, my bad. Thanks for the pointer! I guess it makes sense that the stochastic encoders would converge to deterministic ones. I'm also curious about how this plays out over the course of training. Does it change the training dynamics? Does it maybe lead to faster or more stable training?
For the second part, thanks for the pointer. That doesn't describe what's happening with perfect accuracy, but this triangle inequality is at least a very good start!
And also thank you (and your co-authors) for writing such a beautiful paper and those smaller follow-ups that really make me feel smarter after I read them. :)
Indeed, what we observed in Paul Rubenstein's paper is that even though the stochastic encoders decide to drop the variance (i.e. converge to deterministic ones), the resulting deterministic encoders are different from those you would obtain by training plain deterministic encoders from the start. I think this is an interesting topic to look into.
Thank you very much for your kind words! And good luck with your research as well!
Hello,
Correct me if I'm wrong, but my understanding is that the simplified expression of the Wasserstein distance obtained in Theorem 1 relies heavily on the hypothesis that the distribution of the latent codes matches the prior exactly. With the necessary relaxation of this constraint, however, that hypothesis no longer holds. Do you have any sense of what happens when the constraint is "violated too much" (e.g. lambda is too small)? I haven't had time to run an empirical study and can't wrap my head around what it implies theoretically. Any insight to share?
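For concreteness, the two objectives I have in mind are the following (writing them down from memory, so my notation may not match the paper exactly): the constrained form of Theorem 1 and the penalized relaxation that is actually optimized,

```latex
\begin{align*}
\text{(Theorem 1, constrained)} \quad
  & W_c(P_X, P_G) = \inf_{Q(Z|X):\, Q_Z = P_Z}
      \mathbb{E}_{P_X} \mathbb{E}_{Q(Z|X)} \big[ c(X, G(Z)) \big] \\
\text{(relaxed, optimized in practice)} \quad
  & D_{\mathrm{WAE}}(P_X, P_G) = \inf_{Q(Z|X)}
      \mathbb{E}_{P_X} \mathbb{E}_{Q(Z|X)} \big[ c(X, G(Z)) \big]
      + \lambda \, \mathcal{D}_Z(Q_Z, P_Z)
\end{align*}
```

My question is essentially what can still be said about the left-hand side when the penalty term $\mathcal{D}_Z(Q_Z, P_Z)$ remains large at the end of training.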
Also, in your implementation, I noticed there is an "implicit" noise model for the encoder. I understand that the noise is parameterized by a neural network that is learned jointly during WAE training, but could you give a bit more insight into it? I can't find any reference to it in the WAE paper or in any of the follow-ups I know of. Any pointer?
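Just to check my reading of the code, here is a minimal sketch (PyTorch pseudocode, my own reconstruction of the idea rather than the code from this repo) of what I would call an implicit encoder, where the noise is fed through the network so the shape of q(z|x) is itself learned:

```python
import torch
import torch.nn as nn

class ImplicitEncoder(nn.Module):
    """Implicit q(z|x): z = f(x, eps) with eps ~ N(0, I).

    Because the noise passes through the network, the conditional
    distribution of z given x is shaped by the learned weights and
    need not be Gaussian (unlike a mean/variance parameterization).
    """

    def __init__(self, x_dim, z_dim, noise_dim=8, h_dim=128):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(x_dim + noise_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, z_dim),
        )

    def forward(self, x):
        # Sample fresh noise for every input and concatenate it to x.
        eps = torch.randn(x.size(0), self.noise_dim, device=x.device)
        return self.net(torch.cat([x, eps], dim=1))
```

Is that roughly what the "implicit" option does?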
Thanks.