pierremac opened this issue 6 years ago
I think the linear layer refers to ψ(.) in eq. (6) in the paper.
You are right. Thank you!
I have to say that it's been working really well for me without that linear layer. Of course, it no longer matches the theory behind it exactly, even though it still makes sense functionally.
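For concreteness, the output head in question corresponds to f(x, y) = ψ(φ(x)) + ⟨embed(y), φ(x)⟩ from eq. (6). A minimal sketch, assuming PyTorch and placeholder names (the repository itself is written in Chainer):

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Output head of a projection discriminator, eq. (6):
    f(x, y) = psi(phi(x)) + <embed(y), phi(x)>.
    Setting use_psi=False drops the unconditional linear term discussed above.
    Illustrative PyTorch sketch, not the repository's Chainer code."""
    def __init__(self, feat_dim, n_classes, use_psi=True):
        super().__init__()
        self.embed = nn.Embedding(n_classes, feat_dim)          # v_y
        self.psi = nn.Linear(feat_dim, 1) if use_psi else None  # psi(.)

    def forward(self, phi_x, y):
        out = (self.embed(y) * phi_x).sum(dim=1, keepdim=True)  # projection term
        if self.psi is not None:
            out = out + self.psi(phi_x)                         # linear term
        return out

# usage: phi_x is the feature vector produced by the critic body
head = ProjectionHead(feat_dim=128, n_classes=10, use_psi=False)
scores = head(torch.randn(8, 128), torch.randint(0, 10, (8,)))
```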
@pierremac I wonder if you have tried any normalization tricks without conditioning? I'm currently having trouble training WGAN-GP with a projection discriminator: the training loss (discriminator loss + GP term) explodes every so often (about every 1-3k iterations), and I'm trying to find the reason. One of the differences between my implementation and the authors' is the normalization tricks. I have tried batch norm, layer norm, and instance norm, all of which have the same exploding-loss problem. Have you ever experienced the same problem? You said "...regardless of whether I use conditional batch norm, conditional layer norm or more vanilla concatenation of the conditions in the generator," so I assume you have tried many conditional normalization tricks. Have you tried unconditional normalization tricks?
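In case it helps with debugging the exploding loss, a standard WGAN-GP penalty for a conditional critic looks roughly like the sketch below (hypothetical PyTorch; `critic(x, y)` and all names are placeholders, not code from this repository):

```python
import torch

def gradient_penalty(critic, real, fake, labels, gp_weight=10.0):
    # Standard WGAN-GP penalty (Gulrajani et al.) for a critic that takes (x, y).
    # Illustrative sketch; names and the conditional signature are assumptions.
    batch = real.size(0)
    eps = torch.rand(batch, *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    scores = critic(interp, labels)
    grads, = torch.autograd.grad(outputs=scores.sum(), inputs=interp,
                                 create_graph=True)
    grad_norm = grads.reshape(batch, -1).norm(2, dim=1)
    return gp_weight * ((grad_norm - 1.0) ** 2).mean()
```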
Yes, I trained both conditional (with ACGAN and projection) and unconditional ("vanilla" WGAN-GP) models, and I get similar results no matter what kind of normalization I use. I haven't tried instance norm. Layer norm often yields somewhat worse results (it trains well, but seems to converge to worse saddle points for whatever reason). In fact, I get the best results without any kind of normalization. But for the conditional case, I still get better results with conditional batch norm (rather than conditional layer norm, or no normalization and concatenating the labels to the input noise).
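For reference, a minimal conditional batch norm for a fully connected generator might look like the sketch below (hypothetical PyTorch, with per-class gain and bias on top of a parameter-free batch norm; not the repository's implementation):

```python
import torch.nn as nn

class ConditionalBatchNorm1d(nn.Module):
    # Conditional batch norm: per-class gain and bias over an affine-free BatchNorm.
    # Illustrative PyTorch sketch for fully connected (2-D) activations.
    def __init__(self, num_features, n_classes):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_features, affine=False)
        self.gain = nn.Embedding(n_classes, num_features)
        self.bias = nn.Embedding(n_classes, num_features)
        nn.init.ones_(self.gain.weight)   # start as identity scaling
        nn.init.zeros_(self.bias.weight)  # start with no shift

    def forward(self, x, y):
        return self.gain(y) * self.bn(x) + self.bias(y)
```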
I have to say that different datasets yield very different behaviors in terms of stability, even though I normalize them in the same way. I can't figure out why, though; I really have to tune the learning rate and the batch size to get somewhat stable training. One last important thing I forgot to mention, which should have a big impact on normalization, is that both my critic and generator networks are fully connected. My understanding is that fully connected networks benefit less from normalization than CNNs do.
And I realize I forgot something that could really be useful to you. For me, when it comes to the stability of training, the game changer has been switching from Adam to AMSGrad (from a recent paper by Reddi et al.), which fixes some convergence issues of Adam. I used this implementation: https://github.com/taki0112/AMSGrad-Tensorflow
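(For anyone on PyTorch rather than TensorFlow, AMSGrad is also available as a flag on Adam; the sketch below uses placeholder networks and common WGAN-GP hyperparameters, not values reported in this thread.)

```python
import torch
import torch.nn as nn

# Placeholder networks; in practice these are the WGAN-GP critic and generator.
critic = nn.Linear(128, 1)
generator = nn.Linear(64, 128)

# AMSGrad (Reddi et al., "On the Convergence of Adam and Beyond") via the amsgrad flag.
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-4,
                              betas=(0.0, 0.9), amsgrad=True)
opt_gen = torch.optim.Adam(generator.parameters(), lr=1e-4,
                           betas=(0.0, 0.9), amsgrad=True)
```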
Hello,
I've been working for a while on a WGAN-GP with conditioning on cluster indexes (for biological data). I tried a lot of things, but with ACGAN, for instance, I was getting great results in terms of Wasserstein distance but the cluster indexes were all mixed up. My preliminary results with the projection are amazing. Thanks a lot for the idea (which actually sounds much more natural than ACGAN, and very reminiscent of conditional batch norm, for instance).
However, I have a few questions about the implementation and whether you have some insights to share. In a first (pretty successful) attempt, I used the embedding as the last layer of my WGAN critic. I notice that in your snresnet_small and snresnet_64 implementations, the output of the critic is the sum of the output of the embedding and the output of a more "classic" linear layer. Why keep the linear layer at all? Am I missing something? Also, in the snresnet implementation, you use the embedding in one of the middle layers. Any insight on why / when it's better to embed at a middle layer rather than the output layer?
(And by the way, even a very vanilla implementation of the embedding on the final layer of an MLP WGAN critic, without any spectral normalization, works amazingly well and better than anything else I had tried so far, regardless of whether I use conditional batch norm, conditional layer norm, or more vanilla concatenation of the conditions in the generator.)
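For concreteness, the kind of MLP critic with the class embedding on the final hidden layer described here could look like the sketch below (hypothetical PyTorch; layer sizes and names are placeholders, no spectral normalization, and the unconditional linear term can be dropped as discussed above):

```python
import torch.nn as nn

class MLPProjectionCritic(nn.Module):
    # Fully connected WGAN critic with projection conditioning on the last hidden layer.
    # Illustrative sketch; sizes and names are placeholders, no spectral normalization.
    def __init__(self, in_dim, n_classes, hidden=256, use_psi=True):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
        )
        self.psi = nn.Linear(hidden, 1) if use_psi else None  # unconditional term
        self.embed = nn.Embedding(n_classes, hidden)           # class projection term

    def forward(self, x, y):
        h = self.body(x)
        out = (self.embed(y) * h).sum(dim=1, keepdim=True)
        if self.psi is not None:
            out = out + self.psi(h)
        return out
```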