shuzhangcasia opened this issue 7 years ago
Yes, the trick states that you should train D on one mini-batch of only real samples and one mini-batch of only synthetic samples. Why this performs better, I do not know.
@spurra Thanks for the reply. In practice, do I need to train in this fashion: Train D(positive) -> Train G -> Train D(negative)? Or do I need to Train D(positive) -> Train D(negative) -> Train G?
@shuzhangcasia Train D(positive) -> Train D(negative) -> Train G makes more sense: you first train D completely, and then G can learn from D. I haven't seen the first ordering you mentioned, but that does not mean it would not work :)
I tried alternating D(positive) and D(negative) with G training, and the resulting GAN oscillated wildly. I got good results by training D(positive) and D(negative) each time before the G update.
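In case it helps, a minimal runnable TF2-style toy sketch of that schedule (the models, data, and sizes here are made-up stand-ins, not a recommended architecture):

```python
import tensorflow as tf

# Toy D and G just to make the update order concrete; replace with your own.
D = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"),
                         tf.keras.layers.Dense(1, activation="sigmoid")])
G = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"),
                         tf.keras.layers.Dense(2)])
opt_d = tf.keras.optimizers.Adam(2e-4)
opt_g = tf.keras.optimizers.Adam(2e-4)
bce = tf.keras.losses.BinaryCrossentropy()

def d_step(batch, labels):
    # One discriminator update on a single-source (all-real or all-fake) batch.
    with tf.GradientTape() as tape:
        loss = bce(labels, D(batch, training=True))
    grads = tape.gradient(loss, D.trainable_variables)
    opt_d.apply_gradients(zip(grads, D.trainable_variables))

for step in range(1000):
    real = tf.random.normal([32, 2])      # stand-in for a batch of real data
    z = tf.random.normal([32, 8])

    d_step(real, tf.ones([32, 1]))        # D(positive): real-only mini-batch
    d_step(G(z), tf.zeros([32, 1]))       # D(negative): fake-only mini-batch

    with tf.GradientTape() as tape:       # G update comes last
        g_loss = bce(tf.ones([32, 1]), D(G(z, training=True), training=True))
    grads = tape.gradient(g_loss, G.trainable_variables)
    opt_g.apply_gradients(zip(grads, G.trainable_variables))
```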
This trick is working for me. However, do you have any references or ideas on why putting real and fake examples in the same batch does not work? Thanks :D
@soumith Do you have any explanation as to why pooling samples is not recommended?
Batchnorm is a very tricky layer: after each forward pass through the discriminator D, the layer changes, namely its exponential-moving-average statistics accumulators get updated. Calling D(real) and then D(fake) therefore gives forward passes through slightly different networks. I suspect that by doing this, some extra information about the synthetic / real samples could be involuntarily leaked to the discriminator through batchnorm's statistics accumulators.
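A quick way to see the accumulator drift (a TF2 sketch; the layer size is arbitrary, and the shifted mean of the "fake" batch is a made-up stand-in for generator output):

```python
import tensorflow as tf

bn = tf.keras.layers.BatchNormalization()
bn.build((None, 4))                          # create the moving statistics

real = tf.random.normal([32, 4], mean=0.0)   # stand-in for real samples
fake = tf.random.normal([32, 4], mean=3.0)   # stand-in for generator output

print(bn.moving_mean.numpy())                # initial accumulators (zeros)
bn(real, training=True)                      # a pass on the real-only batch...
print(bn.moving_mean.numpy())                # ...has already shifted them
bn(fake, training=True)                      # the fake-only batch shifts them again
print(bn.moving_mean.numpy())
```

So D(real) followed by D(fake) really are passes through two slightly different networks, and the direction of the drift depends on which kind of batch went through last.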
I made a simple experiment in theano/lasagne: I used a simple 4-layer GAN to train a generator on scikit-learn's circles dataset, with 10 updates of the discriminator per update of the generator.
Without BN layers the networks train slowly, but in the end the generator wins. After introducing BN layers and feeding first the real samples D(X) and then the synthetic ones D(G(Z)), every experiment ended with the discriminator completely defeating the generator (and the generator's output was wildly unstable). Tuning the number of updates didn't solve the problem.
To remedy this, having observed the global effect of the batchnorm layer, I pooled the real and fake samples (lasagne's ConcatLayer along the batch axis), fed the joint batch through the discriminator, and then split D's output accordingly. This resulted in both a speed-up in training and a winning generator. The wiring looked roughly like the sketch below.
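(A lasagne sketch with toy layer sizes, not the exact 4-layer networks from the experiment; the batch_norm wrapper is just to show where BN would sit:)

```python
import lasagne
from lasagne.layers import InputLayer, DenseLayer, ConcatLayer, SliceLayer

batch_size = 64  # assumed fixed, so the joint batch can be split evenly

# Toy stand-ins for the real input and the generator's output layer.
real_in = InputLayer((batch_size, 2))
fake_in = InputLayer((batch_size, 2))

joint = ConcatLayer([real_in, fake_in], axis=0)  # pool along the batch axis
hidden = lasagne.layers.batch_norm(
    DenseLayer(joint, 64, nonlinearity=lasagne.nonlinearities.rectify))
d_out = DenseLayer(hidden, 1, nonlinearity=lasagne.nonlinearities.sigmoid)

# Split D's output back into its real and fake halves.
pred_real = SliceLayer(d_out, indices=slice(0, batch_size), axis=0)
pred_fake = SliceLayer(d_out, indices=slice(batch_size, 2 * batch_size), axis=0)
```

This way any batchnorm layer in the discriminator body sees one joint batch instead of two single-source ones.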
I wonder how one would implement this trick in code, e.g. in TensorFlow. With a loss like this
disc_loss = -tf.reduce_mean(tf.log(disc_corpus_prediction) + tf.log(1 - disc_from_gen_prediction))
it is not obvious how to split this loss function into its parts. Does anyone have a small example of how to do this?
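One way to do it, following the pooling idea above. This is a TF1-style sketch: the one-layer discriminator and generator are toy stand-ins for real models, and it assumes equal numbers of real and fake samples so the joint batch splits in half:

```python
import tensorflow as tf

def discriminator(x):
    # Toy one-layer D returning sigmoid probabilities; AUTO_REUSE lets it
    # be called on any batch while sharing one set of variables.
    with tf.variable_scope("disc", reuse=tf.AUTO_REUSE):
        return tf.layers.dense(x, 1, activation=tf.sigmoid)

real_images = tf.placeholder(tf.float32, [None, 784])
noise = tf.placeholder(tf.float32, [None, 100])
fake_images = tf.layers.dense(noise, 784, name="gen")  # stand-in generator

# Pool real and fake along the batch axis, make one forward pass through D,
# then split the predictions back into their two halves.
joint = tf.concat([real_images, fake_images], axis=0)
joint_pred = discriminator(joint)
disc_corpus_prediction, disc_from_gen_prediction = tf.split(joint_pred, 2, axis=0)

# Same loss as above, now computed from a single joint forward pass.
disc_loss = -tf.reduce_mean(tf.log(disc_corpus_prediction)
                            + tf.log(1.0 - disc_from_gen_prediction))
```

The loss itself does not change; only the forward pass is shared, so any batchnorm statistics inside D are computed once over the joint batch.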
I think the reason this trick works is partly described in this paper, especially in Section 3.2.
My discriminator is unable to learn anything when I create two separate batches, even if I don't update the generator at all...
@vojavocni That is typical of a bad implementation. Check your code; your error is not in the loss.
Thanks for your very insightful tricks for training GANs.
But I have a problem understanding the first trick in item 4 (construct different mini-batches for real and fake, i.e. each mini-batch needs to contain only real images or only generated images).
Do you suggest that, instead of training D with 1:1 positive and negative examples in each mini-batch as done in DCGAN (https://github.com/carpedm20/DCGAN-tensorflow), we should train D with only positive or only negative examples in each mini-batch? Why should this work?