raspstephan closed this issue 3 years ago
@annavaughan I am just looking through your esr_gan notebook and I am a little confused by the Generator architecture. From what I understand, the same convolution (self.conv2), i.e. the same weights, is used for blocks 2-7. Isn't that kind of weird? Same for self.bn.
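For illustration, here is a minimal sketch of the two patterns (the `self.conv2`/`self.bn` names mirror the comment above, but the surrounding architecture details are made up):

```python
import torch
import torch.nn as nn


class SharedConvGenerator(nn.Module):
    """Reuses ONE conv and ONE batch norm for blocks 2-7,
    so all six blocks apply identical weights."""

    def __init__(self, ch=64):
        super().__init__()
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)  # single weight tensor
        self.bn = nn.BatchNorm2d(ch)

    def forward(self, x):
        for _ in range(6):  # blocks 2-7 all call the same layers
            x = torch.relu(self.bn(self.conv2(x)))
        return x


class IndependentConvGenerator(nn.Module):
    """The usual pattern: each block gets its own weights via nn.ModuleList."""

    def __init__(self, ch=64):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(ch, ch, 3, padding=1),
                nn.BatchNorm2d(ch),
                nn.ReLU(),
            )
            for _ in range(6)
        )

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x
```

The practical difference shows up in the parameter count: the shared version has six times fewer conv/BN parameters, which strongly restricts what the six "blocks" can learn.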
I am training our most advanced GAN on the first GPU VM. I created another GPU VM, nwp-downscale-gpu2, which should work (remember, you have to add your user again). You can find the IP on the Azure portal. JIT is also enabled, so you have to request access as usual. @HirtM @annavaughan
So, some first results from training the GAN with hinge loss etc. for a longer time (around 60 epochs) @HirtM
First the settings:
I also pretrained the generator using the MSE loss for two epochs.
One thing I noticed is that for most batches the discriminator loss is zero. This probably happens because of the ReLU in the hinge loss. I would like to try the regular Wasserstein loss as well to see if that helps. Otherwise it might make sense to train the generator more.
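For reference, the two discriminator losses side by side (a toy NumPy sketch, not the notebook code). The hinge loss clips both terms to exactly zero as soon as the discriminator's margins are satisfied, while the Wasserstein loss never saturates:

```python
import numpy as np


def d_hinge_loss(d_real, d_fake):
    """Discriminator hinge loss: relu(1 - D(real)) + relu(1 + D(fake)).
    Once D(real) > 1 and D(fake) < -1, both ReLUs clip to exactly 0,
    so the discriminator receives no gradient at all."""
    relu = lambda x: np.maximum(x, 0.0)
    return np.mean(relu(1.0 - d_real) + relu(1.0 + d_fake))


def d_wasserstein_loss(d_real, d_fake):
    """Wasserstein critic loss: no clipping, keeps shrinking as the
    critic separates real and fake further."""
    return np.mean(d_fake) - np.mean(d_real)


# A confident discriminator: hinge loss is exactly 0, Wasserstein is not
print(d_hinge_loss(np.array([2.5, 3.0]), np.array([-2.0, -4.0])))  # 0.0
```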
@HirtM I also made some minor changes in the 08 notebook like adding a plotting function that can be called after each epoch and saving the disc and gen. It's currently on the stephans_gan branch. There is a merge conflict which I didn't have time to fix right now.
I tried training with the Wasserstein loss in the Experiments/01-WGAN notebook on my gan branch.
I think with the Wasserstein loss the absolute value of the discriminator loss should get smaller, which it doesn't. The results do show that some learning is going on, though.
One thing I just realized though is that I have spectral normalization AND the gradient penalty on. Hmm...
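For context: both spectral normalization and the gradient penalty are ways to enforce the (approximate) 1-Lipschitz constraint the Wasserstein loss assumes, which is why having both on at once is redundant. A minimal sketch of the WGAN-GP penalty term (the `critic` passed in is a placeholder, not our discriminator):

```python
import torch
import torch.nn as nn


def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """WGAN-GP: push the gradient norm of the critic towards 1,
    evaluated on random interpolations between real and fake samples."""
    eps = torch.rand(real.size(0), 1, 1, 1)  # per-sample mixing weight
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_interp = critic(interp)
    grads = torch.autograd.grad(
        outputs=d_interp,
        inputs=interp,
        grad_outputs=torch.ones_like(d_interp),
        create_graph=True,  # so the penalty itself is differentiable
    )[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```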
I have just looked at the "weird" structures in a bit more detail. I don't know if there is any valuable information, but I might as well document it: In my example at least, every fourth grid point seems to be strongly correlated:
It might be different for other cases, though.
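A quick way to quantify this (a hypothetical helper, not in the repo): correlate each pixel with its neighbour a fixed stride to the right and compare stride 4 against stride 1.

```python
import numpy as np


def stride_correlation(img, stride):
    """Correlation between each pixel and its neighbour `stride` grid
    points to the right. A strong peak at stride 4 relative to stride 1
    matches the every-fourth-grid-point pattern described above."""
    a = img[:, :-stride].ravel()
    b = img[:, stride:].ravel()
    return np.corrcoef(a, b)[0, 1]


# Synthetic field with a period-4 pattern plus noise, just to show the idea
rng = np.random.default_rng(0)
x = np.arange(64)
img = np.sin(2 * np.pi * x / 4)[None, :] + 0.1 * rng.standard_normal((64, 64))
```

On this synthetic example the stride-4 correlation is close to 1 while the stride-1 correlation is close to 0; running the same check on generator output would tell us how strong the artefact really is.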
Also, following https://medium.com/@hirotoschwert/introduction-to-deep-super-resolution-c052d84ce8cf (see e.g. Fig. 3), it sounds reasonable to me to try replacing the pixelshuffler, as you @raspstephan already suggested.
Here is another explanation for checkerboard patterns, related to mismatched strides and filter sizes in a deconvolution, but as far as I understand it, this should not apply to us: https://towardsdatascience.com/a-comprehensive-introduction-to-different-types-of-convolutions-in-deep-learning-669281e58215
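One commonly suggested alternative for checkerboard-prone upsampling is the "resize-convolution": nearest-neighbour upsampling followed by a plain conv, so every output pixel is produced by the same overlap pattern. A sketch of both upsampling blocks side by side (channel counts are arbitrary, not our actual architecture):

```python
import torch
import torch.nn as nn


class PixelShuffleUp(nn.Module):
    """Current style: conv to 4x channels, then rearrange channels
    into 2x2 spatial blocks via PixelShuffle."""

    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch * 4, 3, padding=1)
        self.shuffle = nn.PixelShuffle(2)

    def forward(self, x):
        return self.shuffle(self.conv(x))


class ResizeConvUp(nn.Module):
    """Alternative: nearest-neighbour upsample, then conv. The conv sees
    a uniformly upsampled grid, avoiding uneven kernel overlap."""

    def __init__(self, ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        return self.conv(self.up(x))
```

Both blocks double the spatial resolution and keep the channel count, so one should be swappable for the other without touching the rest of the generator.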
I implemented a bunch of changes to the network architecture to make it more similar to the Leinonen paper. Well, the GAN finally does something, see here: Of course, I have no idea why it "works" now, because there were too many changes at once. Here are the settings:
- I am using a log-transform. I have a gut feeling that this made a big difference.
- Different from Leinonen, I am using spectral norm in the generator AND the discriminator, and I am not using any L2 regularization in the generator.
- I am pretraining.
- I am using the Wasserstein loss with gradient penalty and 5 disc steps per gen step. I am also using the L1 loss for the generator, but it doesn't do anything, see below.
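On the log-transform: the exact transform isn't written down here, but a common choice for precipitation is a shifted log, which compresses the heavy right tail so rare extremes don't dominate the loss. A sketch with an assumed threshold `eps` (the actual value used in the notebook may differ):

```python
import numpy as np


def log_transform(precip, eps=0.01):
    """log(1 + x/eps): maps 0 -> 0, is monotone, and compresses the
    heavy right tail of precipitation values."""
    return np.log1p(precip / eps)


def inverse_log_transform(y, eps=0.01):
    """Exact inverse, for mapping network output back to mm units."""
    return np.expm1(y) * eps
```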
The generator loss fluctuates wildly. The disc loss does what it should do.
While the GAN produces something interesting, it's still far away from looking realistic. The fact that the generator loss is so wild and that the L1 loss has no impact doesn't seem right. So the question is: why is the generator loss so crazy? Why is the distribution of gen_preds_fake so different from disc_preds_fake? Shouldn't they be approximately the same?
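One thing worth checking on the gen_preds_fake vs disc_preds_fake mismatch: with 5 disc steps per gen step, the critic that scores the generator's fakes has moved 5 updates since the logged disc predictions were computed, so the two distributions need not match. A skeleton of that loop (all names hypothetical, optimizers and models are stand-ins):

```python
import torch
import torch.nn as nn


def train_step(gen, critic, g_opt, d_opt, real, z, n_critic=5):
    """One WGAN-style step: n_critic critic updates per generator update."""
    for _ in range(n_critic):
        d_opt.zero_grad()
        fake = gen(z).detach()  # don't backprop into G here
        d_loss = critic(fake).mean() - critic(real).mean()
        d_loss.backward()
        d_opt.step()  # critic changes 5x ...

    g_opt.zero_grad()
    g_loss = -critic(gen(z)).mean()  # ... before G is scored again
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```

Logging `critic(fake)` once right before the generator step (with the same critic state that produces the generator loss) would show whether the gap is just this staleness or something else.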
Also interesting: it takes around 3-4 epochs (2h) to see where we are going.
Two tests:
I am closing this issue and opening another one.
Some things to try out @HirtM: