raspstephan / nwp-downscale

MIT License

First GAN #29

Closed raspstephan closed 3 years ago

raspstephan commented 3 years ago

Some things to try out @HirtM:

raspstephan commented 3 years ago

@annavaughan I am just looking through your esr_gan notebook and I am a little confused by the Generator architecture:

image

From what I understand, this means that the same convolution (self.conv2), i.e. the same weights, is used for blocks 2-7. Isn't that kind of weird? Same for self.bn.
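For what it's worth, here is a minimal PyTorch sketch (hypothetical class names, not the notebook's actual code) of the difference between reusing one module across blocks and giving each block its own weights:

```python
import torch
import torch.nn as nn

class SharedConvNet(nn.Module):
    """Reusing one module: blocks 2-7 all apply the SAME weight tensor."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)  # one set of weights
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        x = self.conv1(x)
        for _ in range(6):  # "blocks" 2-7 reuse self.conv2 and self.bn
            x = torch.relu(self.bn(self.conv2(x)))
        return x

class SeparateConvNet(nn.Module):
    """Six independent conv+bn blocks, each with its own weights."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.BatchNorm2d(channels),
                          nn.ReLU())
            for _ in range(6)
        )

    def forward(self, x):
        x = self.conv1(x)
        for block in self.blocks:
            x = block(x)
        return x
```

The shared variant has roughly one sixth the block parameters, which is why reusing `self.conv2` everywhere would be an unusual (though not impossible) design choice for an SR generator.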

raspstephan commented 3 years ago

I am training our most advanced GAN on the first GPU VM. I created another GPU VM nwp-downscale-gpu2, which should work (remember, you have to add your user again). You can find the IP on the Azure portal. JIT is also enabled, so you have to request access as usual. @HirtM @annavaughan

raspstephan commented 3 years ago

So, some first results from training the GAN with hinge loss etc. for a longer time (around 60 epochs) @HirtM

First the settings:

image image

I also pretrained the generator using the MSE loss for two epochs.

image image image

One thing I noticed is that for most batches the discriminator loss is zero. This probably happens because of the ReLU in the Hinge loss. I would like to try the regular Wasserstein loss as well to see if that would help. Otherwise it might make sense to train the generator more.
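To illustrate why the ReLU can zero out the discriminator loss: here is a minimal sketch of the standard hinge loss for the discriminator (illustrative code, not the notebook's):

```python
import torch
import torch.nn.functional as F

def disc_hinge_loss(disc_real, disc_fake):
    """Standard hinge loss for the discriminator (a sketch, not the notebook code).

    relu(1 - D(real)) is exactly zero whenever D(real) >= 1, and
    relu(1 + D(fake)) is exactly zero whenever D(fake) <= -1, so once the
    discriminator confidently separates real from fake, the loss (and its
    gradient) collapses to zero for the whole batch.
    """
    return F.relu(1.0 - disc_real).mean() + F.relu(1.0 + disc_fake).mean()

# A confident discriminator: all scores outside the [-1, 1] margin.
real_scores = torch.tensor([2.5, 3.0, 1.8])
fake_scores = torch.tensor([-2.0, -4.1, -1.3])
print(disc_hinge_loss(real_scores, fake_scores).item())  # 0.0
```

So a batch-wise zero disc loss just means the discriminator scores every sample outside the margin; the Wasserstein loss has no such clipping, which is one reason to try it.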

raspstephan commented 3 years ago

@HirtM I also made some minor changes in the 08 notebook like adding a plotting function that can be called after each epoch and saving the disc and gen. It's currently on the stephans_gan branch. There is a merge conflict which I didn't have time to fix right now.

raspstephan commented 3 years ago

I tried training with the Wasserstein loss in the Experiments/01-WGAN notebook on my gan branch:

image

I think with the Wasserstein loss the absolute value of the discriminator loss should get smaller over training, which it doesn't. The results do show that there is some learning going on, though.

image image

One thing I just realized though is that I have spectral normalization AND the gradient penalty on. Hmm...
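For reference, spectral norm and the gradient penalty are two alternative ways of enforcing the same Lipschitz constraint on the critic, which is why running both at once is unusual. A minimal sketch of the WGAN-GP penalty (Gulrajani et al. 2017; illustrative code, not the notebook's):

```python
import torch

def gradient_penalty(disc, real, fake, lambda_gp=10.0):
    """WGAN-GP penalty -- a sketch, not the notebook code.

    Penalizes the critic's gradient norm at points interpolated between
    real and fake samples, pushing it toward 1 (the same 1-Lipschitz
    constraint that spectral normalization enforces architecturally).
    """
    eps = torch.rand(real.size(0), 1, 1, 1)          # per-sample mix weight
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = disc(interp)
    grads, = torch.autograd.grad(scores.sum(), interp, create_graph=True)
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```

Using only one of the two (GP as in Leinonen, or spectral norm everywhere) would make the experiments easier to compare.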

HirtM commented 3 years ago

I have just looked at the "weird" structures in a bit more detail. I don't know if there is any valuable information, but I might as well document it: in my example at least, every fourth grid point seems to be strongly correlated:

image image

It might be different for other cases though.

Also, following https://medium.com/@hirotoschwert/introduction-to-deep-super-resolution-c052d84ce8cf (see e.g. Fig. 3), it sounds reasonable to me to try replacing the pixel shuffler, as you @raspstephan already suggested.

Here is another explanation for checkerboard patterns, related to mismatched strides and filter sizes in a deconvolution, but as far as I understand it, this should not apply to us: https://towardsdatascience.com/a-comprehensive-introduction-to-different-types-of-convolutions-in-deep-learning-669281e58215
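The suggested swap could look something like this minimal PyTorch sketch (hypothetical function names, not the notebook's): a sub-pixel (PixelShuffle) upsampling block versus a resize-convolution block, which is the standard fix for checkerboard artifacts:

```python
import torch
import torch.nn as nn

def pixelshuffle_up(channels, scale=2):
    """Sub-pixel upsampling: conv to scale^2 x channels, then rearrange."""
    return nn.Sequential(
        nn.Conv2d(channels, channels * scale ** 2, 3, padding=1),
        nn.PixelShuffle(scale),
    )

def resize_conv_up(channels, scale=2):
    """Resize-convolution: upsample first, then convolve.

    Every output pixel sees an evenly overlapping input window, which
    avoids the uneven-overlap pattern that causes checkerboards.
    """
    return nn.Sequential(
        nn.Upsample(scale_factor=scale, mode="nearest"),
        nn.Conv2d(channels, channels, 3, padding=1),
    )
```

Both blocks double the spatial resolution while keeping the channel count, so they should be drop-in replacements for each other in the generator.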

raspstephan commented 3 years ago

I implemented a bunch of changes to the network architecture to make it more similar to the Leinonen paper. Well, the GAN finally does something, see here:

image

Now, of course, I have no idea why it "works" now because there were too many changes at once. Here are the settings:

I am using a log-transform. I have a gut feeling that that made a big difference. image
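For reference, a minimal sketch of one common precipitation log transform (the exact variant used in the notebook may differ):

```python
import numpy as np

# log1p compresses the heavy right tail of precipitation, so rare intense
# rain doesn't dominate the loss; expm1 inverts it exactly, letting us
# evaluate in physical units (e.g. mm/h) again afterwards.
def to_log(precip):
    return np.log1p(precip)   # log(1 + x), well-defined at exactly 0 mm

def from_log(x):
    return np.expm1(x)        # exact inverse of log1p

rain = np.array([0.0, 0.1, 1.0, 10.0, 100.0])
assert np.allclose(from_log(to_log(rain)), rain)
```

This would explain the gut feeling: without the transform, the few heavy-rain pixels dominate both the content loss and what the discriminator keys on.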

Different from Leinonen, I am using spectral norm in the generator AND the discriminator, and I am not using any L2 regularization in the generator. image

I am pretraining. image

Using the Wasserstein loss with gradient penalty and 5 disc steps per gen step. I am also using the L1 loss for the generator, but it doesn't do anything, see below. image

The generator loss fluctuates wildly. The disc loss does what it should do. image image image

While the GAN produces something interesting, it's still far from looking realistic. The fact that the generator loss is so wild and that the L1 loss has no impact doesn't seem right. So the questions are: why is the generator loss so crazy? Why is the distribution of gen_preds_fake so different from disc_preds_fake? Shouldn't they be approximately the same?

Also interesting: it takes around 3-4 epochs (2h) to see where we are going.
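The setup described above (5 critic steps per generator step, plus an L1 term on the generator) roughly corresponds to a loop like this toy PyTorch sketch. Tiny linear stand-in models and illustrative names only, and the gradient-penalty term is omitted for brevity:

```python
import torch
import torch.nn as nn

gen = nn.Linear(8, 8)    # toy stand-ins for the real networks
disc = nn.Linear(8, 1)
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
n_disc, l1_weight = 5, 10.0

real = torch.randn(16, 8)                     # one "batch" for the sketch
for _ in range(n_disc):                       # 1) several critic updates
    fake = gen(torch.randn(16, 8)).detach()   # detach: don't update gen here
    d_loss = disc(fake).mean() - disc(real).mean()  # Wasserstein estimate
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

fake = gen(torch.randn(16, 8))                # 2) one generator update
g_loss = -disc(fake).mean() + l1_weight * (fake - real).abs().mean()
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

One thing worth checking against this: since the critic moves 5 steps between generator updates, the critic's scores on fakes at gen-update time can differ substantially from the scores logged during the disc steps, which could be part of why gen_preds_fake and disc_preds_fake look so different.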

raspstephan commented 3 years ago

Two tests:

  1. No gradient penalty (because I thought using spectral norm in both G and D already does enough). Doesn't work at all. Losses explode. Images not good.

image image

  2. Using the hinge loss. I mean, not catastrophic, but also not good. image image

raspstephan commented 3 years ago

I am closing this issue and opening another one.